fhcrc / taxtastic

Create and maintain phylogenetic "reference packages" of biological sequences.
GNU General Public License v3.0

improve `new_database` subcommand performance when targeting postgresql db #139

Closed: dhoogest closed this issue 1 year ago

dhoogest commented 4 years ago

The new_database subcommand works great for sqlite dbs; when a postgres db is the target, however, insertion of the large names/nodes tables takes roughly an order of magnitude longer, most likely due to sqlalchemy+psycopg2 executemany nuances. Here's the timing on my fairly wimpy desktop (incl. download):

sqlite

taxit -vvv new_database  208.82s user 27.36s system 38% cpu 10:14.96 total

postgres

 taxit -vvv new_database postgresql://dhoogest@127.0.0.1/ncbi_taxonomy_0_9_0  383.53s user 184.98s system 9% cpu 1:35:59.95 total

So, ~10 mins vs ~95 mins. Info on how sqlalchemy+psycopg2 handle executemany might be relevant for optimizing this.
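For instance, sqlalchemy 1.3.7+ exposes psycopg2's fast-executemany helpers as engine options; a minimal sketch (untested here) reusing the postgres URL from the timing above:

```python
# Sketch only: enable psycopg2's execute_values() batching in sqlalchemy 1.3.7+.
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://dhoogest@127.0.0.1/ncbi_taxonomy_0_9_0",
    executemany_mode="values",            # batch INSERTs via psycopg2 execute_values()
    executemany_values_page_size=10000,   # rows folded into each multi-VALUES statement
)
```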

dhoogest commented 4 years ago

@nhoffman if you get a shot, mind summarizing your comments on this issue from our discussion a couple weeks ago? If I recall, it had something to do with the way postgres handles indices during insert (not the executemany hypothesis proposed above). I think the proposal was to drop the indices on names/nodes, insert, then add the indices back.
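Roughly something like this, as a hypothetical sketch (the index and column names are illustrative, not taxtastic's actual schema, and it assumes the names table already exists):

```python
# Hypothetical drop/insert/recreate sketch; schema object names are made up.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://dhoogest@127.0.0.1/ncbi_taxonomy_0_9_0")
rows = [{"tax_id": "1", "tax_name": "root"}, {"tax_id": "2", "tax_name": "Bacteria"}]

with engine.begin() as conn:
    # 1. drop the unique index so each insert skips per-row index maintenance
    conn.execute(text("DROP INDEX IF EXISTS ix_names_tax_id_tax_name"))
    # 2. bulk insert while no index is in place
    conn.execute(
        text("INSERT INTO names (tax_id, tax_name) VALUES (:tax_id, :tax_name)"),
        rows,
    )
    # 3. rebuild the index in one pass, which postgres does much faster
    conn.execute(text(
        "CREATE UNIQUE INDEX ix_names_tax_id_tax_name ON names (tax_id, tax_name)"
    ))
```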

crosenth commented 4 years ago

Postgres unique index inserts are slow. I will look into a better strategy to add indexing after inserting rows.

crosenth commented 4 years ago

Starting with the sqlalchemy performance FAQ:

https://docs.sqlalchemy.org/en/13/faq/performance.html

I looked at possible bottlenecks in Sqlalchemy Core vs ORM, in Postgres, and in networking.

It looks like we are already doing Core-level inserts, which are as fast as Sqlalchemy gets:

https://github.com/fhcrc/taxtastic/blob/master/taxtastic/ncbi.py#L378
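For context, a Core-level executemany insert looks roughly like this (an illustrative sketch with a made-up table, not the actual ncbi.py code):

```python
# Illustrative Core-style bulk insert: passing a list of dicts takes the
# DBAPI executemany() path instead of one round trip per row.
from sqlalchemy import Column, MetaData, String, Table, create_engine

engine = create_engine("sqlite:///:memory:")
metadata = MetaData()
nodes = Table(
    "nodes", metadata,
    Column("tax_id", String, primary_key=True),
    Column("rank", String),
)
metadata.create_all(engine)

rows = [{"tax_id": "1", "rank": "no rank"}, {"tax_id": "2", "rank": "superkingdom"}]
with engine.begin() as conn:
    conn.execute(nodes.insert(), rows)
```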

Postgres has known performance issues inserting rows into tables with unique indexes:

https://blog.timescale.com/blog/13-tips-to-improve-postgresql-insert-performance/

But Sqlalchemy requires a primary/unique key for every table defined as part of a schema. The only option here would be to create our tables outside of the Sqlalchemy schema.
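A rough sketch of that alternative, using raw DDL so postgres never has to maintain a primary/unique key during the load (the staging table name is hypothetical):

```python
# Hypothetical key-free staging table created outside the sqlalchemy schema.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://dhoogest@127.0.0.1/ncbi_taxonomy_0_9_0")
with engine.begin() as conn:
    # plain DDL, no Table/metadata object, so no primary key is required
    conn.execute(text("CREATE TABLE names_staging (tax_id text, tax_name text)"))
    conn.execute(
        text("INSERT INTO names_staging (tax_id, tax_name) "
             "VALUES (:tax_id, :tax_name)"),
        [{"tax_id": "1", "tax_name": "root"}, {"tax_id": "2", "tax_name": "Bacteria"}],
    )
```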

Lastly, I tested for possible networking issues by running taxit new_database against a local Postgres instance; it finished in only 10 mins:

[screenshot: timing output, 2020-08-17 21:38]

The performance increase here is substantial.

As a side note, I tried the Psycopg2 Postgres adapter directly and got the same performance:

[screenshot: timing output, 2020-08-17 22:16]
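The direct adapter path looks roughly like this (a sketch; connection parameters, table, and rows are placeholders):

```python
# Sketch of bulk insert through psycopg2 directly, bypassing sqlalchemy.
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=ncbi_taxonomy_0_9_0 user=dhoogest host=127.0.0.1")
rows = [("1", "root"), ("2", "Bacteria")]
with conn, conn.cursor() as cur:
    # execute_values() folds many rows into one multi-VALUES INSERT statement
    execute_values(cur, "INSERT INTO names (tax_id, tax_name) VALUES %s", rows)
conn.close()
```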

In conclusion, I believe the best path forward is addressing the networking bottleneck. A second option could be using the sql profiler to optimize the data transfer in our insert queries.
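The FAQ linked above shows how to time individual statements with engine events; a minimal sketch of that profiling approach:

```python
# Minimal per-statement timing via sqlalchemy engine events (per the FAQ recipe).
import time

from sqlalchemy import create_engine, event

engine = create_engine("postgresql://dhoogest@127.0.0.1/ncbi_taxonomy_0_9_0")

@event.listens_for(engine, "before_cursor_execute")
def before_cursor_execute(conn, cursor, statement, parameters, context, executemany):
    conn.info.setdefault("query_start_time", []).append(time.time())

@event.listens_for(engine, "after_cursor_execute")
def after_cursor_execute(conn, cursor, statement, parameters, context, executemany):
    total = time.time() - conn.info["query_start_time"].pop(-1)
    print("%.3fs  %s" % (total, statement[:60]))
```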

crosenth commented 4 years ago

Adding @nhoffman

crosenth commented 4 years ago

@dhoogest - Strange that my local Postgres experiment took only 9 mins while yours took 90.

crosenth commented 4 years ago

Also my local sqlite experiments took between one and three minutes on my laptop and Unicorn:

[screenshot: timing output, 2020-08-18 09:23]

[screenshot: timing output, 2020-08-18 09:24]

crosenth commented 4 years ago

And lastly, here is my Unicorn Postgres experiment to db3:

[screenshot: timing output, 2020-08-18 08:34]

nhoffman commented 1 year ago

I think we've addressed this to the best of our ability in v0.10.0 - feel free to reopen if you have specific ideas for further improvements.