fhcrc / taxtastic

Create and maintain phylogenetic "reference packages" of biological sequences.
GNU General Public License v3.0

improve `new_database` subcommand performance when targeting postgresql db #139

Closed: dhoogest closed this issue 1 year ago

dhoogest commented 4 years ago

The new_database subcommand works great for sqlite dbs; when a postgres db is the target, however, insertion of the large names/nodes tables takes roughly an order of magnitude longer, most likely due to sqlalchemy+psycopg2 executemany nuances. Here's the timing on my fairly wimpy desktop (incl. download):

sqlite

taxit -vvv new_database  208.82s user 27.36s system 38% cpu 10:14.96 total

postgres

 taxit -vvv new_database postgresql://dhoogest@127.0.0.1/ncbi_taxonomy_0_9_0  383.53s user 184.98s system 9% cpu 1:35:59.95 total

So, ~10 mins vs ~95 mins. Info on how sqlalchemy+psycopg2 handle executemany might be relevant for optimizing this.
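For instance, sqlalchemy 1.3.7+ exposes psycopg2's fast-executemany helpers as engine options; a minimal sketch (untested here) reusing the postgres URL from the timing above:

```python
# Sketch only: enable psycopg2's execute_values() batching in sqlalchemy 1.3.7+.
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://dhoogest@127.0.0.1/ncbi_taxonomy_0_9_0",
    executemany_mode="values",            # batch INSERTs via psycopg2 execute_values()
    executemany_values_page_size=10000,   # rows folded into each multi-VALUES statement
)
```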

dhoogest commented 4 years ago

@nhoffman if you get a shot, mind summarizing your comments on this issue from our discussion a couple weeks ago? If I recall, it had something to do with the way postgres handles indices during insert (not the executemany hypothesis proposed above). I think the proposal was to drop the indices on names/nodes, insert, then add the indices back.
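Roughly something like this, as a hypothetical sketch (the index and column names are illustrative, not taxtastic's actual schema, and it assumes the names table already exists):

```python
# Hypothetical drop/insert/recreate sketch; schema object names are made up.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://dhoogest@127.0.0.1/ncbi_taxonomy_0_9_0")
rows = [{"tax_id": "1", "tax_name": "root"}, {"tax_id": "2", "tax_name": "Bacteria"}]

with engine.begin() as conn:
    # 1. drop the unique index so each insert skips per-row index maintenance
    conn.execute(text("DROP INDEX IF EXISTS ix_names_tax_id_tax_name"))
    # 2. bulk insert while no index is in place
    conn.execute(
        text("INSERT INTO names (tax_id, tax_name) VALUES (:tax_id, :tax_name)"),
        rows,
    )
    # 3. rebuild the index in one pass, which postgres does much faster
    conn.execute(text(
        "CREATE UNIQUE INDEX ix_names_tax_id_tax_name ON names (tax_id, tax_name)"
    ))
```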

crosenth commented 4 years ago

Postgres unique index inserts are slow. I will look into a better strategy to add indexing after inserting rows.

crosenth commented 4 years ago

Starting with the sqlalchemy performance FAQ:

https://docs.sqlalchemy.org/en/13/faq/performance.html

I looked at possible bottlenecks in Sqlalchemy Core vs ORM, in Postgres, and in networking.

It looks like we are already doing Core-level inserts, which are as fast as Sqlalchemy gets:

https://github.com/fhcrc/taxtastic/blob/master/taxtastic/ncbi.py#L378
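For context, a Core-level executemany insert looks roughly like this (an illustrative sketch with a made-up table, not the actual ncbi.py code):

```python
# Illustrative Core-style bulk insert: passing a list of dicts takes the
# DBAPI executemany() path instead of one round trip per row.
from sqlalchemy import Column, MetaData, String, Table, create_engine

engine = create_engine("sqlite:///:memory:")
metadata = MetaData()
nodes = Table(
    "nodes", metadata,
    Column("tax_id", String, primary_key=True),
    Column("rank", String),
)
metadata.create_all(engine)

rows = [{"tax_id": "1", "rank": "no rank"}, {"tax_id": "2", "rank": "superkingdom"}]
with engine.begin() as conn:
    conn.execute(nodes.insert(), rows)
```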

Postgres has known performance issues inserting rows into tables with unique indexes:

https://blog.timescale.com/blog/13-tips-to-improve-postgresql-insert-performance/

But Sqlalchemy requires a primary/unique key for every table defined as part of a schema. The only option here would be to create our tables outside of the Sqlalchemy schema.
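A rough sketch of that alternative, using raw DDL so postgres never has to maintain a primary/unique key during the load (the staging table name is hypothetical):

```python
# Hypothetical key-free staging table created outside the sqlalchemy schema.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://dhoogest@127.0.0.1/ncbi_taxonomy_0_9_0")
with engine.begin() as conn:
    # plain DDL, no Table/metadata object, so no primary key is required
    conn.execute(text("CREATE TABLE names_staging (tax_id text, tax_name text)"))
    conn.execute(
        text("INSERT INTO names_staging (tax_id, tax_name) "
             "VALUES (:tax_id, :tax_name)"),
        [{"tax_id": "1", "tax_name": "root"}, {"tax_id": "2", "tax_name": "Bacteria"}],
    )
```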

Lastly, I tested for possible networking issues by running taxit new_database against a local Postgres instance; it finished in only 10 mins:

[screenshot: timing output, 2020-08-17 21:38]

The performance increase here is substantial.

As a side note, I tried the Psycopg2 Postgres adapter directly and got the same performance:

[screenshot: timing output, 2020-08-17 22:16]
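The direct adapter path looks roughly like this (a sketch; connection parameters, table, and rows are placeholders):

```python
# Sketch of bulk insert through psycopg2 directly, bypassing sqlalchemy.
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=ncbi_taxonomy_0_9_0 user=dhoogest host=127.0.0.1")
rows = [("1", "root"), ("2", "Bacteria")]
with conn, conn.cursor() as cur:
    # execute_values() folds many rows into one multi-VALUES INSERT statement
    execute_values(cur, "INSERT INTO names (tax_id, tax_name) VALUES %s", rows)
conn.close()
```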

In conclusion, I believe the best path forward is addressing the networking bottleneck. A second option could be using the sql profiler to optimize the data transfer in our insert queries.
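The FAQ linked above shows how to time individual statements with engine events; a minimal sketch of that profiling approach:

```python
# Minimal per-statement timing via sqlalchemy engine events (per the FAQ recipe).
import time

from sqlalchemy import create_engine, event

engine = create_engine("postgresql://dhoogest@127.0.0.1/ncbi_taxonomy_0_9_0")

@event.listens_for(engine, "before_cursor_execute")
def before_cursor_execute(conn, cursor, statement, parameters, context, executemany):
    conn.info.setdefault("query_start_time", []).append(time.time())

@event.listens_for(engine, "after_cursor_execute")
def after_cursor_execute(conn, cursor, statement, parameters, context, executemany):
    total = time.time() - conn.info["query_start_time"].pop(-1)
    print("%.3fs  %s" % (total, statement[:60]))
```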

crosenth commented 4 years ago

Adding @nhoffman

crosenth commented 4 years ago

@dhoogest - Strange that my local Postgres experiment took only 9 mins while yours took 90.

crosenth commented 4 years ago

Also my local sqlite experiments took between one and three minutes on my laptop and Unicorn:

[screenshot: timing output, 2020-08-18 09:23]

[screenshot: timing output, 2020-08-18 09:24]

crosenth commented 4 years ago

And lastly, here is my Unicorn Postgres experiment to db3:

[screenshot: timing output, 2020-08-18 08:34]

nhoffman commented 1 year ago

I think we've addressed this to the best of our ability in v0.10.0 - feel free to reopen if you have specific ideas for further improvements.