@nhoffman if you get a chance, mind summarizing your comments on this issue from when we discussed it a couple weeks ago? If I recall, it had something to do with the way postgres handles indices during insert (not related to the `executemany` hypothesis proposed in the issue description). I think the proposal was to drop the indices on names/nodes, insert, then add the indices back.
Postgres inserts into tables with unique indexes are slow. I will look into a better strategy: adding the indexes after inserting the rows.
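A minimal sketch of that idea, assuming a SQLAlchemy engine; the index and column names here are illustrative, not the actual taxtastic schema:

```python
from sqlalchemy import create_engine, text

# Placeholder URL; index/column names below are illustrative only.
engine = create_engine("postgresql://user:pass@localhost/taxonomy")

with engine.begin() as conn:
    # Drop the unique index so Postgres doesn't maintain it per inserted row.
    conn.execute(text("DROP INDEX IF EXISTS names_tax_id_idx"))

# ... bulk insert the names/nodes rows here ...

with engine.begin() as conn:
    # Rebuild the index once, in a single pass over the fully loaded table.
    conn.execute(text(
        "CREATE UNIQUE INDEX names_tax_id_idx "
        "ON names (tax_id, tax_name, name_class)"
    ))
```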
Starting with the SQLAlchemy performance FAQ:
https://docs.sqlalchemy.org/en/13/faq/performance.html
I looked at possible bottlenecks in SQLAlchemy Core vs. ORM, Postgres itself, and networking.
It looks like we are already doing Core-level inserts, which are about as fast as SQLAlchemy gets:
https://github.com/fhcrc/taxtastic/blob/master/taxtastic/ncbi.py#L378
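For reference, the Core-level pattern in question looks roughly like this (engine URL, table reflection, and row values are illustrative):

```python
from sqlalchemy import MetaData, Table, create_engine

engine = create_engine("postgresql://user:pass@localhost/taxonomy")  # placeholder
meta = MetaData()
names = Table("names", meta, autoload=True, autoload_with=engine)

rows = [
    {"tax_id": "1", "tax_name": "root", "name_class": "scientific name"},
    {"tax_id": "2", "tax_name": "Bacteria", "name_class": "scientific name"},
]

with engine.begin() as conn:
    # A single Core insert() with a list of dicts goes through the
    # DBAPI's executemany() -- one statement, many parameter sets.
    conn.execute(names.insert(), rows)
```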
Postgres has known performance issues inserting rows into columns with unique indexes:
https://blog.timescale.com/blog/13-tips-to-improve-postgresql-insert-performance/
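A commonly recommended mitigation for bulk loads is to skip INSERT entirely and stream rows through Postgres's COPY. A sketch with psycopg2 (DSN, table, and columns are placeholders; note unique indexes are still maintained during COPY):

```python
import io

import psycopg2

conn = psycopg2.connect("dbname=taxonomy")  # placeholder DSN

# COPY ingests a tab-separated stream in bulk, avoiding per-row
# INSERT statement overhead.
buf = io.StringIO(
    "1\troot\tscientific name\n"
    "2\tBacteria\tscientific name\n"
)
with conn:
    with conn.cursor() as cur:
        cur.copy_expert(
            "COPY names (tax_id, tax_name, name_class) FROM STDIN",
            buf,
        )
```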
But SQLAlchemy requires a primary/unique key for every table defined as part of a schema. The only option here would be to create our tables outside of the SQLAlchemy schema, as sketched below.
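If we went that route, it might look like plain DDL with no key, adding the constraint only after the load (the column set here is illustrative, not the real nodes schema):

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@localhost/taxonomy")  # placeholder

with engine.begin() as conn:
    # Plain DDL with no primary/unique key, so the bulk load pays
    # no index-maintenance cost.
    conn.execute(text("""
        CREATE TABLE nodes (
            tax_id    text,
            parent_id text,
            rank      text
        )
    """))

# ... bulk insert here ...

with engine.begin() as conn:
    # Add the constraint once, after the data is in place.
    conn.execute(text("ALTER TABLE nodes ADD PRIMARY KEY (tax_id)"))
```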
Lastly, I tested possible networking issues by running `taxit new_database` against a local Postgres instance; it finished in only 10 mins. The performance increase here is substantial.
As a side note, I tried the psycopg2 Postgres adapter and got the same performance.
In conclusion, I believe the best path forward is addressing the networking bottleneck. A second path could be using a SQL profiler to optimize the data transfer in our insert queries.
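For that second path, the FAQ linked above includes a recipe for timing every statement with engine events; roughly:

```python
import logging
import time

from sqlalchemy import event
from sqlalchemy.engine import Engine

logging.basicConfig()
logger = logging.getLogger("sqltime")
logger.setLevel(logging.DEBUG)

@event.listens_for(Engine, "before_cursor_execute")
def before_cursor_execute(conn, cursor, statement, parameters, context, executemany):
    conn.info.setdefault("query_start_time", []).append(time.time())

@event.listens_for(Engine, "after_cursor_execute")
def after_cursor_execute(conn, cursor, statement, parameters, context, executemany):
    total = time.time() - conn.info["query_start_time"].pop(-1)
    # executemany=True flags the batched insert statements we care about.
    logger.debug("%.4fs (executemany=%s): %s", total, executemany, statement)
```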
Adding @nhoffman
@dhoogest - Strange that my local Postgres experiment took only 9 mins while yours took 90.
Also, my local sqlite experiments took between one and three minutes on my laptop and on Unicorn.
And lastly, I ran my Unicorn Postgres experiment against db3.
I think we've addressed this to the best of our ability in v0.10.0 - feel free to reopen if you have specific ideas for further improvements.
The `new_database` subcommand works great for sqlite dbs; however, when a postgres db is the target, insertion of the large names/nodes tables takes an order of magnitude longer, most likely due to sqlalchemy + psycopg2 `executemany` nuances. Here's the timing on my fairly wimpy desktop (incl. download): ~10 mins for sqlite vs ~95 mins for postgres. The following info might be relevant for optimizing `executemany`:
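One knob that targets this directly, untested here and assuming SQLAlchemy >= 1.3: the psycopg2 dialect's `executemany_mode` option, which routes batched inserts through psycopg2's `execute_values()`/`execute_batch()` helpers instead of the default one-round-trip-per-batch `executemany()`:

```python
from sqlalchemy import create_engine

# Placeholder URL; executemany_mode is a psycopg2-dialect option
# added in SQLAlchemy 1.3.
engine = create_engine(
    "postgresql://user:pass@db3/taxonomy",
    executemany_mode="values",           # use psycopg2 execute_values()
    executemany_values_page_size=10000,  # rows per generated statement
)
```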