fhcrc / taxtastic

Create and maintain phylogenetic "reference packages" of biological sequences.
GNU General Public License v3.0
21 stars 10 forks source link

Improve efficiency of postgres taxonomy database creation #152

Closed nhoffman closed 1 year ago

nhoffman commented 1 year ago

Creation of an sqlite database completes in just a few minutes; I suspect that the bottleneck for the postgres load is the foreign key indexes. See if we can create the indexes after the initial data load.

nhoffman commented 1 year ago

version 1.10.0 will contain the following optimizations:

After all of this, some experiments on my laptop (2022 M1 pro)

sqlite

taxit -v new_database -z taxdmp.zip  57.15s user 3.09s system 93% cpu 1:04.41 total

indexes, psycopg2

taxit -v new_database postgresql://$(whoami)@localhost:5432/taxonomy -z   74.32s user 68.86s system 17% cpu 13:55.58 total

indexes, psycopg3

taxit -v new_database postgresql+psycopg://$(whoami)@localhost:5432/taxonomy   98.06s user 36.20s system 21% cpu 10:26.93 total

no indexes, psycopg2

taxit -v new_database postgresql://$(whoami)@localhost:5432/taxonomy -z   74.96s user 61.38s system 25% cpu 8:44.74 total

no indexes, psycopg3

taxit -v new_database postgresql+psycopg://$(whoami)@localhost:5432/taxonomy   77.77s user 58.15s system 38% cpu 5:49.32 total

So about a 2.4x speedup by dropping indexes and using psycopg3, but still considerably slower then sqlite.

nhoffman commented 1 year ago

I think we've addressed this to the best of our ability in v0.10.0