Ecogenomics / GTDBNCBI

The GTDB provides the software infrastructure for working with a large collection of genomic resources. The major goal of this initiative is to provide a phylogenetically consistent taxonomy for archaea and bacteria.
https://gtdb.ecogenomic.org/
GNU General Public License v3.0

Calculating and storing metadata is time consuming #29

Open donovan-h-parks opened 8 years ago

donovan-h-parks commented 8 years ago

It can take several hours to calculate and store metadata for large numbers of genomes. This may be due to the size of the database transaction. Can this be improved? Is it simple enough to do this in parallel?

[2016-02-06 17:41:39] INFO: GTDB v0.0.2 (NCBI database 2015-11-27)
[2016-02-06 17:41:39] INFO: gtdb -t 40 genomes add --create_list abisko_assembly73_bins --checkm_results ../checkm/CHECKM_FILE --batchfile batchfile --study_file study_file
[2016-02-06 17:41:39] INFO: Adding genomes to database.
[2016-02-06 17:41:39] INFO: Parsing Study file.
[2016-02-06 17:41:39] INFO: Reading CheckM file.
[2016-02-06 17:44:26] INFO: Running Prodigal to identify genes.
==> Finished processing 1529 of 1529 (100.00%) genomes.
[2016-02-06 18:07:15] INFO: Calculating and storing metadata for each genome.
[2016-02-07 00:33:34] INFO: Identifying TIGRfam protein families.
==> Finished processing 1529 of 1529 (100.00%) genomes.
[2016-02-07 04:12:12] INFO: Identifying Pfam protein families.
==> Finished processing 915 of 1529 (59.84%) genomes.

Note the time for "Calculating and storing metadata for each genome.": it ran from 18:07:15 to 00:33:34, roughly six and a half hours for 1529 genomes.
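To make the two ideas in the question concrete (smaller transactions and parallel calculation), here is a minimal sketch. It is not the GTDB implementation. The names `gc_and_length`, `batched`, and `store_metadata`, the `metadata` table layout, and the use of an in-memory SQLite database in place of the real GTDB database are all assumptions made for illustration.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def gc_and_length(record):
    """Hypothetical per-genome metadata: sequence length and GC fraction."""
    genome_id, sequence = record
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    length = len(seq)
    return genome_id, length, (gc / length if length else 0.0)


def batched(iterable, size):
    """Yield lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def store_metadata(conn, records, workers=4, batch_size=100):
    """Compute metadata concurrently, then commit one transaction per batch."""
    # Assumed table layout; the real GTDB schema is not shown in this thread.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metadata "
        "(genome_id TEXT PRIMARY KEY, length INTEGER, gc REAL)"
    )
    stored = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order and yields results as they complete.
        results = pool.map(gc_and_length, records)
        for chunk in batched(results, batch_size):
            with conn:  # commits on success, rolls back on error
                conn.executemany(
                    "INSERT INTO metadata VALUES (?, ?, ?)", chunk
                )
            stored += len(chunk)
    return stored
```

All database writes stay on the calling thread, so only the calculation is spread across workers. For CPU-bound calculations in Python, `concurrent.futures.ProcessPoolExecutor` would likely be the better fit and should work with the same `map` call. Whether smaller transactions actually help would need to be measured against the real database.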

pchaumeil commented 8 years ago

Multithreading for this step has been implemented and will be available in the next release of GTDB.

donovan-h-parks commented 8 years ago

Great!