bbuchfink / diamond

Accelerated BLAST compatible local sequence aligner.

Failed building databases using GTDB-taxdump taxonomy files #611

Open emilhaegglund opened 1 year ago

emilhaegglund commented 1 year ago

I was trying to build a database using the taxonomy files from gtdb-taxdump, but it failed while reading names.dmp with the following message:

zcat  gtdb_proteomes/* | diamond makedb --db gtdb --taxonnames gtdb-taxdump/R207/names.dmp --taxonnodes gtdb_data/gtdb-taxdump/R207/nodes.dmp --taxonmap gtdb.protein.taxid.map

diamond v2.0.15.153 (C) Max Planck Society for the Advancement of Science
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

#CPU threads: 32
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Input file parameter (--in) is missing. Input will be read from stdin.
Opening the database file...  [0s]
Loading sequences...  [0.772s]
Masking sequences...  [0.157s]
Writing sequences...  [0.035s]
Writing accessions...  [0.072s]
Hashing sequences...  [0.013s]
Loading sequences...  [0s]
Writing trailer...  [0.003s]
Loading taxonomy nodes...  [28.213s]
Loading taxonomy names...  [78.105s]
Failed to allocate sufficient memory. Please refer to the manual for instructions on memory usage.

Here is an excerpt of the names.dmp from gtdb-taxdump:

head -20 gtdb-taxdump/R207/names.dmp
1   |   root    |       |   scientific name |
13926   |   001393675   |       |   scientific name |
14375   |   RUG14239 sp902797145    |       |   scientific name |
17689   |   001423155   |       |   scientific name |
20514   |   018334475   |       |   scientific name |
23859   |   013185635   |       |   scientific name |
34402   |   002214285   |       |   scientific name |
38289   |   001509495   |       |   scientific name |
66445   |   009903045   |       |   scientific name |
74747   |   000419015   |       |   scientific name |
78978   |   014222245   |       |   scientific name |
85313   |   001742655   |       |   scientific name |
88808   |   E44-bin52 sp004375875   |       |   scientific name |
121310  |   001585965   |       |   scientific name |
138721  |   VXYK01  |       |   scientific name |
147972  |   007121265   |       |   scientific name |
151528  |   007830495   |       |   scientific name |
157756  |   003411905   |       |   scientific name |
160336  |   002878095   |       |   scientific name |
173955  |   001247185   |       |   scientific name |

Do you have any idea why this could be happening? I haven't had any problems building databases with the NCBI-taxdumps.

Best regards, Emil Hägglund

bbuchfink commented 1 year ago

The taxids used in these files are > 2^31, which is not supported at the moment. I'll see what I can do about this.
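For anyone wanting to check whether their own taxdump is affected, a rough Python sketch along these lines could list the offending rows. It is not part of DIAMOND. The function names `oversized_taxids` and `scan_file` are hypothetical, and the default limit of 2^31 - 1 (the largest signed 32-bit integer) is an assumption based on the "> 2^31" remark above, not a confirmed bound from the DIAMOND source.

```python
from typing import Iterable, Iterator, List, Tuple

# Assumption: taxids must fit in a signed 32-bit integer. This is inferred
# from the "> 2^31" comment in this thread, not taken from DIAMOND's code.
INT32_MAX = 2**31 - 1


def oversized_taxids(
    lines: Iterable[str], limit: int = INT32_MAX
) -> Iterator[Tuple[int, str]]:
    """Yield (taxid, second_field) for .dmp rows whose taxid exceeds `limit`.

    Rows are expected in the pipe-delimited taxdump layout shown above,
    e.g. "1 | root | | scientific name |".
    """
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 2 or not fields[0].isdigit():
            continue  # skip blank or malformed rows
        taxid = int(fields[0])
        if taxid > limit:
            yield taxid, fields[1]


def scan_file(path: str, limit: int = INT32_MAX) -> List[Tuple[int, str]]:
    """Hypothetical helper: collect all oversized taxids from one .dmp file."""
    with open(path, encoding="utf-8") as handle:
        return list(oversized_taxids(handle, limit))
```

Calling `scan_file("gtdb-taxdump/R207/names.dmp")` would return the rows whose first column is above the assumed limit; an empty list would suggest the file stays within it. For nodes.dmp the second field is the parent taxid rather than a name.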

emilhaegglund commented 1 year ago

Ah, I suspected it was something like this. Now I know the cause of the error. Thanks for the quick reply!