DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

not able to build database, taxonomy does not exist warnings, latest version by git #266

Open koppk opened 8 months ago

koppk commented 8 months ago

Hi, I saw other people also run into this problem. I cannot build a database and the problem seems to be the domain bacteria.

I get thousands of warnings like these:

Warning: taxonomy id doesn't exists for NZ_OY986432.1!
Warning: taxonomy id doesn't exists for NZ_OY986431.1!
Warning: taxonomy id doesn't exists for NZ_OY986433.1!
Warning: Taxonomy ID 1516075 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 3111325 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 3111776 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!

This is with the latest version, built from source:

git clone https://github.com/DaehwanKimLab/centrifuge
cd centrifuge
make
sudo make install prefix=/usr/local

For the taxonomy, I tried not only

centrifuge-download -o taxonomy taxonomy

but also downloading and extracting the latest (Jan 2024) taxdump.tar.gz directly from NCBI.

No difference. The 2cf file is always empty (size 0), as others have observed before!

What I run on a 32-core, 256 GB RAM Scaleway instance:

nohup centrifuge-build -p 16 --conversion-table seqid2taxid.map --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp input-sequences.fna abfpv &

Beforehand, I had built the archaea, bacteria, viral, fungi, and protozoa seqid2taxid.map files and concatenated them into one.
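To see how much of the concatenated map is affected by the "not in the provided taxonomy tree" warning, a quick check along these lines might help. It is only a sketch: the function name missing_taxids is made up for illustration, and it assumes the map has two tab-separated columns (sequence ID, taxonomy ID) and that the first tab-delimited field of nodes.dmp is the taxonomy ID.

```shell
#!/bin/sh
# Hypothetical helper (not part of centrifuge): list taxonomy IDs that occur
# in a seqid-to-taxid map but are absent from nodes.dmp, together with the
# number of map entries that use each missing ID.
missing_taxids() {
    map_file=$1     # assumed format: <sequence ID><TAB><taxonomy ID>
    nodes_file=$2   # NCBI nodes.dmp; first tab-delimited field is the tax ID
    awk -F '\t' '
        NR == FNR { known[$1] = 1; next }   # first file: remember known tax IDs
        !($2 in known) { missing[$2]++ }    # second file: count unknown tax IDs
        END { for (id in missing) print id "\t" missing[id] }
    ' "$nodes_file" "$map_file" | sort -n
}
```

With the file names from the build command above, it could be called as missing_taxids seqid2taxid.map taxonomy/nodes.dmp. Each output line is a missing taxonomy ID followed by the number of sequences mapped to it.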

This is very frustrating! I had used centrifuge successfully before but wanted to apply the latest version for some urgent research ...

mourisl commented 8 months ago

The warnings should be fine. I think the issue is still the memory. The current bacteria database is quite large, so the 256G memory is likely not enough (but fairly close). You can try option "-a" with smaller values for "--bmax" and a larger value for "--dcv".
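As a rough sketch of that suggestion, the build command from the original post could be extended as below. The numbers are placeholders chosen only for illustration, not values recommended in this thread, and build_cmd is a hypothetical helper name. As far as I understand the centrifuge-build usage text, "-a" turns off the automatic memory fitting so that "--bmax" and "--dcv" are taken as given.

```shell
#!/bin/sh
# Placeholder settings (assumptions, not tested recommendations).
# Override them from the environment to fit your machine.
BMAX=${BMAX:-1342177280}   # smaller bucket size for the suffix-array builder
DCV=${DCV:-4096}           # larger difference-cover period
THREADS=${THREADS:-16}

# Print the centrifuge-build call from the original post with -a, --bmax
# and --dcv added, so it can be reviewed before anything is started.
build_cmd() {
    echo centrifuge-build -p "$THREADS" -a --bmax "$BMAX" --dcv "$DCV" \
        --conversion-table seqid2taxid.map \
        --taxonomy-tree taxonomy/nodes.dmp \
        --name-table taxonomy/names.dmp \
        input-sequences.fna abfpv
}

build_cmd
```

The script only prints the command. Once the values look reasonable, the printed line can be pasted after nohup as in the original post.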