DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

Support for the new taxid files (accession2taxid) for building the nt database? #210

Closed by cae803 2 years ago

cae803 commented 3 years ago

Hi, authors. Thank you for distributing Centrifuge!

I'd like to build the nt database by following the manual. However, I'm having trouble obtaining a map file.

The gi_taxid_nucl.dmp.gz seems to be out of date. The readme file of the gi_taxid file (ftp://ftp.ncbi.nih.gov/pub/taxonomy//obsolete/gi_taxid.readme) says "the gi_taxid* files update in this directory has been discontinued. Please use files from directory ./accession2taxid".

Are there any plans to support the accession2taxid file? It would be nice to be able to use the new taxid file.

mourisl commented 3 years ago

The manual might be a bit out of date. To build the nt database, you can try the "make nt" command in the indices folder. You can check the Makefile there for more details.
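For readers who want to see what such a map file looks like, here is a minimal Python sketch that turns an accession2taxid file into a two-column table (sequence ID, then taxonomy ID, separated by a tab), which is the kind of file passed to centrifuge-build via --conversion-table. It assumes the NCBI layout with a header row naming the columns accession, accession.version, taxid and gi, and it assumes the sequence IDs in the FASTA file match the accession.version column. The function name and the sample rows are made up for illustration; this is not taken from the Makefile.

```python
import io


def accession2taxid_to_conversion_table(src, dst):
    """Write 'accession.version<TAB>taxid' lines from src to dst.

    src and dst are text file objects. Returns the number of rows written.
    """
    header = src.readline().rstrip("\n").split("\t")
    # Look the columns up by name instead of hard-coding their positions.
    version_col = header.index("accession.version")
    taxid_col = header.index("taxid")
    written = 0
    for line in src:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(version_col, taxid_col):
            continue  # skip blank or truncated lines
        dst.write(f"{fields[version_col]}\t{fields[taxid_col]}\n")
        written += 1
    return written


# Made-up rows in the assumed four-column layout (not real NCBI records).
sample = (
    "accession\taccession.version\ttaxid\tgi\n"
    "AB000001\tAB000001.1\t1111\t0\n"
    "AB000002\tAB000002.2\t2222\t0\n"
)
out = io.StringIO()
count = accession2taxid_to_conversion_table(io.StringIO(sample), out)
print(out.getvalue(), end="")
```

For a real download, the same function could be fed a handle from gzip.open(path, "rt") and an output file opened for writing, so the file is streamed line by line rather than loaded into memory.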

cae803 commented 3 years ago

Dear @mourisl, thank you for the suggestion. I'll try the command!

vscmarques commented 2 years ago

Hello @cae803, have you successfully built the nt database? I am having a lot of difficulty. If you managed to do it, could you please tell me how? I would be extremely thankful. Thank you!

cae803 commented 2 years ago

Hi @vscmarques, I stopped building the database because I ran out of memory. The error was as follows:

Calculating joined length
Writing header
Reserving space for joined string
Could not allocate space for a joined string of 441911838016 elements.
Please try running centrifuge-build on a computer with more memory.
Total time for call to driver() for forward index: 01:47:33
Error: Encountered internal Centrifuge exception (#1)
Command: centrifuge-build-bin --wrapper basic-0 -p 1 --ftabchars=14 --conversion-table /dev/fd/63 --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp nt-dusted.fna tmp_nt/nt
Deleting "tmp_nt/nt.1.cf" file written during aborted indexing attempt.
Deleting "tmp_nt/nt.2.cf" file written during aborted indexing attempt.
Deleting "tmp_nt/nt.3.cf" file written during aborted indexing attempt.
Deleting "tmp_nt/nt.1.cf" file written during aborted indexing attempt.
Deleting "tmp_nt/nt.2.cf" file written during aborted indexing attempt.
Deleting "tmp_nt/nt.3.cf" file written during aborted indexing attempt.

I am currently arranging additional memory.

cae803 commented 2 years ago

Hi @vscmarques, I finally finished building the nt database using the following command:

make THREADS=32 nt

I referred to this wiki: https://github.com/khyox/recentrifuge/wiki/Centrifuge-nt

It required about 300 GB of memory on my workstation. Here is the time consumed:

real 513m38.670s
user 4125m4.993s
sys 75m53.097s
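To put those figures in more familiar units, here is a small Python sketch that converts the three values reported by time into hours and estimates how many cores were busy on average. The helper name to_seconds is made up for illustration; the input values are copied from the output above.

```python
import re


def to_seconds(stamp):
    """Convert a 'time'-style value such as '513m38.670s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time value: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


real = to_seconds("513m38.670s")      # wall-clock time
user = to_seconds("4125m4.993s")      # CPU time spent in user mode
system = to_seconds("75m53.097s")     # CPU time spent in the kernel

cpu = user + system
print(f"wall-clock time: {real / 3600:.2f} h")
print(f"CPU time (user + sys): {cpu / 3600:.2f} h")
print(f"average busy cores: {cpu / real:.1f}")
```

This works out to roughly 8.6 hours of wall-clock time and about 70 hours of CPU time, so on average around 8 cores were busy even though the build was started with THREADS=32.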