DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

Centrifuge-Build nt Memory Requirements #73

Open feltzmc opened 7 years ago

feltzmc commented 7 years ago

Hi all,

I am attempting to build an up-to-date Centrifuge nt index on a machine with 40 cores and 196 GB of RAM. Your example for building nt suggests using 16 threads and a bmax of 1342177280, which I understand is the maximum number of suffixes per bin. I am attempting to build nt using these settings and each time I have attempted it the build dies with "Killed". I believe the issue is that the build is running out of memory but I can't find any information on how much memory is needed to successfully build an nt index.

Could you add some information to the manual about choosing suitable settings for building nt? Maybe something on how bmax and threads might translate to maximum memory usage?

Thanks, Matt

feltzmc commented 7 years ago

I can now confirm that memory was the issue: the nt index built successfully using 16 threads on a machine with 488 GB of memory. I am not sure the index was built correctly, however, because all reads are classified as "no rank".

feltzmc commented 7 years ago

I discovered why the summary file only contained the classification "no rank". The directions for building the nt index are out of date: the map file needs to be generated from nucl_gb.accession2taxid instead of from GI numbers.
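The map-file step described in this comment can be sketched roughly as follows. This is a minimal sketch, not the procedure from the Centrifuge manual. It assumes that nucl_gb.accession2taxid has already been decompressed, that it is tab-separated with a header row naming the columns `accession.version` and `taxid`, and that the sequence IDs in nt.fa use the accession.version form. The function name `accession2taxid_to_map` and the output file name are hypothetical.

```python
def accession2taxid_to_map(src, dst):
    """Write 'accession.version<TAB>taxid' lines from src to dst.

    src: iterable of text lines in the assumed accession2taxid layout
         (tab-separated, first line is a header naming the columns).
    dst: writable text file object.
    Returns the number of mapping lines written.
    """
    lines = iter(src)
    header = next(lines).rstrip("\n").split("\t")
    # Locate the columns by name rather than by position (assumed names).
    acc_col = header.index("accession.version")
    tax_col = header.index("taxid")
    written = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(acc_col, tax_col):
            continue  # skip blank or truncated lines
        dst.write(f"{fields[acc_col]}\t{fields[tax_col]}\n")
        written += 1
    return written


# Hypothetical usage (file names are placeholders):
# with open("nucl_gb.accession2taxid") as src, open("acc2taxid.map", "w") as dst:
#     print(accession2taxid_to_map(src, dst), "mappings written")
```

The resulting two-column file would then be the one passed to `--conversion-table` in place of a GI-based map, provided its first column matches the IDs in the FASTA headers.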

khyox commented 7 years ago

@feltzmc, thanks for all the updates. As you correctly mentioned, the instructions for building the nt index are obsolete. I successfully built the nt index database using 32 threads on a fat node with 512 GB of memory. It took ~15 hours.

rvaerle commented 6 years ago

Unfortunately, I don't have access to a machine with that much RAM. Does anyone have a relatively recent nt index database for Centrifuge that they would like to share?

khyox commented 6 years ago

Sure, please check the Centrifuge-nt download section of the Recentrifuge wiki. There, you also have detailed and updated instructions about how to build your own Centrifuge databases, just in case.

dunedice commented 4 years ago

Is this true? The centrifuge directions from the site are not updated?

I tried:

~/Downloads/centrifuge-1.0.3-beta/centrifuge-build --conversion-table gi_taxid_nucl.map --taxonomy-tree taxdump/nodes.dmp --name-table taxdump/names.dmp nt.fa nt

Settings:
  Output files: "nt..cf"
  Line rate: 7 (line is 128 bytes)
  Lines per side: 1 (side is 128 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  nt.fa
Reading reference sizes
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
  Time reading reference sizes: 01:18:10
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
Killed: 9

I have tried this job with 500 GB of memory on a cluster. Seriously. I thought Centrifuge was selling itself as simple enough to run on a laptop without needing that much memory.

Is there an updated Centrifuge guide? The one I have is:

https://ccb.jhu.edu/software/centrifuge/manual.shtml#running-centrifuge
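
The "Killed: 9" at the end of the log above means the process received signal 9 (SIGKILL). The kernel's out-of-memory handling or a cluster scheduler enforcing a memory limit commonly sends this signal, but the message alone does not prove memory was the cause. A minimal sketch of how a wrapper script might make this explicit, assuming a POSIX system; the helper names `describe_exit` and `run_and_describe` are hypothetical and not part of Centrifuge:

```python
import signal
import subprocess


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable message.

    On POSIX, subprocess reports a negative return code -N when the child
    was terminated by signal N.
    """
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            # Signal number has no symbolic name on this platform.
            name = f"signal {-returncode}"
        return f"terminated by {name}"
    return f"exited with status {returncode}"


def run_and_describe(argv):
    """Run a command (list of arguments) and describe how it ended."""
    completed = subprocess.run(argv)
    return describe_exit(completed.returncode)


# Hypothetical usage, reusing the arguments from the command in this comment:
# print(run_and_describe(["centrifuge-build", "--conversion-table",
#                         "gi_taxid_nucl.map", "--taxonomy-tree",
#                         "taxdump/nodes.dmp", "--name-table",
#                         "taxdump/names.dmp", "nt.fa", "nt"]))
```

If the wrapper reports SIGKILL, the system log or the scheduler's job accounting is the place to confirm whether a memory limit was hit.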