DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

Approximating memory use by bucket size and difference-cover period #168

Open · piyuranjan opened this issue 5 years ago

piyuranjan commented 5 years ago

Hi Centrifuge developers,

I am trying to build a Centrifuge database from RefSeq sequences for bacteria, archaea, fungi, viruses, and protozoa, along with the human genome. The total file size for the sequences is around 590 GB. I am doing this on a cluster where the maximum memory I can access is 1 TB (2.27 GHz x 40 Xeon cores), but I would prefer to request only 500 GB (2.27 GHz x 20 Xeon cores) because those jobs clear the queue faster. With that in mind, do you have any suggestions for the bucket size and difference-cover period parameters I should use?
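For reference, the command I have in mind looks roughly like the sketch below. The file names and index prefix are placeholders, the conversion-table/taxonomy files would come from centrifuge-download as in the manual's RefSeq example, and the --bmax/--dcv values are exactly the knobs I am asking about:

```bash
# Rough sketch of the planned build; all file names are placeholders.
# seqid2taxid.map, nodes.dmp and names.dmp would come from centrifuge-download.
centrifuge-build -p 20 \
    --conversion-table seqid2taxid.map \
    --taxonomy-tree nodes.dmp \
    --name-table names.dmp \
    --bmax 1342177280 --dcv 1024 \
    input-sequences.fna refseq_abvfp_human
# The --bmax and --dcv values above are only examples; choosing them well
# for this reference size is what I am asking about.
```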

I tried to find an estimate of how much memory Centrifuge would require under different settings of --bmax and --dcv, but could not find anything concrete. Is there a rough estimate or chart available that would help me choose these parameters? I have seen the recentrifuge wiki, where the authors use --ftabchars=14 -p 32 --bmax 1342177280 and build the nt database in about 20 hours on a cluster node with 500 GB of RAM. I apologize for my naivete, but is there an explanation of how memory scales as these numbers are increased or decreased?
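To make the question concrete, this is the kind of back-of-the-envelope arithmetic I have been attempting; the 2 bits per packed base and 8 bytes per suffix offset are my own assumptions, not numbers taken from the Centrifuge code:

```bash
# Very rough, assumption-laden estimate -- not Centrifuge's actual accounting.
N=$((590 * 1024**3))     # total reference length in bases (~590 GB of sequence)
BMAX=1342177280          # bucket size from the recentrifuge wiki example
BYTES_PER_SUFFIX=8       # assumed size of one suffix offset while sorting a block
echo "packed reference: ~$(( N / 4 / 1024**3 )) GB (assuming 2 bits/base)"
echo "one sort block:   ~$(( BMAX * BYTES_PER_SUFFIX / 1024**3 )) GB at --bmax=$BMAX"
# I do not know how to put a number on the difference-cover sample; presumably
# it shrinks as --dcv grows, which is exactly the part I am asking about.
```

Even a statement along the lines of "peak memory is roughly packed reference + one sort block + difference-cover sample + overhead" would help me size the job, if that picture is approximately right.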

From the centrifuge-build documentation, it seems that --offrate and --ftabchars also change the memory used during an index build. But changing those parameters likely also changes the compression of the index, right? Is there an estimate of how these parameters affect memory usage and the accuracy of Centrifuge's taxonomic assignments? Would you recommend changing them as well to improve resource usage during the index build without compromising accuracy?
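For what it is worth, my current mental model of those two parameters is sketched below; it is based on the bowtie2-build documentation that Centrifuge's indexer appears to inherit from, so please correct me if Centrifuge handles them differently:

```bash
# Assumptions carried over from bowtie2-build's documentation; Centrifuge's
# defaults and exact behaviour may differ.
FTABCHARS=14             # example value, as in the recentrifuge wiki
OFFRATE=4                # example value only; I do not know Centrifuge's default
N=$((590 * 1024**3))     # total reference length in bases (rough)
FTAB_BYTES=$(( 4**(FTABCHARS + 1) ))   # ftab lookup table: 4^(ftabchars+1) bytes
SA_ENTRIES=$(( N / 2**OFFRATE ))       # one position marked every 2^offrate BWT rows
echo "ftab:      ~$(( FTAB_BYTES / 1024**2 )) MB at --ftabchars=$FTABCHARS"
echo "SA sample: ~$(( SA_ENTRIES * 8 / 1024**3 )) GB at --offrate=$OFFRATE (assuming 8 B/entry)"
```

If that picture is right, both parameters mainly trade memory and index size against lookup speed rather than changing which assignments are made, but I would appreciate confirmation from your side.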

Please let me know if you need more information about my use case.

Thank you!