DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

Centrifuge index not much smaller than input database FASTA #181

mmp3 closed this issue 4 years ago

mmp3 commented 4 years ago

I downloaded RefSeq genomes for bacteria (~15,000), archaea (~300), fungi (11), protozoa (3), and human. I also downloaded Genbank viral genomes (~33,000). Combined, this database is a 60 GB FASTA (uncompressed), of which 56 GB is the bacteria RefSeq genomes, and another 3 GB is the human genome.

centrifuge-build produced index files totaling ~37 GB. This is surprising because one major advantage of Centrifuge is supposed to be its condensed representation of the database, small enough to fit into memory, e.g. on a standard laptop.

This is also surprising because this database is almost the same as the "p+h+v" database hosted on the centrifuge website at JHU, yet the "p+h+v" database is only 11.3 GB uncompressed.

Why is the index so large? How can nearly the same database produce index files that are about 3.3 times the size (~37 GB vs. 11.3 GB) of the "p+h+v" index hosted at JHU?

Also, where is the option to build a "compressed" index in centrifuge-build? No such option is listed in the command line help, as far as I can tell.

Thank you!

mourisl commented 4 years ago

The p+h+v index on our website was built in 2016, and the bacterial database has grown a lot since then.

In the indices folder of Centrifuge, you can run "make p_compressed+h+v" to obtain the database you want. You can also change the number of threads used, and you need to add the centrifuge folder to $PATH. Note that the compression is a fairly slow procedure.
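
A minimal sketch of those steps, for anyone following along. The checkout location `$HOME/centrifuge` and the helper name `build_compressed_index` are assumptions for illustration. Passing the thread count as `THREADS=...` assumes the Makefile in the indices folder exposes a variable of that name, so check the Makefile before relying on it.

```shell
# Assumption: Centrifuge is cloned and built at this (hypothetical) location.
CENTRIFUGE_HOME="${CENTRIFUGE_HOME:-$HOME/centrifuge}"

# Assumption: the indices Makefile reads a THREADS variable; verify in the Makefile.
THREADS="${THREADS:-8}"

# The build scripts need the centrifuge binaries on $PATH.
export PATH="$CENTRIFUGE_HOME:$PATH"

# Hypothetical helper: runs the make target named above from the indices folder.
build_compressed_index() {
    (cd "$CENTRIFUGE_HOME/indices" && make THREADS="$THREADS" p_compressed+h+v)
}

# Call build_compressed_index when ready. The compression step is slow,
# so running it inside screen/tmux or a batch job is probably wise.
```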

fanninpm commented 3 years ago

"Note that the compression is a fairly slow procedure."

Especially when it tries to compress an Escherichia coli K-12 ER2796 chromosome that's roughly 100 times the size that it should be.
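
One way to spot such an oversized record before building is to list sequence lengths in the input FASTA and look at the largest ones. This is only a sketch: `fasta_lengths` is a hypothetical helper name, `input.fa` is a placeholder file name, and it assumes a plain uncompressed FASTA.

```shell
# Hypothetical helper: print "<length><TAB><header>" for each FASTA record.
fasta_lengths() {
    awk '
        /^>/ {
            # New record: emit the previous one, then reset the counter.
            if (name != "") print len "\t" name
            name = substr($0, 2)
            len = 0
            next
        }
        { len += length($0) }                       # accumulate sequence characters
        END { if (name != "") print len "\t" name } # emit the last record
    ' "$1"
}

# Usage (input.fa is a placeholder): show the ten longest records.
#   fasta_lengths input.fa | sort -k1,1nr | head
```

An E. coli K-12 chromosome is normally around 4.6 Mbp, so a record for it that is hundreds of Mbp long would stand out at the top of that list.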