Database download taking a lot of disk space + taking too long

DaehwanKimLab / centrifuge

Classifier for metagenomic sequences

GNU General Public License v3.0


Database download taking a lot of disk space + taking too long #265

Closed pablorr24 closed 8 months ago

pablorr24 commented 8 months ago

I'm trying to build the database for my metagenomics analysis. I ran the following commands, but the download is taking too much disk space (bacteria was already at 35 GB with only 19% downloaded).

centrifuge-download -o taxonomy taxonomy
centrifuge-download -o library -m -d "archaea,bacteria,viral" refseq > seqid2taxid.map

I had to stop the download because I was running out of disk space. Aren't the databases supposed to take far less disk space? Can someone guide me on the right commands to create the database?

mourisl commented 8 months ago

The current RefSeq microbiome database is very large, probably around 150 GB of nucleotide sequence. This is the raw sequence size before building the index. The final index is probably around 80 GB and requires about the same amount of memory to run. Do you plan to run Centrifuge on your local machine or on a server? You need a large-memory machine to create the index.
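As a rough sanity check on these numbers, the figures from the original post (35 GB at 19% progress) can be extrapolated to a full download size. This is only a sketch: it assumes the reported percentage scales linearly with bytes downloaded, which may not be how `centrifuge-download` reports progress, and the function name `projected_total_gb` is made up for illustration.

```python
def projected_total_gb(downloaded_gb: float, fraction_done: float) -> float:
    """Linearly extrapolate the final download size from partial progress.

    Assumes progress is proportional to bytes downloaded (an assumption,
    not something the thread confirms).
    """
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return downloaded_gb / fraction_done


# Figures reported in the original post: 35 GB downloaded at 19%.
estimate = projected_total_gb(35, 0.19)
print(f"Projected bacteria download: ~{estimate:.0f} GB")
```

Under that assumption the bacteria library alone would land in the same range as the roughly 150 GB quoted above, so the observed disk usage is consistent with the expected raw sequence size rather than a sign of a faulty download.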

pablorr24 commented 8 months ago

Thanks for the quick response :) I'm currently working on my own machine, so I have limited space. Is there a way to get access to a smaller database, or any other alternative that takes less than 50-60 GB?

mourisl commented 8 months ago

You can try our newer method, Centrifuger: https://github.com/mourisl/centrifuger. We have a recently created (2023/06) index at https://zenodo.org/records/10023239, about 45 GB in size, though it includes the human genome as part of the index. It should take less than 50 GB of disk space and memory to run.
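Before downloading an index of this size, it may help to check that the machine actually has about 50 GB of free disk space and RAM. The following is a minimal sketch using only the Python standard library; the 50 GB threshold comes from the comment above, the helper names are hypothetical, GB is treated as 1024**3 bytes, and the RAM lookup relies on `os.sysconf`, which is only available on Unix-like systems.

```python
import os
import shutil

REQUIRED_GB = 50  # figure quoted in the thread for the Centrifuger index


def has_enough(available_bytes: int, required_gb: float) -> bool:
    """Return True if available_bytes covers required_gb (1 GB = 1024**3 bytes)."""
    return available_bytes >= required_gb * 1024**3


def free_disk_bytes(path: str = ".") -> int:
    """Free space on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free


def total_ram_bytes():
    """Total physical memory, or None if the platform does not expose it."""
    try:
        return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    except (ValueError, OSError, AttributeError):
        return None


if __name__ == "__main__":
    disk_ok = has_enough(free_disk_bytes("."), REQUIRED_GB)
    print(f"Enough free disk for the index: {disk_ok}")

    ram = total_ram_bytes()
    if ram is None:
        print("Could not determine total RAM on this platform.")
    else:
        print(f"Enough total RAM for the index: {has_enough(ram, REQUIRED_GB)}")
```

Total RAM is an upper bound; memory already in use by other processes reduces what is actually available to run the classifier.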