DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

The second file of custom db is empty #264

Closed LilyAnderssonLee closed 7 months ago

LilyAnderssonLee commented 8 months ago

Hi, I am in the process of building a database from RefSeq data covering bacteria, viruses, archaea, fungi, parasites, protozoa, plasmids, and even contaminants. The input data is quite large, around 1.3 TB in size.

However, I've run into an issue where the second index file, db.2.cf, always turns out empty. Has anyone else had this problem? Here is the script I've been using:

```shell
#!/bin/bash
#SBATCH -A xx
#SBATCH -p core
#SBATCH -n 50
#SBATCH -t 10-00:00:00
#SBATCH -J centrifuge_db
#SBATCH --mem=400GB

centrifuge-build -p 50 --bmax 3342177280 --conversion-table seqid2taxid.map \
    --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \
    input-sequences.fna db
```
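Before digging further, a quick sanity check after any `centrifuge-build` run is to confirm that every index file it wrote is non-empty (an empty `db.2.cf` is exactly the symptom reported here). A minimal sketch, assuming the index files are named `db.1.cf` through `db.3.cf` as in the command above:

```shell
# Report the size of each expected index file; flag missing/empty ones.
for f in db.1.cf db.2.cf db.3.cf; do
    if [ -s "$f" ]; then
        echo "$f: $(wc -c < "$f") bytes"
    else
        echo "$f: missing or empty" >&2
    fi
done
```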

mourisl commented 8 months ago

I think for 1.3TB sequences, you may need about 3TB memory to build the index...
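A back-of-the-envelope check based on the ratio quoted above (roughly 3 TB of RAM for 1.3 TB of input, i.e. about 2.5x the FASTA size — an assumption, not a documented formula):

```shell
# Estimate peak build memory as ~2.5x the input FASTA size.
input_gb=1300                      # size of input-sequences.fna, in GB
needed_gb=$(( input_gb * 5 / 2 ))  # 2.5x, kept in integer arithmetic
echo "estimated peak memory: ~${needed_gb} GB"
```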

LilyAnderssonLee commented 7 months ago

@mourisl Thanks for your response. Unfortunately, I don't have that much memory available. I suppose I'll need to reduce the data size, perhaps by keeping only the representative genome for each species.
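One way to do that selection is to filter NCBI's per-domain `assembly_summary.txt` on its `refseq_category` column. In the real tab-separated file that is column 5, with the FTP path in column 20, but verify against the header of your copy; the toy input below (columns 2 and 3) just demonstrates the filter:

```shell
# Toy stand-in for assembly_summary.txt: accession, refseq_category, path.
printf 'GCF_000005845.2\trepresentative genome\tftp://example/rep\nGCF_000008865.2\tna\tftp://example/na\n' > assembly_summary.tsv

# Keep only reference/representative genomes and print their paths.
# For the real file, change $2/$3 to $5/$20.
awk -F '\t' '$2 == "reference genome" || $2 == "representative genome" { print $3 }' \
    assembly_summary.tsv
```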

LilyAnderssonLee commented 7 months ago

@mourisl I am wondering what k-mer length is used during genome compression in the Centrifuge h+p+v+c database, and what the default k-mer length is in database construction?

Are you planning to update the Centrifuge databases or create Centrifuge databases based on all RefSeq genomes?

mourisl commented 7 months ago

Centrifuge itself does not use k-mers. The compression step uses 31-mers, but only to cluster the more similar strains within a species, so that information is not directly used in the compression either.
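To make the clustering idea concrete: genomes sharing a large fraction of their k-mers are grouped together. A toy illustration of that similarity measure (k=5 on short strings so the sets stay readable; Centrifuge's actual 31-mer clustering is far more involved):

```shell
# Emit the distinct k-mers (k=5) of a sequence, one per line, sorted.
kmers() {
    printf '%s\n' "$1" |
        awk -v k=5 '{ for (i = 1; i <= length($0) - k + 1; i++) print substr($0, i, k) }' |
        sort -u
}

kmers ACGTACGTAC > a.kmers
kmers ACGTACGTTT > b.kmers

shared=$(comm -12 a.kmers b.kmers | wc -l)   # k-mers in both genomes
union=$(sort -u a.kmers b.kmers | wc -l)     # k-mers in either genome
echo "shared $shared of $union distinct 5-mers"   # Jaccard = shared/union
```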

For the recent RefSeq prokaryotic genomes, the total size is too large, and the resulting index is above 80GB, which exceeds Zenodo's file size limit...