DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

centrifuge-build hanging up #199

Open GastonViarengo opened 3 years ago

GastonViarengo commented 3 years ago

Hello everyone. I've recently started using Centrifuge, and I've been able to create a viral index and use it with my metagenomic data. However, when I try to build a bacterial index (bac), the process hangs (at least, that's the only explanation I've found so far). I'm using the following command:

centrifuge-build -p 8 --conversion-table seqid2taxid.map --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp inputs/seq_bac.fna indices/bac

The files bac.1.cf, bac.2.cf, and bac.3.cf are created within a few minutes of the job starting, but bac.2.cf is 0 KB in size. The output shows:

Settings:
  Output files: "indices/bac.*.cf"
  Line rate: 7 (line is 128 bytes)
  Lines per side: 1 (side is 128 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  inputs/seq_bac.fna
Reading reference sizes
Warning: Encountered reference sequence with only gaps
  Time reading reference sizes: 00:07:04
Calculating joined length
Writing header
Reserving space for joined string
Could not allocate space for a joined string of 67127059294 elements.
Switching to a packed string representation.
Reading reference sizes
Warning: Encountered reference sequence with only gaps
  Time reading reference sizes: 00:07:04
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:07:05
Warning: taxomony id doesn't exists for NC_017270.1! (repeated several times for different IDs)
Warning: Taxonomy ID 90270 is not in the provided taxonomy tree (taxonomy/nodes.dmp)! (repeated several times for different IDs)

Even after leaving it running for a few days, the bac.*.cf files show no modifications and the output is frozen (I believe the process has hung).

I've tried removing the erroneous IDs, but the process still hangs.
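
The two kinds of warnings in the log above (a sequence ID with no taxonomy ID, and a taxonomy ID missing from the tree) can be listed before starting a build. The following is a minimal Python sketch, not part of Centrifuge. It assumes that the conversion table has two whitespace-separated columns (sequence ID, taxonomy ID), that a FASTA sequence ID is the first token of each header line, and that nodes.dmp uses the NCBI `|`-delimited layout with the taxonomy ID in the first field. The function names and the sample data are made up for illustration.

```python
import io


def fasta_ids(handle):
    """Yield the first whitespace-delimited token of every FASTA header line."""
    for line in handle:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                yield tokens[0]


def load_conversion_table(handle):
    """Read a two-column seqid -> taxid table (assumed layout)."""
    table = {}
    for line in handle:
        fields = line.split()
        if len(fields) >= 2:
            table[fields[0]] = fields[1]
    return table


def load_tree_taxids(handle):
    """Collect the first '|'-delimited field (tax_id) of each nodes.dmp row."""
    return {line.split("|", 1)[0].strip() for line in handle if line.strip()}


def check_inputs(fasta, table, nodes):
    """Return (sequence IDs without a taxid, taxids absent from the tree)."""
    seq2tax = load_conversion_table(table)
    tree = load_tree_taxids(nodes)
    unmapped = sorted({sid for sid in fasta_ids(fasta) if sid not in seq2tax})
    not_in_tree = sorted({tid for tid in seq2tax.values() if tid not in tree})
    return unmapped, not_in_tree


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the real files (illustrative data only).
    fasta = io.StringIO(">NC_000001.1 demo\nACGT\n>NC_017270.1 demo\nACGT\n")
    table = io.StringIO("NC_000001.1\t562\nNC_999999.1\t90270\n")
    nodes = io.StringIO("1\t|\t1\t|\tno rank\t|\n562\t|\t561\t|\tspecies\t|\n")
    unmapped, not_in_tree = check_inputs(fasta, table, nodes)
    print("sequence IDs without a taxid:", unmapped)
    print("taxids not in the tree:", not_in_tree)
```

For real inputs, pass open file handles (for example `open("inputs/seq_bac.fna")`) instead of the `StringIO` objects. Note that cleaning up these IDs did not resolve the hang in the case reported above, so this check only addresses the warnings.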

Could you help me understand what's going on in order to solve this?

Thank you so much!

Best regards

Prof. Dr. Gastón Viarengo Institute of Molecular and Cellular Biology of Rosario (IBR-CONICET) Human Virology Lab

mourisl commented 3 years ago

Sorry for the delayed reply. Which version of Centrifuge did you use? Thank you.

GastonViarengo commented 3 years ago

Hello Li Song, no problem, thanks for your response. I'm using version 1.0.4-beta. Could you help me find the problem? Thank you.

mourisl commented 3 years ago

I just checked the log and realized that I fixed this bug after the release of 1.0.4-beta. Can you try git clone to get the most recent version of Centrifuge? Thank you.

GastonViarengo commented 3 years ago

Thanks, Li Song. I'll try that and let you know how it goes. What was the bug? Best, Gastón.

fanninpm commented 3 years ago

I also ran into this (or a similar issue) while I was using the provided Makefile to make an nt database. Compiling 65c42fc from source did not change anything.

choede commented 2 years ago

Hi, I have a similar issue with nt. I'm using version 1.0.4. I modified the map file so that it starts with:

accession.version taxid
A00001.1 10641
A00002.1 9913
A00003.1 9913
A00004.1 32630
A00005.1 32630

and launched:

centrifuge-build -p 16 --bmax 1342177280 --conversion-table gi_taxid_nucl.2map \
  --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \
  nt.fa nt

After one hour, the process stops writing anything. nt.1.cf and nt.3.cf are not empty, but nt.2.cf is empty. I only have warnings in the output logs. The process uses only one CPU. Moreover, the nt indexes available on the Centrifuge website are not up to date (they are from 2018). Could you help me, please? Thanks a lot in advance.

Jolvii85 commented 2 years ago

Hi all, I have the same error with nt. Has anyone fixed it?

savytskanatalia commented 2 years ago

Hi all, I have a similar problem with a custom database. Did anyone figure it out?

Jolvii85 commented 2 years ago

I gave up finally!


wittler-github commented 1 year ago

For me, the warning "Warning: taxomony id doesn't exists for NC_0####.1!" (repeated several times for different IDs) occurred because, when I concatenated several seqid2taxid.map files, a newline was sporadically missing at the junction between two files. This made centrifuge-build miss all the NCBI taxid entries after that point.
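
One way to avoid the missing-newline problem described above is to concatenate the map files with a small script that adds a newline after any part that does not end with one, and then to scan the result for fused lines. This is a minimal Python sketch under the assumption that each map line has exactly two tab-separated fields; the function names and file names are hypothetical.

```python
import os
import tempfile


def concat_with_newlines(parts, out_path, chunk_size=1 << 20):
    """Concatenate files, adding a newline after any part that lacks one."""
    with open(out_path, "wb") as out:
        for part in parts:
            last = b"\n"  # an empty part needs no extra newline
            with open(part, "rb") as src:
                for chunk in iter(lambda: src.read(chunk_size), b""):
                    out.write(chunk)
                    last = chunk[-1:]
            if last != b"\n":
                out.write(b"\n")


def malformed_lines(path, expected_fields=2):
    """Return (line number, text) for non-empty lines with the wrong field count."""
    bad = []
    with open(path, "r", encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            text = line.rstrip("\n")
            if text and len(text.split("\t")) != expected_fields:
                bad.append((number, text))
    return bad


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        first = os.path.join(tmp, "first.map")
        second = os.path.join(tmp, "second.map")
        merged = os.path.join(tmp, "seqid2taxid.map")
        with open(first, "wb") as fh:
            fh.write(b"NC_000001.1\t562")  # deliberately no trailing newline
        with open(second, "wb") as fh:
            fh.write(b"NC_000002.1\t1280\n")
        concat_with_newlines([first, second], merged)
        print("malformed lines after merging:", malformed_lines(merged))
```

Running `malformed_lines` on an existing, already-concatenated map file is a quick way to find a junction where two entries were fused into one line.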

ramnageena11 commented 1 year ago

Has anyone found a solution? I have been stuck on this for the last 20 days.

Thanks, Ram

ramnageena11 commented 1 year ago

Hello, any suggestions?

ramnageena11 commented 1 year ago

Hi, it seems I need to change my strategy for analyzing my data. Any suggestions other than Centrifuge? I am using long-read data from ONT. Will Kraken2 work for taxonomic analysis?

Please suggest. Thanks, RNS

ramnageena11 commented 1 year ago

hi

sarah-buddle commented 10 months ago

Hi, have there been any updates on this issue? I am encountering the same thing.

mourisl commented 10 months ago

How much memory do you have on your server and which database are you building? Thank you.

sarah-buddle commented 10 months ago

I am trying to build a custom database based on bacterial, viral, fungal and protozoan sequences downloaded from RefSeq. I'm running Centrifuge v1.0.4, and have tried both the conda installation and a build from source. The total size of my FASTA file is 148 GB. On my last attempt to build, I used 80 GB of memory and 8 cores. I didn't get any error messages about running out of memory. I only got warnings, e.g. "Warning: taxonomy id doesn't exists for NCxxx" as above, and the output file refseq.4.cf was empty. I have access to more memory, though, so I could try with that. The command I used to build was:

centrifuge-build --conversion-table ${db}/seqid2taxid.map --taxonomy-tree ${software}/taxdump/new_taxdump_2023-08-01/nodes.dmp --name-table ${software}/taxdump/new_taxdump_2023-08-01/names.dmp ${db}/refseq_all_genomic.fasta refseq -p 8

mourisl commented 10 months ago

With 148 GB of sequence, I think you may need about 600 GB of memory to build the index. You can increase the --dcv and --bmax values to reduce memory usage, but the build may take longer.
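
As a rough sanity check, the figures in the reply above (about 600 GB of memory for 148 GB of sequence) imply a ratio of roughly 4 GB of memory per GB of FASTA. The Python sketch below only applies that ratio. It is a back-of-the-envelope estimate extrapolated from this single data point, not a formula from the Centrifuge documentation, and the real requirement depends on build options such as --dcv and --bmax. The function names are hypothetical.

```python
# Ratio implied by the reply above: ~600 GB of memory for 148 GB of sequence.
# This is an assumption drawn from one data point, not a documented constant.
IMPLIED_RATIO = 600 / 148


def estimate_build_memory_gb(fasta_gb, ratio=IMPLIED_RATIO):
    """Rough memory estimate (GB) for building an index from fasta_gb of sequence."""
    return fasta_gb * ratio


def enough_memory(fasta_gb, available_gb, ratio=IMPLIED_RATIO):
    """True if available_gb meets the rough estimate for fasta_gb of sequence."""
    return available_gb >= estimate_build_memory_gb(fasta_gb, ratio)


if __name__ == "__main__":
    needed = estimate_build_memory_gb(148)
    print(f"estimated memory for 148 GB of FASTA: about {needed:.0f} GB")
    print("is 80 GB enough?", enough_memory(148, 80))
```

By this estimate, the 80 GB used in the attempt described earlier falls well short of what a 148 GB input would need with default settings.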

sarah-buddle commented 10 months ago

OK thank you, I will try that!