DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

Kraken2 database building stalls at some point. #428

Open troublov opened 3 years ago

troublov commented 3 years ago

Dear Kraken2 developers and community, I've used Kraken2 before, but I once accidentally deleted all of its files (by the way, its size helped me notice my mistake just in time, before I deleted everything on the server, ahaha). Now I'm trying to build a Kraken2 nt database again. Everything goes fine up to a certain point:

kraken2-build --threads 10 --download-taxonomy --db nt
kraken2-build --threads 10 --download-library nt --db nt
time kraken2-build --threads 20 --build --db nt

Creating sequence ID to taxonomy ID map (step 1)...
Found 75377080/75377118 targets, searched through 802914419 accession IDs, search complete.
lookup_accession_numbers: 38/75377118 accession numbers remain unmapped, see unmapped.txt in DB directory
Sequence ID to taxonomy ID map complete. [28m6.631s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 261205726352 bytes
Capacity estimation complete. [1h31m26.531s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 22 bits reserved for taxid.
Processed 13409722 sequences (68159316789 bp)...

At this point the process freezes and makes no progress for more than 5 days. Nevertheless, all 20 threads are busy and 250 GB of the server's RAM (ca. 25%) is occupied.
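As a rough sanity check on the memory figures, the step 2 estimate can be converted from bytes to GiB and compared with what the server reports. The sketch below is illustrative only: the helper names `bytes_to_gib` and `estimate_from_log` are made up here, and it assumes the build output was saved to a log file. It does not diagnose the stall itself.

```shell
# Convert a byte count to GiB (2^30 bytes), rounded to two decimals.
bytes_to_gib() {
    LC_ALL=C awk -v b="$1" 'BEGIN { printf "%.2f\n", b / (1024 * 1024 * 1024) }'
}

# Extract the first "Estimated hash table requirement" value from a saved
# build log. The log file path is whatever you redirected the output to.
estimate_from_log() {
    sed -n 's/^Estimated hash table requirement: \([0-9][0-9]*\) bytes$/\1/p' "$1" | head -n 1
}

# Value taken from the log in this post.
bytes_to_gib 261205726352   # prints 243.27
```

The estimate works out to roughly 243 GiB, which is close to the ~250 GB reported as occupied, so the memory in use seems to be about what step 2 predicted rather than a sign of runaway allocation.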

I tried stopping the process and starting it again: it stalls at the same place. I deleted all the files and downloaded everything from scratch: it stalls at the same place.

I'm using Kraken version 2.1.1 installed through conda.

There are several almost identical issues around the web, but they've been left unanswered. I hope this time somebody will come up with a solution, because the software is fantastic and I don't want to give up using it.

Thanks in advance, my saviour!

idhaase commented 3 years ago

Hi,

we have the same problem with almost identical numbers here:

/opt/kraken2/kraken2-build --build --threads 48 --db nt_20210325
Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map already present, skipping map creation.
Estimating required capacity (step 2)...
Estimated hash table requirement: 284862311860 bytes
Capacity estimation complete. [1h20m5.981s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 22 bits reserved for taxid.
Processed 13407782 sequences (68157549570 bp)...

This is the final state, reached after about 7 hours. CPU usage looks like it was doing fine, but no progress for > 2 days.

We also had 38 unmapped accession numbers. Could this cause the problem?

Best, dirk

troublov commented 3 years ago

Hi Dirk, I also have about 40 unmapped accessions. What about your RAM usage: how much does the process occupy? The curious thing is that the process stops at almost the same place. My nt database is from the week before yours, which I think is why the numbers differ a bit.

I actually think the problem is some particular "broken" sequence that ruins the process. As I understand it, the input file for the step where we hit the problem is "seqid2taxid.map". I thought unmapped sequence(s) might be getting into it and corrupting the process, but I couldn't find any proof for my theory: I searched the file for the unmapped accessions and didn't find any. Hm, maybe manually appending all the unmapped accessions with their corresponding taxids to the file would help... who knows?

My other thought was about the RAM preallocated for the database. Perhaps kraken2 automatically reserves some amount (if you don't specify the value explicitly) that turns out to be insufficient and cannot be changed later, so everything freezes. I haven't done anything to check this idea, as I'm pretty new to this kind of thing.
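The search for unmapped accessions in "seqid2taxid.map" described above could be done along these lines. This is a sketch under assumptions, not a description of Kraken2 internals: it assumes the map is tab-separated with the sequence ID in the first column, that "unmapped.txt" holds one accession per line, and that an accession may appear with or without its ".N" version suffix. The function name `count_unmapped_in_map` is made up for illustration.

```shell
# Count map entries whose sequence ID (column 1) is listed in unmapped.txt.
# A map ID matches either exactly or after dropping a trailing ".<version>".
count_unmapped_in_map() {
    awk -F '\t' '
        NR == FNR { bad[$1] = 1; next }                 # first file: unmapped accessions
        { id = $1; base = id; sub(/\.[0-9]+$/, "", base) }
        (id in bad) || (base in bad) { n++ }            # second file: the map
        END { print n + 0 }
    ' "$1" "$2"
}

# Usage, with the database directory from the commands in this thread:
# count_unmapped_in_map nt/unmapped.txt nt/seqid2taxid.map
```

A result of 0 would match the observation that none of the unmapped accessions made it into the map.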

I'm just sharing my ideas with you and everyone in case someone can find a workaround.

Good luck, Iura

troublov commented 3 years ago

Dear @jenniferlu717, @martin-steinegger, @BenLangmead, @dfornika and @DerrickWood, it looks like this is a common issue: https://github.com/DerrickWood/kraken2/issues/423#issuecomment-797523967 Could you advise us on what to do about the problem?

With best regards, Iura

troublov commented 3 years ago

As suggested by @Pavel-Zykin in: https://github.com/DerrickWood/kraken2/issues/315#issuecomment-703097803

The --fast-build flag helps.
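A minimal sketch of the rerun with the workaround, reusing the database name and thread count from the first post. The wrapper function `fast_build` and the `DRY_RUN` variable are hypothetical conveniences and not part of Kraken2; only the `kraken2-build` flags themselves come from the tool. By default the sketch just prints the command, so nothing gets started by accident.

```shell
# Print the build command (DRY_RUN=1, the default here) or execute it.
# Arguments: <database directory> <thread count>
fast_build() {
    db="$1"
    threads="$2"
    set -- kraken2-build --build --fast-build --threads "$threads" --db "$db"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

fast_build nt 20    # prints: kraken2-build --build --fast-build --threads 20 --db nt
# To actually start the build, set DRY_RUN=0 in the environment first.
```

As described in the kraken2-build help text, --fast-build drops the requirement that a multi-threaded build be deterministic, so two builds from the same input may differ slightly.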

Dear devs, if you read this at some point, please fix the reported issue, as lots of people encounter it.

All the best, Iura

fanninpm commented 3 years ago

Please reopen the issue, as using --fast-build is only a workaround and not a proper fix.

idhaase commented 3 years ago

We had suspected that a missing dustmasker could be the reason, but it wasn't. Then we tried again with a fresh library download, but it still looks very familiar:

/opt/kraken2/kraken2-build --build --threads 64 --db nt_20210330
Creating sequence ID to taxonomy ID map (step 1)...
Found 75641493/75641635 targets, searched through 807306475 accession IDs, search complete.
lookup_accession_numbers: 142/75641635 accession numbers remain unmapped, see unmapped.txt in DB directory
Sequence ID to taxonomy ID map complete. [1h36m6.914s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 263154360320 bytes
Capacity estimation complete. [1h18m59.286s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 22 bits reserved for taxid.
Processed 13409037 sequences (68152409894 bp)...

CPUs remain busy and about 250 GB of RAM used, but no progress for days. We will go for fast-build now, but we agree that this is just a workaround.

fanninpm commented 3 years ago

Every time this stalls (not using --fast-build), one of the threads gets stuck in an "interruptible sleep" state while the rest of the threads continue to run.
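For anyone who wants to reproduce this observation, per-thread states can be read from /proc on Linux. The sketch below is an assumption-laden illustration: the helper name `thread_states` is made up, it is Linux-only, and the process name in the usage comment (`build_db`) is my guess at the binary doing step 3, so check it against your own process list. State letters follow the kernel's convention (R running, S interruptible sleep, D uninterruptible sleep).

```shell
# Print "<thread id> <state letter>" for every thread of the given PID.
thread_states() {
    for f in /proc/"$1"/task/*/status; do
        [ -r "$f" ] || continue    # thread may have exited, or /proc is absent
        awk '/^Pid:/ { tid = $2 } /^State:/ { st = $2 } END { print tid, st }' "$f"
    done
}

# Demo on the current shell: count threads per state.
thread_states "$$" | awk '{ c[$2]++ } END { for (s in c) print s, c[s] }'

# For a running build, something like:
# thread_states "$(pgrep -n build_db)"
```

Where procps is installed, `ps -L -o tid,stat,pcpu -p <pid>` should give a similar per-thread view.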

pengtb commented 3 years ago

Maybe the issue is related to multi-threading. Has anyone tried with only 1 thread? Does it work?

douglasgscofield commented 3 years ago

Same issue here on a 512 GB node, attempting to build nt with 16 threads and version 2.1.1. I stopped it after 29 days and am going with --fast-build.

donovan-h-parks commented 2 years ago

@DerrickWood, I'm also experiencing this issue with Kraken 2.1.2 on a machine with 512 GB and 64 CPUs. The build process was run with 64 threads. 62 threads/CPU are working at ~100%, 1 thread is working at ~1%, and the other thread is reporting 0% activity. The build process is only using ~250 GB so memory shouldn't be an issue.

ac-simpson commented 2 years ago

I'm also experiencing this issue - trying to build a database with all available nucleotide databases including nt. It's running on 80 threads, 430G of RAM. 71 threads are at 100%; the other nine appear to be doing nothing. The build has been running for 31 hours, but there have been no changes to the number of processed sequences or the time stamps on the hash files since yesterday.

Creating sequence ID to taxonomy ID map (step 1)...
Found 88061889/88247211 targets, searched through 824337935 accession IDs, search complete.
lookup_accession_numbers: 185322/88247211 accession numbers remain unmapped, see unmapped.txt in DB directory
Sequence ID to taxonomy ID map complete. [32m58.241s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 448529039360 bytes
Capacity estimation complete. [4h9m1.816s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 22 bits reserved for taxid.
Processed 18062390 sequences (222177008859 bp...)

Has anyone figured out the problem? It seems that a common element is the nt database.
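The timestamp check mentioned in this comment can be scripted so that a stall is noticed without watching the terminal. This is only a sketch: the helper name `recently_modified` is made up, and `-mmin` is a GNU/BSD/BusyBox `find` extension rather than strict POSIX.

```shell
# List regular files under a directory that were modified within the
# last N minutes. Arguments: <directory> <minutes>
recently_modified() {
    find "$1" -type f -mmin "-$2"
}

# Usage: an empty result for the database directory over, say, the last
# hour suggests the build is no longer writing anything.
# recently_modified "$DBNAME" 60
```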

lmolokin commented 1 year ago

Same issue here where the build seems to hang with all cores and memory in use.

kraken2-build --build --db $DBNAME --threads 16

Creating sequence ID to taxonomy ID map (step 1)...
Found 108830/111127 targets, searched through 972184251 accession IDs, search complete.
lookup_accession_numbers: 2297/111127 accession numbers remain unmapped, see unmapped.txt in DB directory
Sequence ID to taxonomy ID map complete. [2m33.622s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 159501268112 bytes
Capacity estimation complete. [2h17m23.912s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 16 bits reserved for taxid.
Processed 770068 sequences (67284665935 bp)...

Cloudptj commented 4 months ago

I'm encountering the same issue with Kraken 2.1.3 on an HPC cluster. I requested 75 CPUs and 2625 GB of memory. However, it still got stuck, with no new output in the last 36 hours. The slurm output is:

Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map complete. [11m10.804s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 1032414842880 bytes
Capacity estimation complete. [8h6m23.246s]
Building database files (step 3)...

The slurm error is:

Found 112147524/112739908 targets, searched through 1030945103 accession IDs, search complete.
lookup_accession_numbers: 592384/112739908 accession numbers remain unmapped, see unmapped.txt in DB directory
Taxonomy parsed and converted.
CHT created with 22 bits reserved for taxid.