DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

How long does kraken2-build take for the nr database with 100 threads? #463

Open hdshu opened 3 years ago

hdshu commented 3 years ago

Hi, when I use kraken2 to build the nr database, it reports no errors but seems to have stopped. It has been in this state for more than three days.

My kraken2 version is 2.1.1 and my command is "kraken2-build --build --db . --threads 100", submitted with bsub.

Here is the output:
Creating sequence ID to taxonomy ID map (step 1)...
Found 75781867/75905251 targets, searched through 805976720 accession IDs, search complete.
lookup_accession_numbers: 123384/75905251 accession numbers remain unmapped, see unmapped.txt in DB directory
Sequence ID to taxonomy ID map complete. [18m13.958s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 272905059472 bytes
Capacity estimation complete. [1h27m38.493s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 22 bits reserved for taxid.

For more than three days the program has shown no new output, so it seems to have stopped. Is this because building the nr database simply takes a long time, or has the program stalled without reporting an error?
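Editor's sketch: one thing worth checking in a situation like this is whether the estimated hash table fits in the memory the job is actually allowed to use. The snippet below only converts the "Estimated hash table requirement" figure from the log above into GiB. The helper names are hypothetical, and the idea that the build needs roughly this much RAM available is an assumption, not something confirmed in this thread.

```python
import re


def estimated_hash_bytes(log_text):
    """Return the byte count from the step 2 log line, or None if it is absent."""
    match = re.search(r"Estimated hash table requirement: (\d+) bytes", log_text)
    return int(match.group(1)) if match else None


def bytes_to_gib(n_bytes):
    """Convert a byte count to gibibytes (1 GiB = 2**30 bytes)."""
    return n_bytes / 2**30


# Log fragment copied from the output above.
log = (
    "Estimating required capacity (step 2)... "
    "Estimated hash table requirement: 272905059472 bytes"
)

required = estimated_hash_bytes(log)
print(f"{bytes_to_gib(required):.1f} GiB")  # prints "254.2 GiB"
```

If the node or the scheduler limit offers less memory than this figure, that mismatch would be a plausible place to start looking, though it is only one possible explanation.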

hdshu commented 3 years ago

It's been eight days and the program is still at this step.

jenniferlu717 commented 3 years ago

I don't believe it took that long when I last tested it, but I'll try it again. Something might be wrong with your database build.

hdshu commented 3 years ago

Thank you for your reply. It was still stuck at this step, so I stopped it and am running it again now.

bingjielu commented 3 years ago

Hi, I am also trying to build a database, in my case from all of the RefSeq genomes. I want to update the database shared by Daniel Fischer (https://lomanlab.github.io/mockcommunity/mc_databases.html). The program has been running for almost a month, so I am confused.

My kraken2 version is also 2.1.1 and my command is "kraken2-build --build --db genome --threads 108", submitted with nohup.

Here is the output:
Masking low-complexity regions of new file... done.
Added "genomes.tax.fna" to library (genome)
Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map complete. [14.267s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 210934680136 bytes
Capacity estimation complete. [1h31m52.202s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 16 bits reserved for taxid.

I seem to have a similar problem: the program shows no new output, but the process is still listed as running. What should I do now?

hdshu commented 3 years ago

Hi @bingjielu and @jenniferlu717, I updated my nt database and reran the program locally with the same command. The problem may be solved.

Here is the output:
Creating sequence ID to taxonomy ID map (step 1)...
Found 78013957/78013961 targets, searched through 810607933 accession IDs, search complete.
lookup_accession_numbers: 4/78013961 accession numbers remain unmapped, see unmapped.txt in DB directory
Sequence ID to taxonomy ID map complete. [48m1.117s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 286494781440 bytes
Capacity estimation complete. [2h5m4.242s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 22 bits reserved for taxid.
Processed 1437119 sequences (7577232805 bp)..

Maybe the problem comes from submitting the job with bsub.

bingjielu commented 3 years ago

Thanks for your reply. I will try it. @hdshu

hdshu commented 3 years ago

Hi @bingjielu and @jenniferlu717, new problems have arisen!

My command is:
~/software/kraken2/kraken2-build --build --threads 130 --db /gss1/home/hdshu/kraken_nr
It stopped at this step and no longer processes new sequences. This looks like the same problem as #315 and #428 (https://github.com/DerrickWood/kraken2/issues/428). I'm now trying --fast-build.

The output:
Creating sequence ID to taxonomy ID map (step 1)...
Found 78013957/78013961 targets, searched through 810607933 accession IDs, search complete.
lookup_accession_numbers: 4/78013961 accession numbers remain unmapped, see unmapped.txt in DB directory
Sequence ID to taxonomy ID map complete. [48m1.117s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 286494781440 bytes
Capacity estimation complete. [2h5m4.242s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 22 bits reserved for taxid.
Processed 13392110 sequences (68090807414 bp)...
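Editor's sketch: to tell a slow build from a stalled one, the "Processed N sequences (M bp)" counter in the captured output can be compared between two points in time. The snippet below parses that line as it appears in the logs in this thread. The function names are hypothetical, and it assumes the build output has been redirected to text you can read back, which is not shown in the thread.

```python
import re

# Matches the progress line exactly as it appears in the logs quoted in this thread.
PROGRESS_RE = re.compile(r"Processed (\d+) sequences \((\d+) bp\)")


def last_progress(log_text):
    """Return (sequences, base_pairs) from the last progress line, or None."""
    matches = PROGRESS_RE.findall(log_text)
    if not matches:
        return None
    sequences, base_pairs = matches[-1]
    return int(sequences), int(base_pairs)


def has_advanced(earlier_log, later_log):
    """True if the sequence counter grew between two log snapshots."""
    before = last_progress(earlier_log)
    after = last_progress(later_log)
    if after is None:
        return False
    if before is None:
        return True
    return after[0] > before[0]


# The two progress lines reported in this thread.
first = "Processed 1437119 sequences (7577232805 bp).."
second = "Processed 13392110 sequences (68090807414 bp)..."
print(last_progress(second))
print(has_advanced(first, second))
```

Taking two snapshots some minutes apart and calling `has_advanced` on them gives a rough stall check. A counter that never changes over a long interval matches the symptom described above, but does not by itself explain the cause.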

jenniferlu717 commented 3 years ago

I used far fewer than 100 threads and it took only 4 hours?

andreott commented 3 years ago

Hi, I am facing the same problem: the nr database build has now been running for 5 days on 20 threads. When I did not use the --protein flag for the --download-taxonomy step, the build was fast but all IDs were unmapped. With the --protein flag, the correct ID-to-taxonomy mapping file is used, and now the build runs forever. @jenniferlu717, did you use the correct name mapping file in your test build?
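Editor's sketch: a fast build in which every ID is unmapped can be spotted early from the step 1 output, before comparing build times. The snippet below reads the "lookup_accession_numbers" line in the format shown in the logs earlier in this thread and reports the unmapped fraction. The function name is hypothetical, and what counts as an acceptable fraction is left to the reader.

```python
import re

# Matches the step 1 summary line as it appears in the logs quoted in this thread.
UNMAPPED_RE = re.compile(
    r"lookup_accession_numbers: (\d+)/(\d+) accession numbers remain unmapped"
)


def unmapped_fraction(log_text):
    """Return unmapped/total from the step 1 summary line, or None if absent."""
    match = UNMAPPED_RE.search(log_text)
    if match is None:
        return None
    unmapped, total = int(match.group(1)), int(match.group(2))
    return unmapped / total if total else None


# Summary lines copied from the two hdshu logs above.
first_run = "lookup_accession_numbers: 123384/75905251 accession numbers remain unmapped"
second_run = "lookup_accession_numbers: 4/78013961 accession numbers remain unmapped"

for label, text in (("first run", first_run), ("second run", second_run)):
    print(f"{label}: {unmapped_fraction(text):.6%} unmapped")
```

A fraction close to 1 would correspond to the "all IDs unmapped" case described above, where a quick build is not a meaningful timing reference.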

andreott commented 3 years ago

Hi, an update: after more than 10 days the build process was still running and had to be aborted. So there seems to be some strange issue with the protein mode of kraken2.