DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

NT build stalling #560

Open jhnath21 opened 2 years ago

jhnath21 commented 2 years ago

I am having the same issue as stated in #534 with the NT database build. I used a very large AWS EC2 instance to try to build the NT database (u-6tb1.56xlarge, 224 threads, 6144 GB RAM, running CentOS 7). It gets to a certain point in processing the sequences and then stalls. After it sat for 6 days at the same point (the same number of sequences processed), I terminated the instance due to its cost. I tried several different thread counts and all had the same outcome.

Any suggestions? We would like to use NT instead of RefSeq, since not all organisms have been placed into RefSeq (plus the standard-library downloads produce a lot of errors, as some files listed in the assembly files can't be downloaded). Building the Kraken2 NT database was not an issue until the COVID-19 pandemic, when the number of SARS-CoV-2 isolate sequences exploded the size of the NT database (nt.fasta is now over 600 GB); I previously built the NT db on an r5.24xlarge instance in ~1 day.

Any suggestions on how to get the NT database to finish building or a way to get a db that is similar to NT?

dschnei1 commented 2 years ago

Hello! Same issue here. Actually it crashed the entire server after ~6 days.

kraken2-build --build --db nt --threads 32
Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map already present, skipping map creation.
Estimating required capacity (step 2)...
Estimated hash table requirement: 369344381804 bytes
Capacity estimation complete. [2h4m27.939s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 22 bits reserved for taxid.
Processed 17932232 sequences (89183960569 bp)...

At this point the server also begins to respond slower.

dschnei1 commented 2 years ago

Now it worked for me with --fast-build as mentioned here: https://github.com/DerrickWood/kraken2/issues/315

kraken2-build --build --db nt --threads 32 --fast-build
Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map already present, skipping map creation.
Estimating required capacity (step 2)...
Estimated hash table requirement: 369344381804 bytes
Capacity estimation complete. [2h4m26.832s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 22 bits reserved for taxid.
Completed processing of 78809741 sequences, 646655896621 bp
Writing data to disk... complete.
Database files completed. [22h31m24.186s]
Database construction complete. [Total: 24h35m51.035s]

However, it is still unclear to me how this affects the database. Are there any downsides?

jhnath21 commented 2 years ago

According to the help description: --fast-build: Do not require database to be deterministically built when using multiple threads. This is faster, but does introduce variability in minimizer/LCA pairs. Used with --build and --standard options.

Sounds like the trade-off is that it might affect the ability to find the best LCA. I would be interested in hearing whether anyone has done a side-by-side comparison of database builds with and without the --fast-build option and how it affected their results.

jenniferlu717 commented 2 years ago

I would suggest using the pre-built databases here: https://benlangmead.github.io/aws-indexes/k2

I'm not 100% sure why the build is stalling but it could be due to memory. The nt database is fairly large.

--fast-build may cause false positives, and/or two databases built from the same sequences may give slightly different results.
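For anyone going the pre-built route, here is a minimal download sketch. The k2_&lt;collection&gt;_&lt;YYYYMMDD&gt;.tar.gz naming pattern, the S3 URL, and the sample date are assumptions; verify the exact file names on the index page itself before downloading.

```shell
#!/usr/bin/env bash
# Sketch: fetch a pre-built Kraken2 database from the index page above.
# Build the tarball name from a collection name and a snapshot date.
k2_tarball_name() {
  local collection="$1" date="$2"
  echo "k2_${collection}_${date}.tar.gz"
}

tarball="$(k2_tarball_name standard 20210517)"
echo "would fetch: ${tarball}"
# Hypothetical usage (uncomment after checking the real name on the page):
# wget "https://genome-idx.s3.amazonaws.com/kraken/${tarball}"
# mkdir -p k2_standard && tar -xzf "${tarball}" -C k2_standard
# kraken2 --db k2_standard --threads 32 reads.fastq > out.kraken
```

This sidesteps the multi-day local build entirely, at the cost of being tied to the snapshot dates offered on the index page.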

NeuerLiu2020 commented 2 years ago

Hello, first of all, thank you for your contribution to species annotation. I used the pre-built PlusPFP database (1/27/2021) and found that less than 9% of my sequences were annotated.

ac-simpson commented 2 years ago

I just commented on an older thread about this same issue. I'll put my comment again here.


I'm also experiencing this issue - trying to build a database with all available nucleotide databases including nt. It's running on 80 threads, 430G of RAM. 71 threads are at 100%; the other nine appear to be doing nothing. The build has been running for 31 hours, but there have been no changes to the number of processed sequences or the time stamps on the hash files since yesterday.


Creating sequence ID to taxonomy ID map (step 1)...
Found 88061889/88247211 targets, searched through 824337935 accession IDs, search complete.
lookup_accession_numbers: 185322/88247211 accession numbers remain unmapped, see unmapped.txt in DB directory
Sequence ID to taxonomy ID map complete. [32m58.241s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 448529039360 bytes
Capacity estimation complete. [4h9m1.816s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 22 bits reserved for taxid.
Processed 18062390 sequences (222177008859 bp)...

Has anyone figured out the problem? It seems that a common element is the nt database.
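One way to confirm a build like this has truly stalled (rather than just being slow) is to check whether anything in the DB directory is still being written, as the unchanged hash-file timestamps above suggest. A small sketch; the helper name, the DB path, and the 30-minute window are assumptions, not anything from kraken2 itself:

```shell
#!/usr/bin/env bash
# Sketch: report whether a kraken2-build run is still writing output files.
# A build that has written nothing for a long window is likely stalled.
build_activity() {
  local db_dir="$1" window_min="${2:-30}"
  # -mmin -N matches files modified within the last N minutes;
  # -print -quit stops at the first match, so this is cheap on big dirs.
  if [ -n "$(find "$db_dir" -type f -mmin -"$window_min" -print -quit)" ]; then
    echo "active"
  else
    echo "stalled"
  fi
}

# Hypothetical usage: build_activity /path/to/nt_db 30
```

Checking file mtimes is more reliable than watching CPU usage here, since the reports in this thread show threads spinning at 100% while no output is produced.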

NeuerLiu2020 commented 2 years ago

Hello! I used 100 threads on data with 10G sequencing depth to annotate species, and it took 7 minutes, so your run time seems much longer than I would expect. You could check the annotation process. If you also want to use kraken2, you can try rebuilding the database to resolve it.

Lelouchzhu commented 2 years ago

I have encountered the same issue building the nt database with the latest kraken2, 2.1.2. At first I thought it was a RAM issue, so I changed --max-db-size, but it made no difference. In the meantime, I found that installing kraken2 through conda was limited to version 2.0.7, but using that version solved the problem...

Still wondering why....
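If the conda route works for you, one way to pin the older build and sanity-check which binary is on PATH before kicking off a multi-day run. The env name and channel list are assumptions; the version-string format is what `kraken2 --version` is expected to print, so treat the parsing as a sketch:

```shell
#!/usr/bin/env bash
# Sketch: create an env pinned to kraken2 2.0.7, assuming bioconda hosts it.
# conda create -y -n kraken2-207 -c conda-forge -c bioconda "kraken2=2.0.7"
# conda activate kraken2-207

# Guard: `kraken2 --version` prints a first line like "Kraken version 2.0.7";
# bail out early if a different version (or nothing) is first on PATH.
want="2.0.7"
have="$(kraken2 --version 2>/dev/null | awk 'NR==1{print $3}')"
if [ "$have" = "$want" ]; then
  echo "version ok"
else
  echo "expected ${want}, found ${have:-none}" >&2
fi
```

A guard like this is cheap insurance given that, per this thread, the failure mode of the newer versions only shows up hours into the build.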

jhnath21 commented 2 years ago

So I continue to have the database build issue with kraken2 v2.1.2 for NT. I have been doing some testing, and so far I have found that if I download all the database files using v2.1.2 and then switch to kraken2 v2.0.7 for the build, I can get the database built in ~16 hrs on an r5.24xlarge AWS instance using all 96 threads, without needing any extra flags. I am in the process of testing the other versions (2.0.8, 2.0.9, 2.1.0, and 2.1.1) to see which version I can upgrade to before the database building fails again.

Hope this helps:

#632 #619

jhnath21 commented 2 years ago

So, after testing the other versions (2.0.8, 2.0.9, 2.1.0, and 2.1.1) as mentioned above, only versions 2.0.7, 2.0.8, and 2.0.9 will successfully build NT. v2.1.0 and v2.1.1 freeze just like v2.1.2 does.

Hope this helps:

#632 #619

dan-ward-bio commented 1 year ago

Same issue as above on a 1 TB RAM, 96-thread server with Kraken2 v2.1.2.

Thanks @jhnath21 for your hack.