DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

Issues while building a custom database with kraken2-build - extremely small `hash.k2d` file #391

Open pgrzesik opened 3 years ago

pgrzesik commented 3 years ago

Hello,

thank you for all your work on kraken2 :bow:

Kraken version: 2.1.1

I'm currently struggling with an issue where the database produced by kraken2-build is very small and appears to be empty: hash.k2d is only ~47.2 KB. I followed the wiki manual (https://github.com/DerrickWood/kraken2/wiki/Manual) while troubleshooting, but no luck so far. Here are the specific commands I'm using:

kraken2-build --download-taxonomy --db $DBNAME
kraken2-build --download-library archaea --db $DBNAME
kraken2-build --download-library viral --db $DBNAME
kraken2-build --download-library bacteria --db $DBNAME
kraken2-build --build --threads 8 --db $DBNAME

Unfortunately, the resulting database seems to be empty or incorrect. Here's the output from kraken2-inspect --db $DBNAME:

# Database options: nucleotide db, k = 35, l = 31
# Spaced mask = 11111111111111111111111111111111110011001100110011001100110011
# Toggle mask = 1110001101111110001010001100010000100111000110110101101000101101
# Total taxonomy nodes: 29772
# Table size: 0
# Table capacity: 12068
# Min clear hash value = 0
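The telling line in this header is "Table size: 0", which means no minimizers were stored in the hash table. As a rough sketch (the function names `table_size` and `db_is_empty` are made up for illustration and are not part of Kraken 2), the check could be scripted by reading the header text on stdin:

```shell
# Reads kraken2-inspect header text on stdin and prints the "Table size" value.
# Exits non-zero if no such line is present.
table_size() {
    awk -F': *' '/^# Table size:/ { print $2; found = 1 } END { exit !found }'
}

# Succeeds (exit 0) when the header on stdin reports an empty hash table,
# returns 1 when the table is non-empty, and 2 when no "Table size" line exists.
db_is_empty() {
    size=$(table_size) || return 2
    [ "$size" -eq 0 ]
}
```

One possible use, assuming kraken2-inspect is on the PATH and that `--skip-counts` prints only the summary header: `kraken2-inspect --db "$DBNAME" --skip-counts | db_is_empty && echo "hash table is empty"`.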

There were no errors during the process, and I'm confused as to what might be causing this.

My ultimate goal is to build two smaller databases, one with --max-db-size 6000000000 and one with --max-db-size 2000000000, but the issue above is blocking me.

Does anyone have an idea what might be causing this? Alternatively, has anyone successfully built such smaller databases and could share them? (The original Kraken had a 4 GB version of the database; Kraken 2 does not :()

mihkelvaher commented 3 years ago

Is the 8 GB capped standard database still too large? https://benlangmead.github.io/aws-indexes/k2 Have you tried building a database from only one library (for example viral)? Do the downloaded sequences in $DBNAME/library/*/library.fna look OK?

pgrzesik commented 3 years ago

Hello @mihkelvaher, thanks for responding. Unfortunately, it's a bit too large: I'm trying to run kraken2 on a Jetson Xavier NX and a Jetson Nano, which have 8 GB and 4 GB of memory, respectively. I've tried building the database with only one library, but hash.k2d is never populated. It always stays the same (empty), even though the sequences are (seemingly) processed correctly.

The downloaded sequences look okay as well.

PeterCx commented 3 years ago

Hi there,

I believe I am having the same issue.

I have built the GTDB database, but the output hash.k2d is incorrect. As in @pgrzesik's post, when I inspect it I get almost the same output as shown above. The database also takes only 5 minutes to build. Size and memory are not an issue for me, though.

I do not get any errors when building either. Any help is greatly appreciated. Thanks

pgrzesik commented 3 years ago

Hey @PeterCx :wave: I've managed to bypass the issue by skipping masking during library downloads. You might want to try downloading the libraries with the --no-masking flag and, optionally, running masking separately.
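A minimal sketch of that workaround, using the same libraries as the original commands. The `download_unmasked` helper and the `KRAKEN2_BUILD` override variable are hypothetical names introduced only for this example. The optional separate masking step is not shown.

```shell
# Assumption: kraken2-build is on PATH. KRAKEN2_BUILD is a hypothetical
# override for this sketch, not a variable that Kraken 2 itself reads.
KRAKEN2_BUILD="${KRAKEN2_BUILD:-kraken2-build}"

# Downloads each named library into the given database directory with
# masking disabled; stops at the first download that fails.
download_unmasked() {
    db="$1"
    shift
    for lib in "$@"; do
        "$KRAKEN2_BUILD" --download-library "$lib" --db "$db" --no-masking || return 1
    done
}
```

For example: `download_unmasked "$DBNAME" archaea viral bacteria`, followed by the usual `kraken2-build --build --threads 8 --db $DBNAME`.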

donovan-h-parks commented 3 years ago

Hi. I'm running Kraken 2.1.2 and am having the same issue. Specifically, if I run:

kraken2-build --download-library viral --db viral

The resulting library.fna.masked file is empty, while library.fna is 451 MB in size. So it appears that masking silently failed and produced an empty file.
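Because the failure is silent, it may be worth scanning for this symptom before running the build step. The following is a sketch only: `find_empty_masked` is a made-up helper name, and it assumes the $DBNAME/library/<name>/library.fna layout mentioned earlier in this thread.

```shell
# Prints the path of every library.fna.masked that is empty even though the
# library.fna next to it is not. Assumes the <db>/library/<name>/ layout.
find_empty_masked() {
    db="$1"
    for fna in "$db"/library/*/library.fna; do
        [ -s "$fna" ] || continue            # skip missing or empty inputs
        masked="$fna.masked"
        if [ -e "$masked" ] && [ ! -s "$masked" ]; then
            printf '%s\n' "$masked"
        fi
    done
}
```

Running `find_empty_masked "$DBNAME"` prints nothing when every masked file has content. Any path it does print points at a library whose masking step likely failed.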