fbreitwieser / krakenuniq

🐙 KrakenUniq: Metagenomics classifier with unique k-mer counting for more specific results
GNU General Public License v3.0

kmer found in sequence w/ taxid xxxx that is not in database #44

Closed housw closed 5 years ago

housw commented 5 years ago

Hi Florian,

I'm building a microbial_nt database with the command krakenuniq-build --db microbial_nt --kmer-len 31 --threads 40 --taxids-for-genomes --taxids-for-sequences, and I'm getting a flood of the following warning messages on screen:

kmer found in sequence w/ taxid
1002697128kmer found in sequence w/ taxid 10026971281002697128
kmer found in sequence w/ taxid  that is not in database1002697128 that is not in database that is not in database1002697128 that is not in databasekmer found in sequence w/ taxid 1002697128

kmer found in sequence w/ taxid 1002697128 that is not in database

kmer found in sequence w/ taxid  that is not in database that is not in database
kmer found in sequence w/ taxid kmer found in sequence w/ taxid 10026971281002697128
kmer found in sequence w/ taxid  that is not in database1002697128 that is not in database that is not in database
1002697128
kmer found in sequence w/ taxid kmer found in sequence w/ taxid 1002697128 that is not in database
kmer found in sequence w/ taxid 10026971281002697128 that is not in database that is not in database

kmer found in sequence w/ taxid 10026971281002697128kmer found in sequence w/ taxid  that is not in databasekmer found in sequence w/ taxid
 that is not in database1002697128kmer found in sequence w/ taxid kmer found in sequence w/ taxid kmer found in sequence w/ taxid 1002697128 that is not in database that is not in database

10026971281002697128
 that is not in databasekmer found in sequence w/ taxid  that is not in databasekmer found in sequence w/ taxid
 that is not in databasekmer found in sequence w/ taxid 1002697128
kmer found in sequence w/ taxid 1002697128 that is not in database1002697128
10026971281002697128 that is not in database
 that is not in databasekmer found in sequence w/ taxid  that is not in databasekmer found in sequence w/ taxid
 that is not in database
1002697128kmer found in sequence w/ taxid  that is not in database

kmer found in sequence w/ taxid kmer found in sequence w/ taxid 100269712810026971281002697128kmer found in sequence w/ taxid kmer found in sequence w/ taxid  that is not in database
1002697128 that is not in database1002697128
kmer found in sequence w/ taxid  that is not in database that is not in database that is not in databasekmer found in sequence w/ taxid

kmer found in sequence w/ taxid 10026971281002697128 that is not in database

Is this normal, or am I doing something wrong?

Thanks a lot, Shengwei

fbreitwieser commented 5 years ago

Hi @housw, are you using the DB shrinking option? If so, that is what causes those warnings, and I'll fix it.

housw commented 5 years ago

Hi Florian,

No, I'm not. I first downloaded microbial_nt with the command krakenuniq-download --db microbial_nt --dust microbial-nt, then tried to build the database with the command krakenuniq-build --db microbial_nt --kmer-len 31 --threads 40 --taxids-for-genomes --taxids-for-sequences.

However, when I tried to reproduce the error on another server, I did not get those warnings, at least so far. I'm afraid this is related to the configuration of our server. I will keep investigating and keep you updated.

Here is the standard output from the other server; it looks normal:

microbial_nt/taxonomy/nodes.dmp                    check [133.24 MB]
Extracting names file [tar -C microbial_nt/taxonomy -zxvf microbial_nt/taxonomy/taxdump.tar.gz names.dmp 1>&2] ...names.dmp
 done (took 3s)
microbial_nt/taxonomy/names.dmp                    check [167.97 MB]
: downloading ... gunzipping ... done.
: downloading ... done.
: downloading ... done.
Reading headers from nt file ... Got 50433364 ACs (took 23m13s).
Reading taxonomy tree from microbial_nt/taxonomy/nodes.dmp ... Got 149093 nodes (took 4s).
Reading AC to taxonomy ID mapping from microbial_nt/taxonomy/nucl_gb.accession2taxid.gz ... Done (took 7m31s).
Reading AC to taxonomy ID mapping from microbial_nt/taxonomy/nucl_wgs.accession2taxid.gz ... Done (took 12m46s).
Got mappings for 744310 taxa.
Writing microbial_nt/library/nt-bacteria.fna ... Done, wrote 6970922 sequences for 489772 taxa (took 51m53s).
microbial_nt/library/nt-bacteria.fna               check [62.53 GB]
Masking low-complexity sequences [dustmasker -infmt fasta -in microbial_nt/library/nt-bacteria.fna -level 20 -outfmt fasta | sed '/^>/! s/[^AGCT]/N/g' > microbial_nt/library/nt-bacteria-dustmasked.fna.tmp && mv microbial_nt/library/nt-bacteria-dustmasked.fna.tmp microbial_nt/library/nt-bacteria-dustmasked.fna] ... done (took 2h9m42s)
microbial_nt/library/nt-bacteria-dustmasked.fna    check [62.62 GB]
Writing microbial_nt/library/nt-archaea.fna ... Done, wrote 350346 sequences for 13141 taxa (took 1m0s).
microbial_nt/library/nt-archaea.fna                check [1.13 GB]
Masking low-complexity sequences [dustmasker -infmt fasta -in microbial_nt/library/nt-archaea.fna -level 20 -outfmt fasta | sed '/^>/! s/[^AGCT]/N/g' > microbial_nt/library/nt-archaea-dustmasked.fna.tmp && mv microbial_nt/library/nt-archaea-dustmasked.fna.tmp microbial_nt/library/nt-archaea-dustmasked.fna] ... done (took 2m52s)
microbial_nt/library/nt-archaea-dustmasked.fna     check [1.12 GB]
Writing microbial_nt/library/nt-viral.fna ... Done, wrote 2306462 sequences for 195811 taxa (took 7m13s).
microbial_nt/library/nt-viral.fna                  check [4.50 GB]
Masking low-complexity sequences [dustmasker -infmt fasta -in microbial_nt/library/nt-viral.fna -level 20 -outfmt fasta | sed '/^>/! s/[^AGCT]/N/g' > microbial_nt/library/nt-viral-dustmasked.fna.tmp && mv microbial_nt/library/nt-viral-dustmasked.fna.tmp microbial_nt/library/nt-viral-dustmasked.fna] ... done (took 12m11s)
microbial_nt/library/nt-viral-dustmasked.fna       check [4.47 GB]
Writing microbial_nt/library/nt-fungi.fna ... Done, wrote 4420038 sequences for 157482 taxa (took 6m46s).
microbial_nt/library/nt-fungi.fna                  check [10.29 GB]
Masking low-complexity sequences [dustmasker -infmt fasta -in microbial_nt/library/nt-fungi.fna -level 20 -outfmt fasta | sed '/^>/! s/[^AGCT]/N/g' > microbial_nt/library/nt-fungi-dustmasked.fna.tmp && mv microbial_nt/library/nt-fungi-dustmasked.fna.tmp microbial_nt/library/nt-fungi-dustmasked.fna] ... done (took 29m28s)
microbial_nt/library/nt-fungi-dustmasked.fna       check [10.26 GB]
Writing microbial_nt/library/nt-protozoa.fna ... Done, wrote 1608541 sequences for 58189 taxa (took 3m5s).
microbial_nt/library/nt-protozoa.fna               check [4.38 GB]
Masking low-complexity sequences [dustmasker -infmt fasta -in microbial_nt/library/nt-protozoa.fna -level 20 -outfmt fasta | sed '/^>/! s/[^AGCT]/N/g' > microbial_nt/library/nt-protozoa-dustmasked.fna.tmp && mv microbial_nt/library/nt-protozoa-dustmasked.fna.tmp microbial_nt/library/nt-protozoa-dustmasked.fna] ... done (took 26m54s)
microbial_nt/library/nt-protozoa-dustmasked.fna    check [4.38 GB]
krakenuniq-build --db microbial_nt --kmer-len 31 --threads 40 --taxids-for-genomes --taxids-for-sequences
Found jellyfish v1.1.11
Kraken build set to minimize disk writes.
Finding all library files
Found 10 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Creating k-mer set (step 1 of 6)...
Using jellyfish
Hash size not specified, using '31603295490'
K-mer set created. [4h43m3.044s]
Skipping step 2, no database reduction requested.
Sorting k-mer set (step 3 of 6)...
db_sort: Getting database into memory ...Loaded database with 28028721239 keys with k of 31 [val_len 4, key_len 8].
Loaded database with 28028721239 keys with k of 31 [val_len 4, key_len 8].
db_sort: Sorting ...db_sort: Sorting complete - writing database to disk ...
K-mer set sorted. [9h6m14.867s]
Creating seqID to taxID map (step 4 of 6)..
15656309 sequences mapped to taxa. [4.443s]
Creating taxDB (step 5 of 6)...
Building taxonomy index from taxonomy//nodes.dmp and taxonomy//names.dmp. Done, got 2054908 taxa
taxDB construction finished. [21.337s]
Building  KrakenUniq LCA database (step 6 of 6)...
 Adding taxonomy IDs for sequences
 Adding taxonomy IDs for genomes
Reading taxonomy index from taxDB. Done.
Getting database0.kdb into memory (313.245 GB) ... Done
Loaded database with 28028721239 keys with k of 31 [val_len 4, key_len 8].
Reading sequence ID to taxonomy ID mapping ... [starting new taxonomy IDs with 1000000001] got 15656309 mappings.

Best, Shengwei
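
One observation that may help when reading the warnings: the build log above reports "starting new taxonomy IDs with 1000000001", and the taxid in the warnings (1002697128) lies above that value. The sketch below only does that arithmetic. The function name is hypothetical, and it assumes that the first server used the same starting value as the log shown here, which the thread does not confirm.

```python
# Starting value copied from the build log line
# "[starting new taxonomy IDs with 1000000001]".
# Assumption: the server that printed the warnings used the same value.
FIRST_NEW_TAXID = 1_000_000_001


def is_newly_assigned(taxid, first_new=FIRST_NEW_TAXID):
    """Return True if taxid falls in the range of IDs assigned at build time."""
    return taxid >= first_new


warned_taxid = 1002697128  # taxid repeated in the warning messages
offset = warned_taxid - FIRST_NEW_TAXID

print(is_newly_assigned(warned_taxid))  # True
print(offset)                           # 2697127
```

If that assumption holds, the warned taxid is likely one of the IDs created by --taxids-for-genomes or --taxids-for-sequences rather than an NCBI taxid, but this is an inference and not something the log states.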

fbreitwieser commented 5 years ago

Hi Shengwei, I think I see why this happens. Can you post the contents of the microbial_nt/library folder? The non-dustmasked files may not have been removed when building the nt database.

housw commented 5 years ago

Hi Florian,

Yes, you're right. Here are the files inside microbial_nt/library:

library $ ls -1v
nt-archaea.fna
nt-archaea.fna.map
nt-archaea-dustmasked.fna
nt-bacteria.fna
nt-bacteria.fna.map
nt-bacteria-dustmasked.fna
nt-fungi.fna
nt-fungi.fna.map
nt-fungi-dustmasked.fna
nt-protozoa.fna
nt-protozoa.fna.map
nt-protozoa-dustmasked.fna
nt-viral.fna
nt-viral.fna.map
nt-viral-dustmasked.fna

So should I remove all the non-dustmasked fasta and map files, and rebuild?

Best, Shengwei

fbreitwieser commented 5 years ago

Hi Shengwei, yes please do that. Remove the non-dustmasked files and remove all the files in the top level db directory (rm microbial_nt/*), and call the build command again. You can leave the issue open - I'm fixing the download script.
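
The cleanup described above can be sketched as a small dry-run script. This is only an illustration, not part of KrakenUniq: the function names are hypothetical, and it assumes that "non-dustmasked files" means each X.fna that has an X-dustmasked.fna sibling. The .fna.map files are deliberately left alone, since the thread does not settle whether they should go. Note that rm microbial_nt/* without -r removes only regular files in the top-level directory, so the library/ and taxonomy/ subdirectories stay in place; the second function mirrors that.

```python
from pathlib import Path

DUST_SUFFIX = "-dustmasked.fna"


def unmasked_duplicates(library_dir):
    """Return X.fna files that have an X-dustmasked.fna sibling."""
    library = Path(library_dir)
    duplicates = []
    for masked in library.glob("*" + DUST_SUFFIX):
        # nt-viral-dustmasked.fna -> nt-viral.fna
        original = masked.with_name(masked.name[: -len(DUST_SUFFIX)] + ".fna")
        if original.is_file():
            duplicates.append(original)
    return sorted(duplicates)


def top_level_files(db_dir):
    """Return regular, non-hidden files directly inside db_dir.

    Roughly what `rm db_dir/*` would delete: subdirectories such as
    library/ and taxonomy/ are not included.
    """
    return sorted(
        p for p in Path(db_dir).iterdir()
        if p.is_file() and not p.name.startswith(".")
    )


def clean(db_dir, dry_run=True):
    """List (and, if dry_run is False, delete) the files to clean up."""
    db = Path(db_dir)
    targets = unmasked_duplicates(db / "library") + top_level_files(db)
    for path in targets:
        print(("would remove " if dry_run else "removing ") + str(path))
        if not dry_run:
            path.unlink()
    return targets
```

Calling clean("microbial_nt") only prints what would be removed; passing dry_run=False actually deletes, so it is worth checking the dry-run list first. After that, the build command can be run again.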

housw commented 5 years ago

Hi Florian,

That's great, thanks a lot for your kind help! I'll let you close it when you've fixed it.

Cheers, Shengwei

fbreitwieser commented 5 years ago

Should be fixed now

jarrodscott commented 4 years ago

For the record, I posted a similar issue on the #37 thread. It looks like library/ still contains the problem files. Should I be concerned about this, or just erase them and continue?

archaea/
bacteria/
fungi/
nt-archaea-dustmasked.fna
nt-archaea.fna
nt-archaea.fna.map
nt-bacteria-dustmasked.fna
nt-bacteria.fna
nt-bacteria.fna.map
nt-fungi-dustmasked.fna
nt-fungi.fna
nt-fungi.fna.map
nt-protozoa-dustmasked.fna
nt-protozoa.fna
nt-protozoa.fna.map
nt-viral-dustmasked.fna
nt-viral.fna
nt-viral.fna.map
protozoa/
viral/

jarrodscott commented 4 years ago

Hi @housw

Could you please tell me what other files you removed in addition to all the non-dustmasked fasta and map files? My directory structure is a little different, and I do not know what @fbreitwieser means by "remove all the files in the top level db directory". If I do that there will be nothing left :)

MinLuke commented 1 year ago

I am experiencing the same issue (i.e., hundreds of "kmer found in sequence w/ taxid that is not in database" warnings), but I do not have any non-dustmasked .fna files in my library. Here is my library folder:

nt-archaea-dustmasked.fna
nt-archaea.fna.map
nt-bacteria-dustmasked.fna
nt-bacteria.fna.map
nt-fungi-dustmasked.fna
nt-fungi.fna.map
nt-parasitic_worms-dustmasked.fna
nt-parasitic_worms.fna.map
nt-protozoa-dustmasked.fna
nt-protozoa.fna.map
nt-viral-dustmasked.fna
nt-viral.fna.map

So why did this happen?