housw closed this issue 5 years ago
Hi @housw, are you using the DB shrinking option? If so, that is what causes those warnings, and I'll fix it.
Hi Florian,
no, I'm not. I first downloaded microbial_nt using the command: krakenuniq-download --db microbial_nt --dust microbial-nt
Then I tried to build the database using the command: krakenuniq-build --db microbial_nt --kmer-len 31 --threads 40 --taxids-for-genomes --taxids-for-sequences
However, when I tried to reproduce the error on another server, I did not get those warnings. I'm afraid this is related to the configuration of our server. I will keep investigating and keep you updated.
Here is the standard output from the other server; it looks normal:
microbial_nt/taxonomy/nodes.dmp check [133.24 MB]
Extracting names file [tar -C microbial_nt/taxonomy -zxvf microbial_nt/taxonomy/taxdump.tar.gz names.dmp 1>&2] ...names.dmp
done (took 3s)
microbial_nt/taxonomy/names.dmp check [167.97 MB]
: downloading ... gunzipping ... done.
: downloading ... done.
: downloading ... done.
Reading headers from nt file ... Got 50433364 ACs (took 23m13s).
Reading taxonomy tree from microbial_nt/taxonomy/nodes.dmp ... Got 149093 nodes (took 4s).
Reading AC to taxonomy ID mapping from microbial_nt/taxonomy/nucl_gb.accession2taxid.gz ... Done (took 7m31s).
Reading AC to taxonomy ID mapping from microbial_nt/taxonomy/nucl_wgs.accession2taxid.gz ... Done (took 12m46s).
Got mappings for 744310 taxa.
Writing microbial_nt/library/nt-bacteria.fna ... Done, wrote 6970922 sequences for 489772 taxa (took 51m53s).
microbial_nt/library/nt-bacteria.fna check [62.53 GB]
Masking low-complexity sequences [dustmasker -infmt fasta -in microbial_nt/library/nt-bacteria.fna -level 20 -outfmt fasta | sed '/^>/! s/[^AGCT]/N/g' > microbial_nt/library/nt-bacteria-dustmasked.fna.tmp && mv microbial_nt/library/nt-bacteria-dustmasked.fna.tmp microbial_nt/library/nt-bacteria-dustmasked.fna] ... done (took 2h9m42s)
microbial_nt/library/nt-bacteria-dustmasked.fna check [62.62 GB]
Writing microbial_nt/library/nt-archaea.fna ... Done, wrote 350346 sequences for 13141 taxa (took 1m0s).
microbial_nt/library/nt-archaea.fna check [1.13 GB]
Masking low-complexity sequences [dustmasker -infmt fasta -in microbial_nt/library/nt-archaea.fna -level 20 -outfmt fasta | sed '/^>/! s/[^AGCT]/N/g' > microbial_nt/library/nt-archaea-dustmasked.fna.tmp && mv microbial_nt/library/nt-archaea-dustmasked.fna.tmp microbial_nt/library/nt-archaea-dustmasked.fna] ... done (took 2m52s)
microbial_nt/library/nt-archaea-dustmasked.fna check [1.12 GB]
Writing microbial_nt/library/nt-viral.fna ... Done, wrote 2306462 sequences for 195811 taxa (took 7m13s).
microbial_nt/library/nt-viral.fna check [4.50 GB]
Masking low-complexity sequences [dustmasker -infmt fasta -in microbial_nt/library/nt-viral.fna -level 20 -outfmt fasta | sed '/^>/! s/[^AGCT]/N/g' > microbial_nt/library/nt-viral-dustmasked.fna.tmp && mv microbial_nt/library/nt-viral-dustmasked.fna.tmp microbial_nt/library/nt-viral-dustmasked.fna] ... done (took 12m11s)
microbial_nt/library/nt-viral-dustmasked.fna check [4.47 GB]
Writing microbial_nt/library/nt-fungi.fna ... Done, wrote 4420038 sequences for 157482 taxa (took 6m46s).
microbial_nt/library/nt-fungi.fna check [10.29 GB]
Masking low-complexity sequences [dustmasker -infmt fasta -in microbial_nt/library/nt-fungi.fna -level 20 -outfmt fasta | sed '/^>/! s/[^AGCT]/N/g' > microbial_nt/library/nt-fungi-dustmasked.fna.tmp && mv microbial_nt/library/nt-fungi-dustmasked.fna.tmp microbial_nt/library/nt-fungi-dustmasked.fna] ... done (took 29m28s)
microbial_nt/library/nt-fungi-dustmasked.fna check [10.26 GB]
Writing microbial_nt/library/nt-protozoa.fna ... Done, wrote 1608541 sequences for 58189 taxa (took 3m5s).
microbial_nt/library/nt-protozoa.fna check [4.38 GB]
Masking low-complexity sequences [dustmasker -infmt fasta -in microbial_nt/library/nt-protozoa.fna -level 20 -outfmt fasta | sed '/^>/! s/[^AGCT]/N/g' > microbial_nt/library/nt-protozoa-dustmasked.fna.tmp && mv microbial_nt/library/nt-protozoa-dustmasked.fna.tmp microbial_nt/library/nt-protozoa-dustmasked.fna] ... done (took 26m54s)
microbial_nt/library/nt-protozoa-dustmasked.fna check [4.38 GB]
krakenuniq-build --db microbial_nt --kmer-len 31 --threads 40 --taxids-for-genomes --taxids-for-sequences
Found jellyfish v1.1.11
Kraken build set to minimize disk writes.
Finding all library files
Found 10 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Creating k-mer set (step 1 of 6)...
Using jellyfish
Hash size not specified, using '31603295490'
K-mer set created. [4h43m3.044s]
Skipping step 2, no database reduction requested.
Sorting k-mer set (step 3 of 6)...
db_sort: Getting database into memory ...Loaded database with 28028721239 keys with k of 31 [val_len 4, key_len 8].
Loaded database with 28028721239 keys with k of 31 [val_len 4, key_len 8].
db_sort: Sorting ...db_sort: Sorting complete - writing database to disk ...
K-mer set sorted. [9h6m14.867s]
Creating seqID to taxID map (step 4 of 6)..
15656309 sequences mapped to taxa. [4.443s]
Creating taxDB (step 5 of 6)...
Building taxonomy index from taxonomy//nodes.dmp and taxonomy//names.dmp. Done, got 2054908 taxa
taxDB construction finished. [21.337s]
Building KrakenUniq LCA database (step 6 of 6)...
Adding taxonomy IDs for sequences
Adding taxonomy IDs for genomes
Reading taxonomy index from taxDB. Done.
Getting database0.kdb into memory (313.245 GB) ... Done
Loaded database with 28028721239 keys with k of 31 [val_len 4, key_len 8].
Reading sequence ID to taxonomy ID mapping ... [starting new taxonomy IDs with 1000000001] got 15656309 mappings.
Best, Shengwei
Hi Shengwei, I think I see why this happens. Can you post the contents of the microbial-nt/library folder? The non-dustmasked files may not have been removed when building the nt database.
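One way to check for the situation described above is to look for unmasked FASTA files that sit next to a dustmasked copy. The following is a minimal sketch, not part of KrakenUniq: the function name find_unmasked_duplicates is made up for illustration, and it assumes the naming seen in the download log (NAME.fna and NAME-dustmasked.fna in the same folder).

```shell
# Sketch (hypothetical helper): print every unmasked FASTA file in a library
# folder that has a dustmasked counterpart next to it.
# Assumes the NAME.fna / NAME-dustmasked.fna naming from the download log.
find_unmasked_duplicates() (
    lib="$1"
    for masked in "$lib"/*-dustmasked.fna; do
        # If the glob matched nothing, the literal pattern is left; skip it.
        [ -e "$masked" ] || continue
        # Strip the "-dustmasked.fna" suffix to get the unmasked file name.
        raw="${masked%-dustmasked.fna}.fna"
        if [ -f "$raw" ]; then
            printf '%s\n' "$raw"
        fi
    done
)

# Example call:
#   find_unmasked_duplicates microbial_nt/library
```

Any path it prints is a sequence file that the build would pick up in addition to its masked copy, since the build log reports matching every *.fna file in the library directory.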
Hi Florian,
yes, you're right. Here are the files inside microbial_nt/library:
library $ ls -1v
nt-archaea.fna
nt-archaea.fna.map
nt-archaea-dustmasked.fna
nt-bacteria.fna
nt-bacteria.fna.map
nt-bacteria-dustmasked.fna
nt-fungi.fna
nt-fungi.fna.map
nt-fungi-dustmasked.fna
nt-protozoa.fna
nt-protozoa.fna.map
nt-protozoa-dustmasked.fna
nt-viral.fna
nt-viral.fna.map
nt-viral-dustmasked.fna
So should I remove all the non-dustmasked fasta and map files, and rebuild it?
Best, Shengwei
Hi Shengwei, yes, please do that. Remove the non-dustmasked files, remove all the files in the top-level db directory (rm microbial_nt/*), and call the build command again. You can leave the issue open - I'm fixing the download script.
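The cleanup described above could be scripted roughly as follows. This is a sketch under stated assumptions, not an official KrakenUniq tool: the function name clean_db is hypothetical, and it only removes an unmasked .fna file when a dustmasked copy exists beside it. Leaving the .fna.map files in place is also an assumption; the thread does not settle whether they have to go. The second step mirrors rm microbial_nt/* by deleting only regular files directly inside the database directory, so the library and taxonomy subfolders stay.

```shell
# Sketch (hypothetical helper): prepare a database directory for a rebuild.
clean_db() (
    db="$1"
    # Step 1: drop unmasked FASTA files that have a dustmasked counterpart.
    for masked in "$db"/library/*-dustmasked.fna; do
        [ -e "$masked" ] || continue
        raw="${masked%-dustmasked.fna}.fna"
        if [ -f "$raw" ]; then
            echo "removing unmasked duplicate: $raw"
            rm -- "$raw"
        fi
    done
    # Step 2: remove regular files in the top-level directory only;
    # subdirectories such as library/ and taxonomy/ are left untouched.
    find "$db" -maxdepth 1 -type f -exec rm -- {} +
)

# Example call, followed by the build command from this thread:
#   clean_db microbial_nt
#   krakenuniq-build --db microbial_nt --kmer-len 31 --threads 40 --taxids-for-genomes --taxids-for-sequences
```

Checking the printed list before rerunning the build is worthwhile, since the deletions cannot be undone.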
Hi Florian,
that's great, thanks a lot for your kind help! I'll let you close it when you've fixed it.
Cheers, Shengwei
Should be fixed now
For the record, I posted a similar issue on the #37 thread. Yet it looks like library/ still contains the problem files. Should I be concerned by this, or just erase them and continue?
archaea/
bacteria/
fungi/
nt-archaea-dustmasked.fna
nt-archaea.fna
nt-archaea.fna.map
nt-bacteria-dustmasked.fna
nt-bacteria.fna
nt-bacteria.fna.map
nt-fungi-dustmasked.fna
nt-fungi.fna
nt-fungi.fna.map
nt-protozoa-dustmasked.fna
nt-protozoa.fna
nt-protozoa.fna.map
nt-viral-dustmasked.fna
nt-viral.fna
nt-viral.fna.map
protozoa/
viral/
Hi @housw
could you please tell me what other files you removed in addition to all the non-dustmasked fasta and map files? My directory structure is a little different, and I do not know what @fbreitwieser means by "remove all the files in the top level db directory". If I do that, there will be nothing left :)
I am experiencing the same issue (hundreds of "kmer found in sequence w/ taxid that is not in database" warnings), but I do not have any non-dustmasked fna files in my library. Here is my library folder:
nt-archaea-dustmasked.fna
nt-archaea.fna.map
nt-bacteria-dustmasked.fna
nt-bacteria.fna.map
nt-fungi-dustmasked.fna
nt-fungi.fna.map
nt-parasitic_worms-dustmasked.fna
nt-parasitic_worms.fna.map
nt-protozoa-dustmasked.fna
nt-protozoa.fna.map
nt-viral-dustmasked.fna
nt-viral.fna.map
So why did this happen?
Hi Florian,
I'm building a microbial_nt database with this command: krakenuniq-build --db microbial_nt --kmer-len 31 --threads 40 --taxids-for-genomes --taxids-for-sequences
I'm getting tons of the following warning messages on the screen:
Is this normal, or am I doing something wrong?
Thanks a lot, Shengwei