Build database - Githubissues

gexijin commented 5 years ago

I followed the new README file to build the database, but the last command gives me an error:

krakenuniq-download --db DBDIR --dust microbial-nt
Unknown database microbial-nt.

gexijin commented 5 years ago

I also got a lot of error messages like this:

Error! Didn't find taxonomy ID mapping for sequence DQ193275.1!!
Error! Didn't find taxonomy ID mapping for sequence DQ193276.1!!
Error! Didn't find taxonomy ID mapping for sequence DQ193277.1!!
Error! Didn't find taxonomy ID mapping for sequence DQ193278.1!!
Error! Didn't find taxonomy ID mapping for sequence DQ193279.1!!
Error! Didn't find taxonomy ID mapping for sequence DQ193280.1!!
Error! Didn't find taxonomy ID mapping for sequence DQ193281.1!!
Error! Didn't find taxonomy ID mapping for sequence DQ193282.1!!
Error! Didn't find taxonomy ID mapping for sequence DQ193283.1!!

dcdanko commented 5 years ago

I'd like to note I am getting a similar error message.

Command:

$ krakenuniq-build --db DB --taxids-for-genomes --taxids-for-sequences --threads 6 --kmer-len 31 --max-db-size 80 2> foo
Kraken build set to minimize disk writes.
Found 12262 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Skipping step 1, k-mer set already exists.
Skipping step 2, database reduction already done.
Skipping step 3, k-mer set already sorted.
Skipping step 4, seqID to taxID map already complete.
Skipping step 5, taxDB exists.
Building  KrakenUniq LCA database (step 6 of 6)...
 Adding taxonomy IDs for sequences
 Adding taxonomy IDs for genomes

Error message

$ cat foo | head
Found jellyfish v1.1.12
Reading taxonomy index from taxDB. Done.
Getting database0.kdb into memory (72.000 GB) ... Done
Loaded database with 6442450942 keys with k of 31 [val_len 4, key_len 8].
Reading sequence ID to taxonomy ID mapping ... [starting new taxonomy IDs with 1000000001] got 23884 mappings.
kmer found in sequence w/ taxid 1000002792 that is not in database
kmer found in sequence w/ taxid 1000002792 that is not in database
kmer found in sequence w/ taxid 1000002792 that is not in database
kmer found in sequence w/ taxid 1000002792 that is not in database
kmer found in sequence w/ taxid 1000002792 that is not in database

Download command (in case it's important)

$ krakenuniq-download --db DB --dust refseq/bacteria

Version

$ krakenuniq --version
KrakenUniq version 0.5.7
Copyright 2017-2018, Florian Breitwieser (fbreitwieser@jhu.edu)
Copyright 2013-2017, Derrick Wood (dwood@cs.jhu.edu) for Kraken

y-mone commented 5 years ago

Hello, I get a similar message when I run the kraken-build command.

Command:

krakenuniq-download --db ${DBDIR} --dust --threads 16 refseq/viral/Any viral-neighbors
/krakenuniq-build \
-jellyfish-bin /opt/users/yahb/program_dir/jellyfish-1.1.11/bin/jellyfish --db ${DBDIR} --kmer-len 31 --threads 16 --taxids-for-genomes --taxids-for-sequences

I get error messages:

Building taxonomy index from taxonomy//nodes.dmp and taxonomy//names.dmp. Done, got 2046891 taxa
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = (unset),
	LC_ALL = (unset),
	LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
Reading taxonomy index from taxDB. Done.
Getting database0.kdb into memory (571.32 MB) ... Done
Loaded database with 49922790 keys with k of 31 [val_len 4, key_len 8].
Reading sequence ID to taxonomy ID mapping ... [starting new taxonomy IDs with 1000000001] got 38554 mappings.
Error! Didn't find taxonomy ID mapping for sequence MF319186.1!!
Error! Didn't find taxonomy ID mapping for sequence KC285152.2!!
Error! Didn't find taxonomy ID mapping for sequence KP862744.1!!
Error! Didn't find taxonomy ID mapping for sequence KT779557.1!!
Error! Didn't find taxonomy ID mapping for sequence KF944111.3!!
Error! Didn't find taxonomy ID mapping for sequence KF895841.2!!
Error! Didn't find taxonomy ID mapping for sequence KF944110.2!!
Error! Didn't find taxonomy ID mapping for sequence MH979230.1!!
Error! Didn't find taxonomy ID mapping for sequence MH979229.1!!
Error! Didn't find taxonomy ID mapping for sequence KY486271.1!!

Please, could you help me fix this?

Thank you.

fbreitwieser commented 5 years ago

Hi @Yahb4 and @gexijin, can you try with the latest version, v0.5.8? There was an issue with versioned versus unversioned mappings (MF319186.1 vs MF319186).
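One way to check whether a build log shows this versioned/unversioned mismatch is to strip the version suffix from each reported accession and look the result up in the map file. The sketch below is not part of KrakenUniq. It assumes the build's stderr was saved to a file and that seqid2taxid.map is tab-separated with the sequence ID in the first column; neither detail is confirmed in this thread, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of KrakenUniq.

# Print the unique sequence IDs named in
# "Error! Didn't find taxonomy ID mapping for sequence X!!" messages.
missing_ids() {
  grep -o "mapping for sequence [^!]*" "$1" | awk '{ print $4 }' | sort -u
}

# For each missing ID, report whether its unversioned form
# (e.g. MF319186 for MF319186.1) appears in column 1 of the map.
# Assumption: the map is tab-separated, sequence ID first.
check_unversioned() {
  log=$1
  map=$2
  missing_ids "$log" | awk -F'\t' '
    NR == FNR { seen[$1] = 1; next }          # first file: the map
    {
      base = $1
      sub(/\.[0-9]+$/, "", base)              # drop the ".N" version suffix
      status = (base in seen) ? "unversioned-in-map" : "absent"
      print $1 "\t" status
    }' "$map" -
}

# Usage (file names are placeholders):
#   check_unversioned build.log DB/seqid2taxid.map
```

If most IDs come back as unversioned-in-map, the log is consistent with the mismatch described above; IDs reported as absent would point to a different cause.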

@dcdanko, please also try with the latest version. There was a bug in the build_db script that resulted in waiting for stdout.

larssnip commented 5 years ago

I installed krakenuniq with Singularity, pulling the container from https://quay.io/organization/biocontainers. The latest version there is 0.5.7. It runs (krakenuniq-download produces help text), but I cannot even download the taxonomy (Fetch failed). Is it a bad idea to install krakenuniq as a container? This is a computing cluster, and I need the admin to install it "properly".

VadimDu commented 5 years ago

Hi everyone,

I have downloaded and built the db with version 0.5.5, and it went OK after adjusting the hash with Jellyfish. Here are the exact commands I have used:

krakenuniq-download --db $REFDIR/DB/ --taxa "archaea,bacteria,fungi" --dust --exclude-environmental-taxa nt
krakenuniq-build -db KrakenUniq/DB/ --jellyfish-hash-size 10000M --threads 32 --taxids-for-genomes --taxids-for-sequences --max-db-size 300

Hope it might help
Vadim

skennedy8 commented 5 years ago

I have an issue with the RAM requirement, that I can get around by setting '--jellyfish-hash-size.' However, this is not ideal and I was hoping to use the 'work-on-disk' option. The problem is that the requested RAM is the same with or without this option. I am reconstructing different versions of NCBI's RefSeq, v70 in this case, which explains the high RAM requirement. I would prefer not to have to limit the hash size, so any suggestions would be helpful.

Found jellyfish v1.1.11
Kraken build set to minimize RAM usage.
Finding all library files
Found 1 sequence files (*.{fna,fa,ffn}) in the library directory.
Creating k-mer set (step 1 of 6)...
Using jellyfish
Hash size not specified, using '133990910241'
terminate called after throwing an instance of 'jellyfish::invertible_hash::ErrorAllocation'
  what():  Failed to allocate 687194767360 bytes of memory
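For scale, the failed allocation in the log above can be converted to GiB and compared with the memory actually installed. This is only an arithmetic sketch; the function name is hypothetical, and the /proc/meminfo check assumes a Linux host.

```shell
#!/usr/bin/env bash
# Convert a byte count to whole GiB (integer division; relies on
# 64-bit shell arithmetic, as in bash).
bytes_to_gib() {
  echo $(( $1 / 1024 / 1024 / 1024 ))
}

# Allocation size copied from the jellyfish error message.
bytes_to_gib 687194767360   # prints 640

# Assumption: Linux. MemTotal is reported in kB, so divide twice by 1024.
if [ -r /proc/meminfo ]; then
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
fi
```

The request therefore corresponds to 640 GiB of RAM, which is the figure to weigh against the node's memory when deciding on a hash size.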

bhurwitz33 commented 5 years ago

In the KrakenUniq paper, supplemental 1, section 2.1 there is a makefile noted for building the reference database from the nt database. Can you give me a link for this makefile? I need the commands listed there to properly build the database.

kubu4 commented 5 years ago

Trying to build database (krakenuniq-0.5.8), but getting the errors seen by @gexijin.

krakenuniq installed with following command:

install_krakenhll.sh .

Taxonomy downloaded with the following command (dustmasker, needed for "dusting" was copied from /gscratch/srlab/programs/ncbi-blast-2.8.1+_orginal/bin/dustmasker):

krakenuniq-download --db /gscratch/srlab/data/kraken_dbs/ --dust microbial-nt

Database built with the following command:

krakenuniq-build --db /gscratch/srlab/data/kraken_dbs/ \
--jellyfish-bin /gscratch/srlab/programs/jellyfish-1.1.11/bin/jellyfish \
--kmer-len 31 \
--threads 27 \
--taxids-for-genomes \
--taxids-for-sequences

Error messages:

Error! Didn't find taxonomy ID mapping for sequence FJ477661.1!!
Error! Didn't find taxonomy ID mapping for sequence FJ551525.1!!
Error! Didn't find taxonomy ID mapping for sequence FJ552035.1!!
Error! Didn't find taxonomy ID mapping for sequence AM935491.1!!
Error! Didn't find taxonomy ID mapping for sequence AM935172.1!!
Error! Didn't find taxonomy ID mapping for sequence AM935186.1!!
Error! Didn't find taxonomy ID mapping for sequence AM935301.1!!
Error! Didn't find taxonomy ID mapping for sequence AM935344.1!!

Any suggestions on how to proceed?

EDITED: Put actual DBDIR in download command.

jarrodscott commented 4 years ago

I'm not sure whether a) this issue was ever addressed or b) my issue is related, but I am encountering a similar problem. I am running KrakenUniq version 0.5.8.

I have a 500 GB log file with nothing but lines like this:

kmer found in sequence w/ taxid 1000033476 that is not in database
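A log of this size is easier to reason about once it is reduced to one line per taxid. The sketch below counts how often each taxid appears in messages of this form; it is not part of KrakenUniq, the function name is hypothetical, and it assumes the messages look exactly like the line quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count "kmer found in sequence w/ taxid N ..." lines
# per taxid and print "count<TAB>taxid", most frequent first.
summarize_kmer_warnings() {
  awk '/kmer found in sequence w\/ taxid/ { count[$7]++ }
       END { for (t in count) print count[t] "\t" t }' "$1" | sort -rn
}

# Usage (the log path is a placeholder):
#   summarize_kmer_warnings build.log | head
```

A short list of taxids would suggest that only a few sequences are affected, whereas thousands of distinct taxids would point to a broader mapping problem.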

Here are the commands I'm running:

krakenuniq-download --db DB --taxa "archaea,bacteria,viral,fungi,protozoa" --dust --exclude-environmental-taxa nt
krakenuniq-download --db DB --taxa "archaea,bacteria,viral,fungi,protozoa" --dust --exclude-environmental-taxa refseq/bacteria refseq/archaea re$
krakenuniq-build --db DB --kmer-len 31 --threads 2 --taxids-for-genomes --taxids-for-sequences --jellyfish-hash-size 10000M --max-db-size 300

And these are the files so far...

-rw-rw-r-- 1   62G Nov 20 12:25 nt.fna.gz
-rw-rw-r-- 1  242G Nov 22 09:40 nt.fna
drwxrwxr-x 7  4.0K Nov 22 17:01 library
drwxrwxr-x 2  4.0K Nov 22 17:36 taxonomy
-rw-rw-r-- 1  5.8M Nov 22 18:01 library-files.txt
-rw-rw-r-- 1  138G Nov 22 19:49 database_0
-rw-rw-r-- 1  134G Nov 22 21:33 database_1
-rw-rw-r-- 1  138G Nov 22 23:56 database_2
-rw-rw-r-- 1  138G Nov 23 01:33 database_3
-rw-rw-r-- 1  136G Nov 23 02:54 database_4
-rw-rw-r-- 1  134G Nov 23 05:17 database_5
-rw-rw-r-- 1  136G Nov 23 06:46 database_6
-rw-rw-r-- 1   73G Nov 23 07:21 database_7
-rw-rw-r-- 1  357G Nov 23 09:08 database.jdb.big
-rw-rw-r-- 1  293G Nov 23 09:35 database.jdb
-rw-rw-r-- 1   279 Nov 23 09:35 database-build.log
-rw-rw-r-- 1  8.1G Nov 23 10:43 database.idx
-rw-rw-r-- 1  293G Nov 23 15:49 database0.kdb
-rw-rw-r-- 1  187M Nov 23 15:51 seqid2taxid.map
-rw-rw-r-- 1  105M Nov 23 15:52 taxDB
-rw-rw-r-- 1  105M Nov 23 15:52 taxDB.orig
-rw-rw-r-- 1  226M Nov 23 15:57 seqid2taxid-plus.map

jarrodscott commented 4 years ago

Actually looks like my issue is related to #44

watsonar commented 4 years ago

Hello, I am getting similar errors to those described here. I am using KrakenUniq version 0.5.7, and I downloaded files and built a database with the following commands:

$ krakenuniq-download --db . taxonomy

$ krakenuniq-download --db . refseq/vertebrate_mammalian/Chromosome/species_taxid=9606 --threads 35

$ krakenuniq-download --db . refseq/bacteria refseq/archaea refseq/viral/Any viral-neighbors --threads 35

$ krakenuniq-download --db . viral-neighbors --threads 35

$ krakenuniq-build --threads 39 --db . --taxids-for-genomes

And while my database build does finish, I have 8,000 lines in my log file that look like this:

Error! Didn't find taxonomy ID mapping for sequence JF490325.1!!

I looked up a handful of these IDs on NCBI, and from my sampling they all seem to correspond to viral taxa. I'm not too worried about viral representation in my database, but I am still hesitant to proceed with my analysis.

According to the log,

123288 sequences (66321.18 Mbp) processed in 633745.047s (0.0 Kseq/m, 6.28 Mbp/m).
  110780 sequences classified (89.85%)
  12508 sequences unclassified (10.15%)

Is there any way to get the percent sequences classified for each group downloaded (bacteria, archaea, viral, etc.)? And in general how can I circumvent these errors?

Thank you so much in advance for any help. :)

fbreitwieser / krakenuniq

Build database #37