DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License
687 stars 267 forks source link

Some sequences labeled as classified but no taxid assigned #40

Closed DerrickWood closed 4 years ago

DerrickWood commented 5 years ago

See https://github.com/DerrickWood/kraken/issues/127 for original submission. Example output line (with initial "C" but taxid 0) is below:

C M00963:162:000000000-AHPVV:1:1101:14752:1397 0 123|123 0:89 |:| 0:15 27592:5 9913:8 0:19 0:22 0:20

@stuber Can you give me some more information to help me track this down? What kind of Kraken 2 database was being run against here? What was the full command (starting with kraken2) that would replicate this output?

stuber commented 5 years ago

I included what I could for this build. It includes: archaea, bacteria, viral, human, fungi, plant, protozoa and 60 additional eukaryotic genomes, which are an additional 140GB. Total database size after cleaning is 190GB. If needed I can provide a list of genomes used.

Build command: kraken2-build --threads 60 --minimizer-spaces 0 --build --db $DBNAME

Specified "0" for minimizer-spaces because this resulted in providing specificity needed. Identifying Mycobacterium tuberculosis complex is important for my situations. Using the default minimizer-spacer value only identified tuberculosis species reads to Mycobacterium. Setting to "0" resulted in identification to tuberculosis.

Run command: kraken2 --db ${krakenDatabase} --threads ${CPUS} --paired ${read1} ${read2} --output $name-outputkraken.txt --report $name-reportkraken.txt

stuber commented 5 years ago

More database info.

$ kraken2-inspect --db $DBNAME # Database options: nucleotide db, k = 35, l = 31 # Spaced mask = 11111111111111111111111111111111111111111111111111111111111111 # Toggle mask = 1110001101111110001010001100010000100111000110110101101000101101 # Total taxonomy nodes: 21153 # Table size: 35713390518 # Table capacity: 51025972662

DerrickWood commented 5 years ago

I don't see anything there or in the code so far that should trigger the output you're seeing. I'm thinking there may be an issue with the taxonomy processing somehow, because the code to decide whether or not to print the "C" vs "U" uses internal taxids (to speed up and simplify some processing in other parts of the software), but the code to display the taxid in the output does a lookup to find the external taxid (e.g., "9606" for human). If you try to classify this particular problem fragment with the --use-names option to kraken2, what output do you get? Did you just use kraken2-build --download-taxonomy --db $DBNAME to get the taxonomy?

stuber commented 5 years ago

With option --use-names I get: C M00963:162:000000000-AHPVV:1:1101:14752:1397 root (taxid 0) 123|123 0:89 |:| 0:15 27592:5 9913:8 0:19 0:22 0:20

I just used kraken2-build --download-taxonomy --db $DBNAME

Currently, I do not understand how Kraken works with the taxonomy files. I've looked into it a couple times, but then decide I'm just glad it all works. I can say the taxonomy numbers, 9913 and 27592, are found in nodes.dmp, and 9913 and 27592 are chosen for some reads as the taxid, so they are usable. Also using --use-names the appropriate names are shown next to the taxid.

For my use case I'll write something to provide a taxid even when there are very few identifying kmers.

Thanks,

DerrickWood commented 5 years ago

This is still a strange case. There are a few things that point to taxonomy problems here. The 0:19 0:22 0:20 in the hit list output is an error - consecutive runs of the same LCA should be collapsed per usual run-length encoding rules. That is, that should just be 0:61. My suspicion is that, at a minimum, the 0:22 is actually a hit to some taxon (call it "X") that is being misreported as taxid 0 (due to a faulty conversion between internal and external taxids), and so X is the true classification of the fragment. For some reason, though, taxon X isn't being reported out correctly here.

Note that the output you saw with --use-names - root (taxid 0) - is incorrect for a few reasons: taxid 0 doesn't exist (root is taxid 1), and unclassified fragments should have unclassified (taxid 0) output. So that's another indication of something awry.

I would be interested in seeing the actual sequence of this fragment's reads, if you're willing to share, @stuber. Also, if you still have the seqid2taxid.map file present in your database's directory, sharing it in some form (email, FTP, whatever works) would likely be helpful here as well, along with the nodes.dmp and names.dmp files it used. Thanks for your help in bringing this to my attention and tracking down the cause.

stuber commented 5 years ago

Files uploaded to FTP site. Please send me a direct email to tod.p.stuber@aphis.usda.govmailto:tod.p.stuber@aphis.usd.gov and I’ll provide login credentials.

Thanks,

On Aug 18, 2018, at 5:17 PM, Derrick Wood notifications@github.com<mailto:notifications@github.com> wrote:

This is still a strange case. There are a few things that point to taxonomy problems here. The 0:19 0:22 0:20 in the hit list output is an error - consecutive runs of the same LCA should be collapsed per usual run-length encoding rules. That is, that should just be 0:61. My suspicion is that, at a minimum, the 0:22 is actually a hit to some taxon (call it "X") that is being misreported as taxid 0 (due to a faulty conversion between internal and external taxids), and so X is the true classification of the fragment. For some reason, though, taxon X isn't being reported out correctly here.

Note that the output you saw with --use-names - root (taxid 0) - is incorrect for a few reasons: taxid 0 doesn't exist (root is taxid 1), and unclassified fragments should have unclassified (taxid 0) output. So that's another indication of something awry.

I would be interested in seeing the actual sequence of this fragment's reads, if you're willing to share, @stuberhttps://github.com/stuber. Also, if you still have the seqid2taxid.map file present in your database's directory, sharing it in some form (email, FTP, whatever works) would likely be helpful here as well, along with the nodes.dmp and names.dmp files it used. Thanks for your help in bringing this to my attention and tracking down the cause.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/DerrickWood/kraken2/issues/40#issuecomment-414089542, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AFjCsQogSPgXtsM0K1dm58YhO5XvjpvTks5uSJJugaJpZM4V5TkH.

This electronic message contains information generated by the USDA solely for the intended recipients. Any unauthorized interception of this message or the use or disclosure of the information it contains may violate the law and subject the violator to civil or criminal penalties. If you believe you have received this message in error, please notify the sender and delete the email immediately.

NBaileyNCL commented 4 years ago

Just posting here to mention I've had a similar issue in Kraken 2 with sequences labelled with taxid 0 despite being in the -classified-out output

jenniferlu717 commented 4 years ago

The newest kraken2 version has fixed this issue. If any additional problems are found, please open a new issue.