DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License
687 stars 267 forks source link

Kraken causes potential mismatch on k-mer Level #707

Closed pfeiferd closed 1 year ago

pfeiferd commented 1 year ago

Hi,

consider the following example sequence for kraken2:

@FP200005993L1C029R04400746873 CTAATTAACTTAACAGTTTAAGTAATTTAATTAACATAG + GGGGIDFHFGEFGFEHHEEFFGHGFHHGEHHGHGBGGIH

kraken2-2.1.2 (on PowerMac) gives me:

U FP200005993L1C029R04400746873 0 39 412418:5

However, when I blast the sequence against NCBI databases (Complete genomes and Draft genomes and Complete plasmids) and 412418 using

https://blast.ncbi.nlm.nih.gov/Blast.cgi

I do not get a single hit. Is this a bug in kraken2 or am I doing something wrong?

I used the DB "Standard 8" (k2_standard_08gb_20221209) from here: https://benlangmead.github.io/aws-indexes/k2

Thanks, Daniel

pfeiferd commented 1 year ago

(I can offer more of such examples...)

AzizHN commented 1 year ago

What confidence are you working with ?

pfeiferd commented 1 year ago

What confidence are you working with ?

Its all defaults... Like this: kraken2 -db /Users/dpfeifer/dev/bioinf/kraken2-2.1.2/dbs/k2_standard_08gb_20221209 /Users/dpfeifer/problem2.fastq

(But I don't think it would matter since conficence comes into play AFTER the indiviual kmers matches have been counted...)

pfeiferd commented 1 year ago

I now also created a kraken database with just the (one) fasta file for 412418 from refseq (GCF_000019705.1_ASM1970v1_genomic.fna). Same thing happens.

AzizHN commented 1 year ago

I tried to run some experiments the last few days. and I have the same thing.

BLASTING reads done a percentage of 1253/1266 mismatches !

pfeiferd commented 1 year ago

I tried to run some experiments the last few days. and I have the same thing.

BLASTING reads done a percentage of 1253/1266 mismatches !

The kraken2 paper talks about a probabilistic hash table with an error rate of 1% - which is "generally speaking" rather high for such purposes. That woud explain miscounts which is in the range of 1%. But you mean more like 99% mismatches???

BUT: It is important to consider this on the k-mer level first. Blasting might be misleading on the level of a whole read because read errors are very common and so a blast of a rather long sequence with several errors (like a read) might not lead to any hits just for that reason.

I really hope this DOES NOT invalidate all the research and professional analysis built on top of kraken2...

pfeiferd commented 1 year ago

Has their been some (more or less rigrorous) functional testing of kraken2? (I have not read the source code and have not checked for test code there either.)

DerrickWood commented 1 year ago

Kraken 2's paper does indeed talk about a probabilistic hash table with a non-zero error rate. Note that it also discusses the fact that k-mers are NOT stored in its hash table, but rather minimizers are, and the minimizer of a k-mer is queried against the table, with the minimizer's associated taxid used as the k-mer's taxid. That is why runs of consecutive k-mers will often share a taxid.

Also note that in v2.0.9, we introduced the --minimum-hit-groups option (with a default of 2) to require multiple hit groups (a set of overlapping k-mers with the same minimizer) to exist prior to making a classification. The example in the first post here failed this criterion, so it was unclassified (see the U at the beginning of the line and 0 in the third column).

None of the data discussed in this issue is a bug in Kraken 2.