DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

Perfect k-mer matches on different species #769

Open bartns opened 8 months ago

bartns commented 8 months ago

To understand what is happening in my actual dataset, I generated perfect in silico reads (1M paired-end reads (2x500k, 125 bp)) from a (bacterial) genome and used those as input for kraken2. I set the confidence threshold to 1.0 and used the nt database. The genome used to generate the reads is in the database.

Not many reads are classified (~17%) at this threshold. At the species level, classified hits are rare (<0.05%), and the majority of them (0.04%) are actually assigned to another species. Naturally, the percentages increase when I lower the confidence threshold, but at the species level most reads are still classified to the other species. (Note that these species are very similar, and maybe genome quality is what makes the difference.)

But some reads even have all 91 k-mers classified to this other species (in both mates of the pair). How does this happen when I know that the reads (for sure) also come from another genome? Why does kraken "prefer" the other species?
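For reference, the figure of 91 k-mers per read follows from the read length and the k-mer length. A minimal sketch, assuming the database was built with Kraken 2's default nucleotide k-mer length of 35 (the function name is made up for illustration):

```python
def kmers_per_read(read_length: int, k: int = 35) -> int:
    """Number of overlapping k-mers in a read of the given length.

    k=35 is Kraken 2's default for nucleotide databases; a database built
    with a different k would give a different count.
    """
    if read_length < k:
        return 0  # read is too short to contain a single k-mer
    return read_length - k + 1


# 125 bp reads, as in the simulated dataset described above
print(kmers_per_read(125))  # 91
```

So a read in which all 91 k-mers hit one taxon has every one of its k-mers assigned to that taxon.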

With the (much) smaller 8GB standard database (which also contains the same species), things go a bit more as expected. Classifications start to be reported below a confidence of 0.2, and there seems to be less "preference" for the other species. The reads that previously matched perfectly now have k-mers assigned to either Unclassified or genus level. There are actually no perfect 91 k-mer matches to any species (only to unclassified).

I don't have a complete understanding of how kraken2 works, so maybe this approach doesn't make sense :/

jenniferlu717 commented 6 months ago

Are the two genomes you are comparing in the same genus?

The genomes might be too similar for Kraken to distinguish between the two. It has a scoring mechanism that is described in the first Kraken paper, but I am uncertain how to explain this further.
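As a rough illustration of how the confidence threshold interacts with per-k-mer assignments, here is a simplified sketch based on the confidence scoring described in the Kraken 2 manual: a label's score is the fraction of the read's k-mers that map inside the clade rooted at that label, and the label moves up the taxonomy until the score meets the threshold. The taxonomy IDs, the toy tree, and the function names are hypothetical, and the initial label is passed in rather than computed, so this is not Kraken's actual implementation.

```python
# Hypothetical toy taxonomy: child -> parent. 1 = root, 10 = genus,
# 101 and 102 = two closely related species in that genus.
PARENT = {10: 1, 101: 10, 102: 10}
UNCLASSIFIED = 0  # stand-in ID for k-mers with no database hit


def ancestors(taxon, parent):
    """Return the taxon followed by all of its ancestors up to the root."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def apply_confidence(label, kmer_hits, parent, threshold):
    """Move `label` up the tree until its clade score meets `threshold`.

    kmer_hits maps taxon ID -> number of k-mers assigned to that taxon.
    Returns the final label, or None if even the root fails (unclassified).
    """
    total = sum(kmer_hits.values())  # unclassified k-mers count in the denominator
    while label is not None:
        in_clade = sum(
            n for taxon, n in kmer_hits.items()
            if taxon != UNCLASSIFIED and label in ancestors(taxon, parent)
        )
        if in_clade / total >= threshold:
            return label
        label = parent.get(label)  # step up one rank; None above the root
    return None


# A read whose 91 k-mers all map to species 102 passes even at 1.0.
print(apply_confidence(102, {102: 91}, PARENT, 1.0))

# A read with mixed genus-level, species-level, and unclassified k-mers.
mixed = {10: 60, 101: 20, UNCLASSIFIED: 11}
for t in (0.2, 0.5, 1.0):
    print(t, apply_confidence(101, mixed, PARENT, t))
```

In this sketch, a read with any unclassified or out-of-clade k-mers cannot pass a threshold of 1.0 at any rank, which is consistent with only a small fraction of reads being classified at that setting.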

I would prefer the standard database rather than the nt database.