DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

discrepancy between Kraken results and BLAST hits #842

Open kuldeepmore10 opened 5 months ago

kuldeepmore10 commented 5 months ago

Hello Kraken team,

I am analysing shotgun data with Kraken2. I built a protist-only database from NCBI RefSeq genomes, tested confidence levels from 0 to 0.7, and increased --minimum-hit-groups to 4. However, there are many discrepancies between the Kraken classification and what I find when I BLAST the classified sequences.

For example, at a confidence level of 0.5:

@LH00328:56:22FFHJLT3:5:1106:14351:2442 1:N:0:GCTATCCT+AACAGGTG kraken:taxid|1093141
TTTGCCGAGTTCCTTCTCCTGAGTTCTCTCAAGCGCCTTGGAATATTCATCCCGTCCACCTGTGTCGGTTTGCGGTACGGTCTCGTACAGCTGAAGCTTAGAGGCTTTTCTTGGAACCACTTCCAATCACTTCGCGAAACAAGTTCGCTC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

When I BLAST this sequence, I get 100% identity to Bordetella, but Kraken classifies it as Nannochloropsis gaditana. This is just one example; there are several others. Am I doing anything wrong? What is the solution?

salzberg commented 5 months ago

This is easy: both answers are correct. Your database only has protists, so Kraken cannot recognize this as Bordetella, which is the correct ID. So it gives you the species that has at least one 31-bp exact match, even though it's not a great match otherwise.
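The effect described above can be sketched with a toy k-mer lookup in Python. This is only an illustration, not Kraken2's implementation: the taxon names `protist_A` and `bacterium_B`, the sequences, the tiny `K = 5`, and the `min_hits` threshold are all made up for the example, and shared k-mers simply keep the first taxon seen, which ignores how Kraken handles k-mers shared between genomes.

```python
from collections import Counter

K = 5  # toy k-mer length, chosen only so the example stays readable


def kmers(seq, k=K):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references):
    """Map each k-mer to the first reference taxon that contains it (a simplification)."""
    index = {}
    for taxon, seq in references.items():
        for kmer in kmers(seq):
            index.setdefault(kmer, taxon)
    return index


def classify(read, index, min_hits=1):
    """Label a read with the taxon that has the most k-mer hits, if it has enough of them."""
    hits = Counter(index[k] for k in kmers(read) if k in index)
    if not hits:
        return "unclassified"
    taxon, count = hits.most_common(1)[0]
    return taxon if count >= min_hits else "unclassified"


# Hypothetical reference sequences; the read comes from the bacterium
# but happens to share exactly one 5-mer (CCGGT) with the protist.
PROTIST = "AAACCGGTTTAC"
BACTERIUM = "GGGCCGGTACATTT"
bacterial_read = "GGGCCGGTACAT"

protist_only = build_index({"protist_A": PROTIST})
inclusive = build_index({"protist_A": PROTIST, "bacterium_B": BACTERIUM})

print(classify(bacterial_read, protist_only))              # protist_A
print(classify(bacterial_read, protist_only, min_hits=2))  # unclassified
print(classify(bacterial_read, inclusive))                 # bacterium_B
```

With the protist-only index, the single shared k-mer is the only evidence available, so the bacterial read is labelled as the protist. Once the bacterium is in the index, the same read is assigned to it.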

kuldeepmore10 commented 5 months ago

That's what I thought. But does that mean that, even if I only want to analyse protists, I will have to build an all-inclusive database (say bacteria, archaea, plants, etc.) every time? Furthermore, I thought Kraken had species-specific k-mers and would assign a read to a species only if one of those k-mers is detected. But that may be my misunderstanding.

salzberg commented 5 months ago

It's true that Kraken identifies species-specific k-mers when it builds the database. But it can only do that for species that you give it at the time you build the DB. So if a k-mer is shared between a protist and a bacterium, it will be classified at the lowest common ancestor of those species. However, if you only give it the protist genomes, then the k-mer could appear to be specific to one of the protists. So yes, you have to build an inclusive database. Note that our Microbial2023 database is very large and inclusive already. So you could use that first, and then take all the 'unclassified' reads from the output, and run those against a second, protist-only database. Microbial2023 has a few protists in it, so you'd have to merge the results afterwards.
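A minimal Python sketch of the lowest-common-ancestor idea described above, under stated assumptions: the five-node taxonomy, the sequences, and `K = 5` are hypothetical, and real Kraken2 databases work on minimizers of much longer k-mers against the full NCBI taxonomy. The sketch only shows why a shared k-mer looks species-specific when one of the genomes is left out of the build.

```python
# Hypothetical toy taxonomy: child -> parent (root has no parent).
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Eukaryota": "root",
    "Bordetella": "Bacteria",
    "Nannochloropsis": "Eukaryota",
}

K = 5  # toy k-mer length for readability


def lineage(taxon):
    """Return the path from a taxon up to the root, taxon first."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(a, b):
    """Return the lowest common ancestor of two taxa."""
    ancestors_of_a = set(lineage(a))
    for node in lineage(b):
        if node in ancestors_of_a:
            return node
    raise ValueError("taxa share no ancestor")


def build_index(genomes, k=K):
    """Map each k-mer to the LCA of every genome in the build that contains it."""
    index = {}
    for taxon, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            index[kmer] = lca(index[kmer], taxon) if kmer in index else taxon
    return index


# Made-up sequences that share the 5-mer CCGGT.
GENOMES = {
    "Nannochloropsis": "AAACCGGTTTAC",
    "Bordetella": "GGGCCGGTTCAT",
}

protist_only = build_index({"Nannochloropsis": GENOMES["Nannochloropsis"]})
inclusive = build_index(GENOMES)

print(protist_only["CCGGT"])  # Nannochloropsis
print(inclusive["CCGGT"])     # root
```

In the protist-only build the shared k-mer maps straight to the protist, while in the inclusive build it is pushed up to the common ancestor and no longer counts as species-level evidence.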

kuldeepmore10 commented 5 months ago

This explains a lot. Thank you very much :) I am now building a bacteria plus protozoa database. I will post here if it works out so that the thread can be closed :)