DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

Unclassified reads have blast hits that match lineages in the kraken2 database #343

Open charlesfoster opened 4 years ago

charlesfoster commented 4 years ago

Hi,

I'm trying out using kraken2 for the first time. I have short (150 bp) sequencing reads that are ostensibly of SARS-CoV-2, but as part of QC I wish to classify them using kraken2 to see if there are any contaminants. I run kraken2 with the minikraken2 database like so:

kraken2 --threads 8 --gzip-compressed fastq/READS1.fastq.gz fastq/READS2.fastq.gz --db /path/to/minikraken2_v2_8GB_201904_UPDATE --output test.kraken --report test_kraken.report --paired --use-names

The output is:

436895 sequences (131.94 Mbp) processed in 7.533s (3479.9 Kseq/m, 1050.92 Mbp/m).
82033 sequences classified (18.78%)
354862 sequences unclassified (81.22%)

The classified sequences appear to check out. For example, when I BLAST reads assigned to "Homo sapiens (taxid 9606)" against the nt database, they get human hits; likewise for reads assigned to viruses/SARS-CoV-2. However, the unclassified sequences (81.22% of the total) make less sense. Whenever I randomly select unclassified reads, they have perfect matches to SARS-CoV-2 genomes in the nt database.
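For anyone wanting to repeat this spot check, here is a minimal Python sketch of one way to pull the unclassified read IDs out of the `--output` file. It assumes Kraken 2's standard tab-separated output, where the first column is `C` or `U` and the second is the read ID. The function name `unclassified_read_ids` and the two example lines are made up for illustration; they are not real data from this run.

```python
import io


def unclassified_read_ids(handle):
    """Yield read IDs from Kraken 2 standard output lines flagged 'U'."""
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # Column 1 is "C" (classified) or "U" (unclassified); column 2 is the read ID.
        if len(fields) >= 2 and fields[0] == "U":
            yield fields[1]


# Hypothetical two-read example laid out like a paired-end run with --use-names.
example = io.StringIO(
    "C\tread1\tHomo sapiens (taxid 9606)\t150|150\t9606:10 0:106 |:| 0:116\n"
    "U\tread2\tunclassified (taxid 0)\t150|150\t0:116 |:| 0:116\n"
)
ids = list(unclassified_read_ids(example))
print(ids)
```

With the real file, the same function could be called on `open("test.kraken")`, and a random sample of the returned IDs used to fetch reads for BLAST.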

Since I'm new to kraken2, it's possible I'm missing something here, but shouldn't these unclassified reads be classified as SARS-CoV-2, or at least to some viral lineage? Are there some settings I'm missing?

Thanks, Charles

jenniferlu717 commented 4 years ago

It's likely because you're using the minikraken database.

To condense the database into 8 GB, the minikraken database keeps only a subsample of the k-mers from the full database, so some k-mers are missing from the minikraken database.
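To give a feel for what subsampling does, here is a toy Python sketch. It is not Kraken 2's actual scheme (Kraken 2 stores minimizers in a compact hash table); the CRC32 cutoff, the random 30 kb "genome", and the helper names `kmers`, `subsample`, and `fraction_classified` are all assumptions made for illustration. It treats a read as classified when at least one of its k-mers is still in the database, so it shows how the classified fraction depends on how many k-mers survive.

```python
import random
import zlib

K = 35  # k-mer length; Kraken 2's default for nucleotide databases


def kmers(seq, k=K):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def subsample(kmer_set, keep_fraction):
    """Keep a k-mer when its CRC32 hash falls below a cutoff (toy scheme)."""
    cutoff = int(keep_fraction * 2**32)
    return {km for km in kmer_set if zlib.crc32(km.encode()) < cutoff}


def fraction_classified(reads, db):
    """Fraction of reads with at least one k-mer present in db."""
    hits = sum(1 for read in reads if kmers(read) & db)
    return hits / len(reads)


# Simulated reference and error-free 150 bp reads drawn from it.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(30_000))
reads = []
for _ in range(200):
    start = rng.randrange(len(genome) - 150 + 1)
    reads.append(genome[start:start + 150])

full_db = kmers(genome)
for frac in (1.0, 0.1, 0.01, 0.001):
    db = subsample(full_db, frac)
    print(frac, len(db), fraction_classified(reads, db))
```

In this toy model a read only goes unclassified when every one of its k-mers was dropped, so the effect is small at mild subsampling and grows as the kept fraction shrinks.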

It is possible that a full database will result in more reads being classified.