fbreitwieser / krakenuniq

🐙 KrakenUniq: Metagenomics classifier with unique k-mer counting for more specific results
GNU General Public License v3.0

Consistent False Positives in Respiratory Metagenome Analysis Using KrakenUniq with the MicrobialDB Database #182

Open TLH297 opened 1 week ago

TLH297 commented 1 week ago

Hello everyone,

My team and I are analyzing short-read metagenome data from human respiratory samples using KrakenUniq with the MicrobialDB database (available from https://benlangmead.github.io/aws-indexes/k2), with further processing in Bracken.

During our initial review, we noticed that some species that seem implausible for a nasal respiratory metagenome consistently appeared across all samples with identical k-mer counts. Specifically, E. coli frequently showed 300–500 k-mer counts per sample (with a few exceptions of higher k-mer counts). However, when we analyzed the same dataset with MetaPhlAn, these apparent false positives were not replicated: MetaPhlAn reported positive results only in the samples for which KrakenUniq showed higher k-mer counts.

Further analysis suggests that the widespread E. coli detections may be due to k-mers originating from a plasmid initially associated with E. coli. This issue is not confined to E. coli; it also affects Plantactinospora sp. BB1, Plantactinospora sp. BC1, Stenotrophomonas indicatrix, Salmonella enterica, Phocaeicola dorei, and Buchnera aphidicola.

Has anyone encountered a similar issue? We would greatly appreciate any insights or shared experiences.
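For anyone wanting to screen for the same pattern, here is a minimal sketch of how taxa that show up in every sample with a low k-mer count could be flagged. It assumes the per-sample k-mer counts have already been extracted from the KrakenUniq reports into plain dictionaries; the function name, the input layout, and the 500 k-mer threshold are illustrative assumptions, not part of KrakenUniq.

```python
from statistics import median


def flag_ubiquitous_low_kmer_taxa(samples, max_median_kmers=500):
    """Return taxa present in every sample whose median k-mer count is low.

    `samples` maps a sample name to a {taxon name: unique k-mer count} dict.
    This layout and the default threshold are assumptions for illustration.
    """
    if not samples:
        return []
    # Taxa reported in every single sample.
    shared = set.intersection(*(set(counts) for counts in samples.values()))
    flagged = []
    for taxon in sorted(shared):
        counts = [sample_counts[taxon] for sample_counts in samples.values()]
        # A low median tolerates a few samples with genuinely high counts.
        if median(counts) <= max_median_kmers:
            flagged.append(taxon)
    return flagged


# Made-up numbers, for illustration only.
example = {
    "sample_1": {"Escherichia coli": 350, "Staphylococcus aureus": 20000},
    "sample_2": {"Escherichia coli": 420, "Staphylococcus aureus": 15000},
    "sample_3": {"Escherichia coli": 9000, "Staphylococcus aureus": 18000},
}
suspects = flag_ubiquitous_low_kmer_taxa(example)
```

Taxa flagged this way are only candidates for closer inspection; a low, uniform k-mer count is a hint of a shared contaminant or plasmid sequence, not proof.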

salzberg commented 1 week ago

You can re-build the database after removing that plasmid from E. coli and the other species, and then Kraken won't call it E. coli. But it seems that the plasmid is truly present, it's just matching other species, so you could also add the plasmid to the DB and give it a much higher-level label, like "Bacteria," and then Kraken won't call it any species.

The reason MetaPhlAn works is that it only recognizes a small number of marker genes. Most genes (and most reads in a sample) aren't identified at all with that program. It counts how many reads match single-copy marker genes, maybe 1% of the genes in a genome, and then uses those counts to estimate abundance. It's a very different approach. It will miss many species entirely, of course, which is why I always recommend KrakenUniq.
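The first suggestion above, pulling plasmid sequences out of the library before rebuilding, could be sketched roughly as follows. This assumes plasmid records can be recognized by the word "plasmid" in their FASTA header, which is common but not guaranteed; the function name and the keyword match are illustrative assumptions. The database rebuild itself, and any reassignment of the removed records to a higher-level taxon such as Bacteria, would go through the usual KrakenUniq build process and is not shown here.

```python
def split_fasta_by_keyword(lines, keyword="plasmid"):
    """Split FASTA lines into (kept, removed) by a keyword in the header.

    Assumption: a record is a plasmid if its header line contains `keyword`
    (case-insensitive). Check the headers in your own library first.
    """
    kept, removed = [], []
    target = kept  # lines seen before any header are kept unchanged
    for line in lines:
        if line.startswith(">"):
            # A new record starts: decide where the whole record goes.
            target = removed if keyword.lower() in line.lower() else kept
        target.append(line)
    return kept, removed


# Made-up records, for illustration only.
fasta = [
    ">seq1 Escherichia coli chromosome",
    "ACGTACGT",
    ">seq2 Escherichia coli plasmid pExample",
    "GGGGCCCC",
    "TTTTAAAA",
    ">seq3 Salmonella enterica chromosome",
    "ACACACAC",
]
chromosomal, plasmids = split_fasta_by_keyword(fasta)
```

Keeping the removed records in a separate list, rather than discarding them, leaves the second option open: they can be added back under a higher-level label so the reads still classify but are not attributed to any one species.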