DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

large discrepancy between Kraken2 and KrakenUniq counts of unique kmers #443

Open LeandroRitter opened 3 years ago

LeandroRitter commented 3 years ago

Dear Kraken2 developers,

Thanks a lot for this fantastic software! I really appreciate the addition of the --report-minimizer-data flag (borrowed from KrakenUniq) to recent versions of Kraken2, because breadth of coverage is one of the most crucial validation metrics in metagenomic analysis.

I have been using KrakenUniq a lot with the threshold kmers=1000 (suggested by the KrakenUniq authors) for filtering out false-positive hits. This threshold has worked very well in several projects and, in my experience, has never failed in an obvious way. However, I noticed a huge discrepancy between the "number of unique kmers" reported by KrakenUniq and the "number of distinct minimizers" reported by Kraken2 in the 5th column of its report. Applying a threshold of 1000 to the "number of distinct minimizers" in the Kraken2 output resulted in many false-positive hits. Moreover, according to my testing, Kraken2 overestimates the number of unique kmers relative to KrakenUniq for the majority of microbes, yet underestimates it for quite a few others. So I could not work out a simple rule for converting breadth-of-coverage metrics from KrakenUniq to Kraken2.

My question is whether you have any recommendation on what threshold for the "number of distinct minimizers" (5th column of the Kraken2 report) would be appropriate for removing false positives. I would also appreciate your comments on the comparison of breadth-of-coverage metrics between KrakenUniq and Kraken2, and any idea you have about the reason for such a large discrepancy between them. Thanks!

Best, Nikolay


Nikolay Oskolkov, PhD Bioinformatician, SciLifeLab Bioinformatics Long-term Support (WABI) www.nbis.se


YiJessePi commented 2 years ago

Hey Nikolay, do you have any update on this?

higaredavm commented 1 year ago

@LeandroRitter Hi, did you receive an answer or have any update on your question? I'm also trying to filter out false positives using the 5th column of the Kraken2 report.
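For anyone experimenting with this kind of filter, here is a minimal sketch in Python. It assumes a tab-separated Kraken2 report produced with --report-minimizer-data, where the 5th column holds the distinct-minimizer estimate and the 7th and 8th hold the taxid and the indented name. The function name `filter_report`, the cutoff value, and the two sample rows are made up for illustration; the thread above suggests that the KrakenUniq value of 1000 does not carry over to Kraken2, so treat the cutoff as a placeholder to tune.

```python
def filter_report(report_text, min_distinct_minimizers=1000):
    """Return report rows whose 5th column (distinct minimizers) meets the cutoff.

    The cutoff is a placeholder, not a recommended value.
    """
    kept = []
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 8:
            # Not a minimizer-data report row (too few columns); skip it.
            continue
        distinct = int(fields[4])  # 5th column: estimated distinct minimizers
        if distinct >= min_distinct_minimizers:
            kept.append({
                "clade_reads": int(fields[1]),
                "distinct_minimizers": distinct,
                "rank": fields[5],
                "taxid": fields[6],
                "name": fields[7].strip(),  # names are indented by tree depth
            })
    return kept


# Two invented rows, only to show the expected column layout.
SAMPLE = "\n".join([
    " 50.00\t500\t500\t12000\t4500\tS\t562\t      Escherichia coli",
    "  0.10\t12\t12\t300\t40\tS\t1280\t      Staphylococcus aureus",
])

for row in filter_report(SAMPLE, min_distinct_minimizers=1000):
    print(row["taxid"], row["name"], row["distinct_minimizers"])
```

Keeping the cutoff as a parameter makes it easy to sweep several values against a sample with known composition instead of committing to one number.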

douglasadamoski commented 1 year ago

Dear all, I apologize for revisiting this topic, but I believe it's pertinent. In the KrakenUniq paper, two distinct thresholds are discussed: one scales with sequencing depth, in unique k-mers per million reads (e.g., "We observed that the optimal thresholds increased by approximately 2000 unique k-mers for every 1 million reads"), and the other is a fixed cutoff on read count and unique k-mer count (e.g., "A read count threshold of 10 and a unique k-mer count threshold of 1000 significantly reduced background identifications"). However, the sample tables provided don't seem to include the first metric (k-mers per million reads).

Combining both data points might offer a more comprehensive evaluation of the datasets.
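One way the two thresholds could be combined is sketched below in Python. This is only an illustration under stated assumptions: the depth-dependent cutoff is taken to scale linearly at 2000 unique k-mers per million reads, the fixed value of 1000 is used as a floor, and the two are merged with `max`. None of this is confirmed by the paper or by the Kraken2 developers, and the function names are hypothetical.

```python
def depth_scaled_kmer_threshold(total_reads, kmers_per_million=2000):
    """Unique k-mer cutoff that grows linearly with sequencing depth (assumed)."""
    return kmers_per_million * total_reads / 1_000_000


def passes_filters(read_count, unique_kmers, total_reads,
                   min_reads=10, min_kmers=1000, kmers_per_million=2000):
    """Apply the read-count cutoff and the larger of the two k-mer cutoffs.

    Taking the maximum of the fixed and depth-scaled cutoffs is an assumption
    made for this sketch.
    """
    kmer_cutoff = max(
        min_kmers,
        depth_scaled_kmer_threshold(total_reads, kmers_per_million),
    )
    return read_count >= min_reads and unique_kmers >= kmer_cutoff


# With 5 million reads, the depth-scaled cutoff is 2000 * 5 = 10000 unique k-mers.
print(depth_scaled_kmer_threshold(5_000_000))
print(passes_filters(read_count=50, unique_kmers=8000, total_reads=5_000_000))
print(passes_filters(read_count=50, unique_kmers=12000, total_reads=5_000_000))
```

For shallow samples the fixed floor dominates, while for deep samples the depth-scaled term takes over, which is the behaviour the two quoted statements seem to point to.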