DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

Possible issue with distinct minimizer count in combination with confidence #630

Open danisven opened 2 years ago

danisven commented 2 years ago

I've classified the same dataset against the same database several times, varying only the confidence score setting. I've run with confidence 0.0 (default), 0.1, 0.2, 0.5, and 0.9.

The number of classified reads drops as the confidence score increases, as expected. However, the columns reporting the total number of minimizers in the read data and the number of distinct minimizers in the read data (columns 4 and 5, respectively) are identical between classifications, as far as I can see.

I would have expected the number of minimizers (both total and distinct) to drop when the number of classified reads drops. Have I perhaps misunderstood something about the output, or could this be a bug?

I'm attaching the report files from the classifications so you can have a look yourself.

I'm running version 2.1.2.

Cheers, /Daniel

confidence_0p0_report.txt confidence_0p1_report.txt confidence_0p2_report.txt confidence_0p5_report.txt confidence_0p9_report.txt
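For anyone who wants to check this programmatically rather than by eye, here is a minimal Python sketch that compares columns 4 and 5 between two reports. It assumes the reports are tab-separated and use the eight-column layout produced with minimizer data (percentage, clade reads, direct reads, total minimizers, distinct minimizers, rank code, taxid, name). The function names are made up for illustration and are not part of Kraken 2.

```python
import csv

def read_minimizer_columns(lines):
    """Map taxid -> (total minimizers, distinct minimizers) for one report.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Assumes the eight-column, tab-separated report layout with minimizer data.
    """
    counts = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 8:
            continue  # skip blank lines or rows without minimizer columns
        taxid = row[6].strip()
        counts[taxid] = (int(row[3]), int(row[4]))  # columns 4 and 5
    return counts

def differing_taxa(counts_a, counts_b):
    """Return taxids found in both reports whose minimizer columns differ."""
    shared = counts_a.keys() & counts_b.keys()
    return sorted(t for t in shared if counts_a[t] != counts_b[t])
```

To use it, open two of the attached reports, pass each file handle to `read_minimizer_columns`, and pass the two results to `differing_taxa`. An empty list would mean that columns 4 and 5 agree for every taxon present in both reports.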

Midnighter commented 1 year ago

Hi @danisven,

I'm not one of the authors, but based on my understanding of how the minimizers are reported, two possible explanations are:

  1. Even though the number of sequencing reads mapped decreases due to the confidence parameter, there are still enough sequences that they span the same minimizer space (and distinct minimizers).
  2. Another possibility is that the minimizers are counted before the confidence threshold is applied, which would be unintuitive, to say the least.
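
To make the difference between the two explanations concrete, here is a toy Python model. It is not Kraken 2's code and makes no claim about how Kraken 2 actually counts. Each read is reduced to a set of made-up minimizer IDs plus a made-up confidence score, and the hypothetical `filter_first` flag switches between counting only reads that pass the threshold and counting all reads.

```python
def distinct_minimizers(reads, threshold, filter_first):
    """Count distinct minimizers over a list of (minimizer_set, score) reads.

    filter_first=True  -> only reads with score >= threshold contribute.
    filter_first=False -> every read contributes, whatever the threshold.
    Toy model only; all names and values here are invented for illustration.
    """
    seen = set()
    for minimizers, score in reads:
        if filter_first and score < threshold:
            continue  # read rejected by the confidence threshold
        seen.update(minimizers)
    return len(seen)

# Explanation 1: redundant reads. Dropping low-confidence reads changes
# nothing because the surviving read already covers the same minimizers.
redundant = [({1, 2, 3}, 0.95), ({1, 2, 3}, 0.30), ({2, 3}, 0.10)]
assert distinct_minimizers(redundant, 0.0, filter_first=True) == 3
assert distinct_minimizers(redundant, 0.9, filter_first=True) == 3

# Explanation 2: counting before filtering. The count cannot depend on the
# threshold, even when the reads share no minimizers.
sparse = [({1, 2}, 0.95), ({3, 4}, 0.30)]
assert distinct_minimizers(sparse, 0.9, filter_first=True) == 2
assert distinct_minimizers(sparse, 0.9, filter_first=False) == 4
```

In this toy setting, the distinct count from explanation 1 could still change on a sparser dataset, whereas under explanation 2 it stays fixed for every threshold. Note that the toy model covers distinct minimizers only, not the total count in column 4.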