DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License
731 stars 274 forks source link

Ambiguous bases being included in confidence score calculation #878

Open mcolpus opened 2 months ago

mcolpus commented 2 months ago

In the documentation it says that ambiguous bases are excluded for the purpose of calculating confidence score. But I've been getting some slight errors when running kraken2, which led me to try doing some simple print statements.

Example: I ran kraken2 with --confidence 0.1 and got a read classified as follows

U   600 0   151|151 0:41 1:1 0:75 |:| 0:44 A:34 570:3 0:30 9606:3 0:3

Which has counts: 0: 193 1: 1 570:3 9606: 3 A: 34 so total is 200, or 234 with Ambiguous bases included.

I put print statement in classify.cc at line 439 in ResolveTree

fprintf(stderr, "Required score: %u\n", required_score);
fprintf(stderr, "Total minimizers: %lu\n", total_minimizers);
fprintf(stderr, "conf threshold: %f\n", opts.confidence_threshold);
for (auto &kv_pair : hit_counts) {
  fprintf(stderr, "Taxon %lu: %lu\n", kv_pair.first, kv_pair.second);
}

The output is:

Required score: 24
Total minimizers: 234
conf threshold: 0.100000
Taxon 45471: 3
Taxon 12739: 3
Taxon 1: 1

I'm not sure why the taxon are different (I assume a different number is used internally). But the required score is clearly using the total number of kmers which includes A's.

Is this intended or a bug?

mcolpus commented 1 month ago

On consideration. I think it I prefer the way it is currently done in the code (including ambiguous kmers in total count), so more sensible to change the docs