Confidence threshold with paired end reads

DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system

MIT License

I'm having difficulty understanding and tuning the confidence threshold. According to the documentation, the confidence score is C/Q, where C is the number of k-mers supporting a particular taxon and Q is the number of k-mers that could be classified (that is, not ambiguous).

From this description, I would expect the classification rate to decrease monotonically as I increase the value passed to --confidence. This does not happen. I classified 52,142,889 reads under the same conditions, varying only the confidence threshold. The results are shown below.

Threshold	Rate	Notes
0	0.5693
0.05	0.5420	lower, as expected
0.1	0.5675	higher!
0.15	0.4850
0.2	0.4366
0.25	0.3924
0.3	0.3300
0.35	0.2607
0.4	0
0.45	0
0.5	0
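To make the non-monotonic step explicit, here is a small Python check over the numbers in the table. The variable and function names are my own, chosen only for illustration; the values are copied from the table as-is.

```python
# Thresholds and classification rates copied from the table above.
thresholds = [0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]
rates = [0.5693, 0.5420, 0.5675, 0.4850, 0.4366, 0.3924,
         0.3300, 0.2607, 0.0, 0.0, 0.0]


def rate_increases(thresholds, rates):
    """Return (lower, higher) threshold pairs where the rate goes up."""
    return [
        (thresholds[i], thresholds[i + 1])
        for i in range(len(rates) - 1)
        if rates[i + 1] > rates[i]
    ]


violations = rate_increases(thresholds, rates)
print(violations)  # [(0.05, 0.1)]
```

The only step that breaks the expected ordering is 0.05 to 0.1.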

As an example, consider this entry.

C read_id Homo sapiens (taxid 9606) 187|185 0:6 9606:53 A:94 |:| 9606:58 A:93

As far as I understand, the first read has 6 unmapped k-mers, 53 mapping to human, and 94 ambiguous (human support = 53 / 59 = 89.8%), and the second read has 58 k-mers mapping to human and 93 ambiguous (human support = 100%).
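For reference, this is how I am computing C/Q from the k-mer field of the entry above, written as a minimal Python sketch. It reflects only my reading of the documentation, not Kraken 2's actual implementation. The function name `pair_confidence` is hypothetical, and it is simplified: it counts only k-mers assigned to the exact taxids passed in, whereas a faithful version would need the taxonomy to include descendants of the taxon.

```python
def pair_confidence(kmer_field, taxids):
    """C/Q for a k-mer mapping field, under the reading described above.

    kmer_field: e.g. "0:6 9606:53 A:94 |:| 9606:58 A:93"
    taxids: set of taxid strings counted as supporting the call
            (assumption: exact matches only, no taxonomy lookup).
    """
    supporting = 0  # C: k-mers assigned to one of the given taxids
    countable = 0   # Q: all k-mers except ambiguous ones ("A")
    for token in kmer_field.split():
        if token == "|:|":  # separator between the two mates
            continue
        label, count = token.rsplit(":", 1)
        n = int(count)
        if label == "A":  # ambiguous k-mers are left out of Q
            continue
        countable += n
        if label in taxids:
            supporting += n
    return supporting / countable if countable else 0.0


field = "0:6 9606:53 A:94 |:| 9606:58 A:93"
print(round(pair_confidence(field, {"9606"}), 4))  # 0.9487
```

Pooling both mates gives 111 / 117, about 0.949, which is well above 0.40 under this reading.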

This read is not classified at confidence >= 0.40. Why not?

Thank you for your help!

C MISEQ-M02326R:42:000000000-ADKKC:1:1101:8466:3541 9606 250|250 0:216 |:| 0:185 9606:7 0:24
C MISEQ-M02326R:42:000000000-ADKKC:1:1101:13164:2854 9606 250|250 0:216 |:| 0:179 9606:5 0:1 9606:7 0:24
C MISEQ-M02326R:42:000000000-ADKKC:1:1101:14229:2899 9606 250|250 0:216 |:| 0:79 9606:25 0:112

Confidence threshold with paired end reads #493