DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License
711 stars 271 forks source link

Confidence threshold with paired end reads #493

Open fm361 opened 3 years ago

fm361 commented 3 years ago

I'm having difficulties understanding and tuning the confidence threshold. According to the documentation here, the confidence score is C/Q, where C is the number of k-mers supporting a particular taxon and Q is all k-mers that could be classified (not ambiguous).

From this description I would expect that as I increase the value passed to --confidence, the classification rate will decrease - monotonically. This does not happen. I classified 52,142,889 reads under the same conditions, only varying the confidence threshold. The results are shown below.

Threshold Rate Notes
0 0.5693
0.05 0.5420 lower, as expected
0.1 0.5675 higher!
0.15 0.4850
0.2 0.4366
0.25 0.3924
0.3 0.3300
0.35 0.2607
0.4 0
0.45 0
0.5 0

As an example, consider this entry.

C read_id Homo sapiens (taxid 9606) 187|185 0:6 9606:53 A:94 |:| 9606:58 A:93

As far as I can understand, there are 6 unmapped k-mers, 53 mapping to human, and 94 ambiguous for the first read (human support = 53 / 59 = 89.8%), and 58 k-mers mapping to human on the second read, with 93 ambiguous (human support 100%).

This read is not classified at confidence >= 0.40. Why not?

Thank you for your help!

antgonza commented 2 years ago

Hello,

After browsing the open/closed issues, I think my question is basically the same as @fm361; basically, I'm wondering how within the same run (they are in the same input paired FASTQ forward and reverse query) these sequences are classified:

C   MISEQ-M02326R:42:000000000-ADKKC:1:1101:8466:3541   9606    250|250 0:216 |:| 0:185 9606:7 0:24
C   MISEQ-M02326R:42:000000000-ADKKC:1:1101:13164:2854  9606    250|250 0:216 |:| 0:179 9606:5 0:1 9606:7 0:24
C   MISEQ-M02326R:42:000000000-ADKKC:1:1101:14229:2899  9606    250|250 0:216 |:| 0:79 9606:25 0:112

while these are not:

U   MISEQ-M02326R:42:000000000-ADKKC:1:1101:14047:1973  0   250|250 0:216 |:| 0:184 9606:5 0:27
U   MISEQ-M02326R:42:000000000-ADKKC:1:1101:13992:1667  0   250|250 0:216 |:| 0:200 9606:5 0:11

In all the cases, most of the k-mers are unclassified but some reads are matched to human, similar to @fm361 example.

Thank you.