Open fm361 opened 3 years ago
Hello,
After browsing the open/closed issues, I think my question is basically the same as @fm361; basically, I'm wondering how within the same run (they are in the same input paired FASTQ forward and reverse query) these sequences are classified:
C MISEQ-M02326R:42:000000000-ADKKC:1:1101:8466:3541 9606 250|250 0:216 |:| 0:185 9606:7 0:24
C MISEQ-M02326R:42:000000000-ADKKC:1:1101:13164:2854 9606 250|250 0:216 |:| 0:179 9606:5 0:1 9606:7 0:24
C MISEQ-M02326R:42:000000000-ADKKC:1:1101:14229:2899 9606 250|250 0:216 |:| 0:79 9606:25 0:112
while these are not:
U MISEQ-M02326R:42:000000000-ADKKC:1:1101:14047:1973 0 250|250 0:216 |:| 0:184 9606:5 0:27
U MISEQ-M02326R:42:000000000-ADKKC:1:1101:13992:1667 0 250|250 0:216 |:| 0:200 9606:5 0:11
In all the cases, most of the k-mers are unclassified but some reads are matched to human, similar to @fm361 example.
Thank you.
I'm having difficulties understanding and tuning the confidence threshold. According to the documentation here, the confidence score is C/Q, where C is the number of k-mers supporting a particular taxon and Q is all k-mers that could be classified (not ambiguous).
From this description I would expect that as I increase the value passed to
--confidence
, the classification rate will decrease - monotonically. This does not happen. I classified 52,142,889 reads under the same conditions, only varying the confidence threshold. The results are shown below.As an example, consider this entry.
C read_id Homo sapiens (taxid 9606) 187|185 0:6 9606:53 A:94 |:| 9606:58 A:93
As far as I can understand, there are 6 unmapped k-mers, 53 mapping to human, and 94 ambiguous for the first read (human support = 53 / 59 = 89.8%), and 58 k-mers mapping to human on the second read, with 93 ambiguous (human support 100%).
This read is not classified at
confidence >= 0.40
. Why not?Thank you for your help!