artic-network / artic-ncov2019

ARTIC nanopore protocol for nCoV2019 novel coronavirus
Creative Commons Attribution 4.0 International

Mean read quality calculation in guppyplex.py #50

Open OgnjenMilicevic opened 3 years ago

OgnjenMilicevic commented 3 years ago

In guppyplex.py you have a formula for calculating the mean read quality:

def get_read_mean_quality(record):
    return -10 * log10((10 ** (pd.Series(record.letter_annotations["phred_quality"]) / -10)).mean())

This is technically correct if one wants the mean on the probability scale, but aren't these scores meant to be averaged on the log scale (Phred scale)? Averaging error probabilities pulls the result strongly towards the lowest qualities. For example, the quality scores [10,10,10,10,10,10,10,10,2,1] have a plain arithmetic mean of 8.3 on the Phred scale, but your calculation gives 6.53.
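To make the difference concrete, here is a minimal sketch that computes both averages for the example above. The function names are illustrative and do not come from guppyplex.py, and it uses only the standard library rather than pandas, so treat it as a demonstration of the arithmetic and not as the pipeline's code.

```python
from math import log10


def mean_quality_probability_scale(quals):
    """Convert each Phred score to an error probability, average the
    probabilities, then convert the mean back to a Phred score."""
    mean_error = sum(10 ** (q / -10) for q in quals) / len(quals)
    return -10 * log10(mean_error)


def mean_quality_phred_scale(quals):
    """Plain arithmetic mean of the Phred scores themselves."""
    return sum(quals) / len(quals)


quals = [10, 10, 10, 10, 10, 10, 10, 10, 2, 1]

print(round(mean_quality_probability_scale(quals), 2))  # 6.53
print(round(mean_quality_phred_scale(quals), 2))        # 8.3
```

The two low-quality bases (Q2 and Q1) have error probabilities of about 0.63 and 0.79, which dominate the eight Q10 bases at 0.1 each. That is why the probability-scale mean drops below the threshold of 7 while the arithmetic mean stays above it.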

None of my reads pass your default filter of 7, for example: [4, 7, 6, 10, 3, 6, 14, 3, 3, 11, 5, 7, 5, 3, 3, 11, 6, 11, 2, 4, 4, 2, 5, 2, 3, 3, 7, 10, 3, 15, 15, 4, 3, 4, 13, 4, 4, 15, 2, 5, 8, 10, 3, 4, 3, 3, 2, 4, 5, 5, 5, 5, 2, 2, 5, 3, 3, 6, 4, 3, 2, 3, 9, 2, 5, 9, 4, 3, 4, 5, 5, 11, 10, 2, 4, 9, 2, 2, 2, 3, 5, 4, 4, 3, 10, 7, 3, 6, 5, 5, 3, 6, 10, 4, 4, 4, 4, 3, 8, 3, 9, 5, 9, 2, 7, 6, 3, 7, 4, 4, 3, 4, 2, 4, 4, 11, 4, 6, 2, 3, 3, 4, 6, 3, 4, 6, 4, 3, 5, 4, 2, 2, 4, 4, 3, 7, 2, 7, 3, 7, 8, 8, 4, 2, 8, 3, 4, 3, 4, 2, 2, 3, 10, 3, 3, 2, 7, 5, 8, 11, 3, 4, 4, 1, 2, 2, 4, 6, 2, 5, 5, 2, 3, 4, 3, 8, 2, 3, 4, 3, 4, 8, 6, 13, 13, 6, 10, 20, 17, 8, 4, 6, 7, 8, 5, 4, 3, 4, 4, 3, 3, 6, 3, 7, 6, 6, 7, 5, 3, 3, 5, 2, 3, 10, 6, 5, 3, 4, 5, 5, 7, 4, 4, 13, 7, 2, 3, 5, 2, 7, 10, 3, 4, 5, 5, 4, 10, 7, 4, 3, 8, 4, 2, 2, 1, 4, 5, 15, 6, 4, 3, 4, 3, 3, 7, 10, 4, 4, 8, 6, 5, 2, 3, 3, 3, 8, 6, 5, 6, 6, 10, 7, 11, 10, 11, 10, 8, 6, 8, 4, 6, 2, 4, 7, 8, 2, 4, 6, 3, 12, 9, 4, 10, 10, 4, 8, 4, 3, 3, 9, 12, 5, 12, 1, 3, 6, 2, 3, 1, 2, 3, 3, 2, 7, 9, 3, 9, 13, 9, 3, 8, 6, 7, 2, 8, 7, 2, 9, 7, 3, 2, 5, 3, 4, 2, 3, 2, 3, 3, 4, 4, 4, 2, 7, 2, 3, 7, 8, 3, 4, 4, 2, 5, 7, 5, 4, 4, 3, 7, 5, 7, 5, 12, 13, 3, 14, 12, 6, 8, 8, 12, 5, 4, 5, 7, 5, 8, 4, 6, 11, 8, 16, 17, 12, 3, 6, 4, 3, 3, 1, 5, 14, 20, 13, 7, 4, 3, 2, 6, 3, 2, 5, 7, 10, 3, 2, 7, 3, 6, 5, 4, 10, 5, 4, 5, 2, 3, 2, 3, 9, 3, 4, 3, 5, 4, 6, 8, 8, 10, 4, 3, 1, 4, 2, 5, 3, 5, 3, 4, 3, 5, 4]

In the meantime, this is my quality tab from MinKNOW: [image]

Is this intended?

MaestSi commented 3 years ago

Hi, that is actually very interesting, and to me it looks like an unintended feature. Maybe it would be better to rely on the quality filtering performed by tools such as NanoFilt? Simone