AstraZeneca-NGS / VarDictJava

VarDict Java port
MIT License

Signal/Noise (highQualityToLowQualityRatio) calculation misleading when no low/bad qual reads are present #381

Open rollf opened 1 year ago

rollf commented 1 year ago

Hi,

the docs say

[image: formula from the docs]

I don't understand the 0.5 in the formula and it seems to me this is not consistent with the actual implementation:

tvref.highQualityToLowQualityRatio = hicnt / (locnt != 0 ? locnt : 0.5d);

With this implementation, 73 high/good qual reads and 104 low/bad qual reads give a ratio of 0.702 (73/104 = 0.7019...). If, on the other hand, there are no low/bad qual reads (i.e. locnt == 0, which can be achieved with -q 0), the formula becomes hicnt / 0.5, which is effectively hicnt * 2. With the example numbers above we would then have 73 + 104 = 177 high/good qual reads (the 104 now count as 'good') and thus a ratio of 177 * 2 = 354.0! In that case the ratio depends only on the number of reads, so it will differ between variants that have different numbers of reads at their position. Conceptually, the formula could be

tvref.highQualityToLowQualityRatio = locnt == 0 ? Double.POSITIVE_INFINITY : (double) hicnt / locnt;

I'm not saying this is how it should be, but this is how I would interpret the result. In any case, I guess the docs could be improved.
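
To make the comparison concrete, here is a minimal stand-alone sketch of both expressions applied to the example numbers. The class and method names (RatioSketch, currentRatio, proposedRatio) are hypothetical and not part of VarDictJava. It also assumes that hicnt and locnt are ints, which is why the conceptual variant casts to double before dividing.

```java
// Hypothetical stand-alone sketch; none of these names exist in VarDictJava.
final class RatioSketch {

    // Mirrors the expression quoted from the implementation:
    // a zero low-quality count is replaced by 0.5 in the denominator.
    static double currentRatio(int hicnt, int locnt) {
        return hicnt / (locnt != 0 ? locnt : 0.5d);
    }

    // Mirrors the conceptual alternative: no low-quality reads gives an infinite ratio.
    // The cast avoids integer division, assuming both counts are ints.
    static double proposedRatio(int hicnt, int locnt) {
        return locnt == 0 ? Double.POSITIVE_INFINITY : (double) hicnt / locnt;
    }

    public static void main(String[] args) {
        // 73 high-quality and 104 low-quality reads: both variants give about 0.702.
        System.out.println(currentRatio(73, 104));
        System.out.println(proposedRatio(73, 104));

        // All 177 reads counted as high quality (locnt == 0, e.g. with -q 0).
        System.out.println(currentRatio(177, 0));   // 177 / 0.5 = 354.0
        System.out.println(proposedRatio(177, 0));  // Infinity
    }
}
```

With locnt == 0 the first method simply returns twice hicnt, while the second returns a value that is the same for every variant regardless of read count.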

(Note that this implementation is present in createInsertion() as well as createVariant().)