HKU-BAL / Clair3

Clair3 - Symphonizing pileup and full-alignment for high-performance long-read variant calling

How did you establish a threshold of 2? #234

Closed · Axze-rgb closed this 8 months ago

Axze-rgb commented 8 months ago

Hello, I have read in several threads here that a rule of thumb is to cut off the quality score at 2. But, unless you did something different from other callers, isn't the quality (QUAL) on a Phred scale, so that a value of 2 means an error probability of >0.5? Can you please clarify? Or did you find that the Phred score is not suitable for long-read applications? (I am aware all those scores rely on underlying assumptions that can turn out to be wrong.) I am not knowledgeable enough to go and understand the code, so please can you clarify how this is possible?

I have read the paper, and Fig. 2 indeed already shows a jump in correct calls, but wrong calls remain in a significant proportion. That figure is about the pileup, though. So, is it that the neural net can make use of a pileup QUAL of 2 and filter out the wrong calls? At the same time, Fig. 2A suggests a threshold of 15 to me 🤔 am I interpreting your figure correctly? Thanks so much, and sorry if it's a stupid question, but I am quite confused. I have a presentation in a few weeks and people will ask me "how are you sure your filters are not too stringent" (to give you some context).

Thanks again. Alex

EDIT: here is your answer advising 2. https://github.com/HKU-BAL/Clair3/issues/116#issuecomment-1152826121

aquaskyline commented 8 months ago

A Phred score of 2 corresponds to an error probability of 10^(-2/10) ≈ 0.63, i.e., ~63% chance of being incorrect. Unlike older variant callers (like GATK UnifiedGenotyper and SOAPsnp) that use hand-crafted Bayesian models, many recent variant callers use neural networks. Bayesian models are much more interpretable, and their QUAL can simply be considered the posterior probability given the observations and priors. Neural networks, however, make QUAL harder to interpret and not easily comparable between tools. Different neural-network-based variant callers calculate QUAL differently, but generally one can read QUAL as 'how likely the neural network thinks the variant candidate is wrong, according to the samples that were used to train the network'. Since the training samples used by different variant callers are very different, QUAL is not easily comparable between callers. So for each caller, the QUAL cutoff is empirically determined. One can benchmark against GIAB truth variants and plot a curve of quality cutoff against F1-score. A sensible cutoff would be at the point with the largest change in slope. One can also plot against precision or recall instead, if higher precision or recall is preferred in a given usage scenario. 2 is what we empirically determined for Clair3.
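For illustration, a minimal sketch of that curve-based procedure might look like the following. It assumes you have already matched your calls against a GIAB truth set so each call carries a QUAL and a true/false label; the `calls` structure, the helper names, and the second-difference elbow heuristic are all assumptions for illustration, not Clair3 code:

```python
import numpy as np

def phred_to_error(q):
    """Phred score to error probability: P = 10^(-Q/10).
    phred_to_error(2) ~= 0.63, matching the ~63% figure above."""
    return 10 ** (-q / 10)

def f1_at_cutoff(calls, truth_total, cutoff):
    """calls: list of (qual, is_true_positive) pairs, labelled by
    matching against a truth set; truth_total: number of truth variants."""
    kept = [(q, tp) for q, tp in calls if q >= cutoff]
    tp = sum(1 for _, is_tp in kept if is_tp)
    fp = len(kept) - tp
    fn = truth_total - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def elbow_cutoff(calls, truth_total, cutoffs):
    """Pick the cutoff where the F1 curve's slope changes the most,
    i.e., the largest absolute second difference along the curve."""
    f1 = np.array([f1_at_cutoff(calls, truth_total, c) for c in cutoffs])
    slope = np.diff(f1)
    return cutoffs[int(np.argmax(np.abs(np.diff(slope)))) + 1]
```

In practice one would plot the whole F1 curve (and the precision and recall curves) and inspect the elbow visually rather than trusting a single heuristic; swapping F1 for precision or recall is a one-line change.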

In Fig. 2A, the cutoff of 16 applies when only the pileup model is used. When running Clair3 in full mode, low-quality variant candidates (say, QUAL < 16) are sent to the full-alignment model. Empirically, we determined that a sensible cutoff for the full-alignment model is 2. Since the high-quality pileup calls and all full-alignment calls are combined for output, and no low-quality pileup call is left (i.e., no pileup calls with QUAL < ~16), the overall cutoff is 2.
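To make the combined-output logic concrete, here is a toy sketch of that cascade. The function names and the thresholds-as-variables are illustrative only, not Clair3's actual internals:

```python
PILEUP_HIGH_QUAL = 16   # pileup calls at or above this are kept directly
FINAL_CUTOFF = 2        # applied to full-alignment calls

def full_mode_cascade(candidates, pileup_model, full_alignment_model):
    """Toy cascade: high-quality pileup calls pass straight through;
    the rest are re-called by the slower full-alignment model."""
    output = []
    for site in candidates:
        call = pileup_model(site)
        if call.qual >= PILEUP_HIGH_QUAL:
            output.append(call)       # no pileup call below 16 survives
        else:
            call = full_alignment_model(site)
            if call.qual >= FINAL_CUTOFF:
                output.append(call)   # hence the overall cutoff is 2
    return output
```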

2 is just a general cutoff. One can choose different cutoffs for SNPs and Indels, and one can determine their own cutoffs using in-house benchmarks.
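As a concrete example of type-specific cutoffs, a post-hoc filter over a Clair3 output VCF could look like the sketch below. The cutoff values and file names are placeholders to be replaced by your own benchmark results, and the parsing assumes a numeric QUAL column:

```python
SNP_CUTOFF = 2.0     # placeholder; determine from your own benchmark
INDEL_CUTOFF = 2.0   # placeholder; may differ from the SNP cutoff

def keep_record(vcf_line):
    """Apply a per-type QUAL cutoff to one VCF data line
    (columns: CHROM POS ID REF ALT QUAL ...)."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alts, qual = fields[3], fields[4].split(","), float(fields[5])
    is_snp = all(len(ref) == 1 and len(a) == 1 for a in alts)
    return qual >= (SNP_CUTOFF if is_snp else INDEL_CUTOFF)

with open("clair3_output.vcf") as src, open("filtered.vcf", "w") as dst:
    for line in src:
        if line.startswith("#") or keep_record(line):
            dst.write(line)
```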

Axze-rgb commented 8 months ago

Thank you, truly fascinating!