Closed by tb08 6 years ago
Thanks for the question. The original idea behind this filter is that VarScan was letting through high quality (low SPV), high frequency variants that were all false positives, and this helped remove them. I agree that it's a non-intuitive result, but it helped remove lots of false positives in validations.
Are you finding issues with this filter? We haven't done much work with VarScan in a while, since we typically use other callers, so we're happy to explore removing or improving this filter. Thanks again for the feedback.
Hello Brad, Thanks for the explanation. From what I have been able to see, this filter indeed removes a large number of likely germline variants that are poorly covered (which is probably why they have low or zero VAF in the normal due to random sampling and probably why their SPV manages to reach the 5% significance threshold). One typical example:
GT:AD:DP:DP4:FREQ:RD 0/0:1:10:9,0,1,0:0.1:9 0/1:7:10:3,0,7,0:0.7:3 with SPV=0.0098833 and FREQ=0.7
However it also removes a number of high quality, high VAF somatic variants that are for example kept in the Mutect2 VCF and some of which are known cancer variants. For example:
GT:AD:DP:DP4:FREQ:RD 0/0:0:77:40,37,0,0:0:77 1/1:43:56:8,5,28,15:0.7679:13 with SPV=1.1406e-23 and FREQ=0.7679
GT:AD:DP:DP4:FREQ:RD 0/0:0:535:384,151,0,0:0:535 1/1:269:352:65,18,209,60:0.7642:83 with SPV=0 and FREQ=0.7642
Typically those have much smaller SPV than the first ones.
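As a rough check on how SPV relates to the read counts above, here is a minimal sketch that assumes SPV is a one-sided Fisher exact test on the normal and tumor ref/alt counts (RD and AD in the records). The function name `somatic_p_value` is made up for illustration and is not VarScan's or bcbio's code. Under that assumption, the first example gives a value that agrees with the reported SPV of 0.0098833, and the well covered example gives a far smaller one.

```python
from math import comb

def somatic_p_value(normal_ref, normal_alt, tumor_ref, tumor_alt):
    """One-sided Fisher exact p-value (assumed definition of SPV).

    Probability of seeing at least `tumor_alt` variant reads in the tumor
    if variant reads were spread at random between normal and tumor.
    """
    total_ref = normal_ref + tumor_ref
    total_alt = normal_alt + tumor_alt
    tumor_depth = tumor_ref + tumor_alt
    denom = comb(total_ref + total_alt, tumor_depth)
    k_max = min(total_alt, tumor_depth)
    # Sum the hypergeometric tail from the observed count upwards.
    numer = sum(
        comb(total_alt, k) * comb(total_ref, tumor_depth - k)
        for k in range(tumor_alt, k_max + 1)
    )
    return numer / denom

# Counts taken from the RD (ref) and AD (alt) fields quoted in this thread.
low_cov = somatic_p_value(9, 1, 3, 7)        # poorly covered, likely germline
high_cov = somatic_p_value(77, 0, 13, 43)    # well covered, high VAF

print(f"low coverage:  {low_cov:.7f}")
print(f"high coverage: {high_cov:.3e}")
```

With only 10 reads per sample the test cannot go much below 1%, whereas deeper coverage pushes the p-value down by many orders of magnitude, which is consistent with the observation that the convincing calls have the smallest SPV.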
If I am not mistaken, any well-supported variant with a VAF > 35% will be excluded by this filter. True somatic variants with such high VAF will be common in samples with high tumor cellularity or in cell lines. Moreover, even if the tumor cellularity does not approach 70%, such high VAFs will still routinely and genuinely occur: for example, double hits with one allele deleted and the other mutated, or variants on the X chromosome in males. I would therefore drop this filter and perhaps replace it with some post-filtering based on coverage/read counts. I also see that you did not include VarScan among the callers in your best-practice YAML, and I just noticed that you introduced Strelka2, so I should give it a try. Thanks again.
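To make the comparison concrete, here is a small sketch of the two ideas. The SpvFreq condition is reconstructed from this discussion (SPV below 0.05 and tumor FREQ above 0.35), not copied from varscan.py, and the depth-based alternative with its `min_depth=20` cutoff is purely hypothetical. Both function names are invented for illustration.

```python
# Thresholds inferred from this thread, not taken from varscan.py.
SPV_MAX = 0.05
TUMOR_FREQ_MIN = 0.35

def spvfreq_flagged(spv, tumor_freq):
    """Filter as understood here: significant SPV and high tumor VAF."""
    return spv < SPV_MAX and tumor_freq > TUMOR_FREQ_MIN

def low_depth_flagged(normal_depth, tumor_depth, min_depth=20):
    """Hypothetical post-filter: flag calls with shallow coverage."""
    return normal_depth < min_depth or tumor_depth < min_depth

# (label, SPV, tumor FREQ, normal DP, tumor DP) from the records quoted above.
examples = [
    ("low-coverage, likely germline", 0.0098833, 0.7, 10, 10),
    ("high-VAF somatic, 56x tumor", 1.1406e-23, 0.7679, 77, 56),
    ("high-VAF somatic, 352x tumor", 0.0, 0.7642, 535, 352),
]

results = [
    (label, spvfreq_flagged(spv, freq), low_depth_flagged(n_dp, t_dp))
    for label, spv, freq, n_dp, t_dp in examples
]

for label, by_spvfreq, by_depth in results:
    print(f"{label}: SpvFreq={by_spvfreq}, low depth={by_depth}")
```

On these three records the reconstructed SpvFreq condition flags all of them, while the depth cutoff flags only the poorly covered one. The cutoff value would need tuning against validation data before being used for real filtering.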
Thanks much for this feedback. I agree with your assessment and don't want to remove high quality variants in samples with high cellularity and few subclones. While removing VarScan false positives is still a goal, this filter seems overly aggressive, so we'll remove it from the standard workflow.
In general, I agree with your approach of looking at strelka2 instead of VarScan. We've been using VarDict, Strelka2 and MuTect2 most often for somatic calling, and welcome feedback on those as well if you run into any issues.
Thank you again.
Hello bcbio folks, I am having trouble understanding the rationale behind the SpvFreq filter in varscan.py: It says:
I don't understand why one would want to filter such variants, which to me seem like very convincing ones. Shouldn't the filter instead be: SPV > 0.05 and tumor FREQ > 0.35? My understanding is that the SPV is a Fisher p-value comparing normal and tumor reads, and that the smaller the p-value, the more convincing the supporting read counts are. I would understand not trusting high VAF variants with low coverage, but wouldn't those tend to have less significant, hence bigger, Somatic P-Values? Or am I missing something? Best regards!