bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Question about SpvFreq filter for varscan in paired tumor-normal mode #2345

Closed: tb08 closed this issue 6 years ago

tb08 commented 6 years ago

Hello bcbio folks, I am having trouble understanding the rationale behind the SpvFreq filter in varscan.py. Its docstring says:

"""Filter VarScan calls based on the SPV value and frequency. Removes calls with SPV < 0.05 and a tumor FREQ > 0.35. False positives dominate these higher frequency, low SPV calls. They appear to be primarily non-somatic/germline variants not removed by other filters. """

I don't understand why one would want to filter out such variants, which seem very convincing to me. Shouldn't the filter instead be SPV > 0.05 and tumor FREQ > 0.35? My understanding is that the SPV is a Fisher p-value comparing normal and tumor reads, and that the smaller the p-value, the more convincing the supporting read counts are. I would understand not trusting high-VAF variants with low coverage, but wouldn't those tend to have less significant, hence larger, somatic p-values? Or am I missing something? Best regards!

chapmanb commented 6 years ago

Thanks for the question. The original idea behind this filter is that VarScan was letting through high quality (low SPV), high frequency variants that were all false positives, and this helped remove them. I agree that it's a non-intuitive result, but it helped remove lots of false positives in validations.

Are you finding issues with this filter? We haven't explored or done much work with VarScan in a while since we typically use other callers, so I'm happy to explore removing or improving this filter. Thanks again for the feedback.

tb08 commented 6 years ago

Hello Brad, Thanks for the explanation. From what I have been able to see, this filter indeed removes a large number of likely germline variants that are poorly covered (which is probably why they have low or zero VAF in the normal due to random sampling and probably why their SPV manages to reach the 5% significance threshold). One typical example:

GT:AD:DP:DP4:FREQ:RD 0/0:1:10:9,0,1,0:0.1:9 0/1:7:10:3,0,7,0:0.7:3 with SPV=0.0098833 and FREQ=0.7
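As a rough check on how such a low-coverage call reaches significance, the SPV above can be recomputed from the read counts. The sketch below assumes that the SPV is a one-sided Fisher's exact test on the 2x2 table of reference and variant reads in normal versus tumor; the function name `somatic_p_value` is made up for illustration, and this is not VarScan's own code.

```python
from math import comb


def somatic_p_value(normal_ref, normal_var, tumor_ref, tumor_var):
    """One-sided Fisher's exact test (assumed SPV definition).

    Probability of seeing at least `tumor_var` variant reads in the tumor
    if variant reads were distributed between the two samples by chance.
    """
    total_var = normal_var + tumor_var
    total_ref = normal_ref + tumor_ref
    tumor_depth = tumor_ref + tumor_var
    # Number of ways to pick the tumor reads out of all reads.
    all_draws = comb(total_var + total_ref, tumor_depth)
    # Sum the hypergeometric tail: k variant reads in tumor, k >= observed.
    tail = sum(
        comb(total_var, k) * comb(total_ref, tumor_depth - k)
        for k in range(tumor_var, min(total_var, tumor_depth) + 1)
    )
    return tail / all_draws


# Counts taken from the DP4 fields of the example above:
# normal 9,0,1,0 -> 9 ref / 1 var; tumor 3,0,7,0 -> 3 ref / 7 var.
spv_low_cov = somatic_p_value(9, 1, 3, 7)
print(f"{spv_low_cov:.7f}")  # prints 0.0098833
```

Under that assumption, 20 reads in total are enough to get below 0.05, which is consistent with random sampling in a thinly covered normal producing an apparently somatic call.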

However it also removes a number of high quality, high VAF somatic variants that are for example kept in the Mutect2 VCF and some of which are known cancer variants. For example:

GT:AD:DP:DP4:FREQ:RD 0/0:0:77:40,37,0,0:0:77 1/1:43:56:8,5,28,15:0.7679:13 with SPV=1.1406e-23 and FREQ=0.7679

GT:AD:DP:DP4:FREQ:RD 0/0:0:535:384,151,0,0:0:535 1/1:269:352:65,18,209,60:0.7642:83 with SPV=0 and FREQ=0.7642

Typically those have much smaller SPV than the first ones.

If I am not mistaken, any well-supported variant with a VAF > 35% will be excluded by this filter. True somatic variants with such a high VAF will be common in samples with high tumor cellularity and in cell lines. Moreover, even if the tumor cellularity does not approach 70%, such high VAFs will still routinely and genuinely occur: for example, double hits with one allele deleted and the other one mutated, or variants on the X chromosome in males.

I think I would therefore drop this filter, and maybe replace it with some post-filtering based on coverage/read counts. I see that you did not include VarScan among the callers in your best-practice YAML, and I just noticed that you introduced Strelka2; I should give it a try. Thanks again.
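To make the point concrete, here is a minimal sketch of the rule as the docstring states it (SPV < 0.05 and tumor FREQ > 0.35), applied to the three examples quoted in this comment. The function names are hypothetical and this is not the bcbio implementation. The second function is only an illustration of what a coverage-aware variant of the rule could look like; its `min_normal_depth=20` cutoff is an arbitrary assumption, not a validated threshold.

```python
def spv_freq_flagged(spv, tumor_freq, spv_max=0.05, freq_min=0.35):
    """Rule as described in the docstring: flag low-SPV, high-frequency calls."""
    return spv < spv_max and tumor_freq > freq_min


def low_normal_depth_flagged(spv, tumor_freq, normal_depth, min_normal_depth=20):
    """Hypothetical alternative: apply the rule only when the normal sample is
    too thinly covered to rule out a germline variant. Cutoff is arbitrary."""
    return spv_freq_flagged(spv, tumor_freq) and normal_depth < min_normal_depth


# (SPV, tumor FREQ, normal DP) for the three examples in this comment.
examples = {
    "low coverage, likely germline": (0.0098833, 0.7, 10),
    "somatic, normal DP 77": (1.1406e-23, 0.7679, 77),
    "somatic, normal DP 535": (0.0, 0.7642, 535),
}

current = {name: spv_freq_flagged(spv, freq) for name, (spv, freq, _) in examples.items()}
depth_aware = {name: low_normal_depth_flagged(*vals) for name, vals in examples.items()}

for name in examples:
    print(f"{name}: current rule={current[name]}, depth-aware={depth_aware[name]}")
```

The rule as written flags all three calls, including both well-supported ones, whereas the depth-aware sketch flags only the low-coverage call. Whether a depth cutoff generalizes would need validation on real data.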

chapmanb commented 6 years ago

Thanks much for this feedback. I agree with your assessment and don't want to remove high quality variants in samples with high cellularity and few subclones. While removing VarScan false positives is still a goal, this filter seems overly aggressive, so we'll remove it from the standard workflow.

In general, I agree with your approach of looking at Strelka2 instead of VarScan. We've been using VarDict, Strelka2 and MuTect2 most often for somatic calling, and welcome feedback on those as well if you run into any issues.

Thank you again.