broadinstitute / gatk-protected

Obsolete/Legacy GATK repository -- go to https://github.com/broadinstitute/gatk instead
BSD 3-Clause "New" or "Revised" License
33 stars 20 forks source link

Somatic contamination downsampling #814

Closed davidbenjamin closed 7 years ago

davidbenjamin commented 7 years ago

Our contamination downsampling code is somewhat intelligent. AlleleBiasedDownsamplingUtils contains methods to downsample the most suspicious reads up to a given contamination fraction. However, its criterion for suspicion is a non-diploid allele fraction. AlleleBiasedDownsamplingUtils::scoreAlleleCounts uses the following metric of goodness, with smaller values being considered better:

Math.min(maxCount - nextBestCount + remainderCount, Math.abs(nextBestCount + remainderCount))

where maxCount, nextBestCount, and remainderCount are the number of reads supporting the most common allele, the next most common allele, and all other alleles. This score basically says "if it's heterozygous, return the difference between ref and alt counts. If it's homozygous, return the number of reads not supporting the unique true allele." The problem for somatic calling is that any time there is a somatic variant with low allele fraction this criterion is optimized by assuming homozygosity and downsampling only the variant.

It could be that instead of implementing somatic downsampling we should give up on distinguishing contamination from somatic variation and simply filter out calls whose allele fraction is comparable to the contamination. But perhaps there's something more clever we could do. @takutosato @ldgauthier @LeeTL1220 do you have thoughts?

ldgauthier commented 7 years ago

image Based on Chip's screenshot that I think was included with the old issue, the M1 contamination handling seems to do something similar to that proposal and that was the version he was happier with.

davidbenjamin commented 7 years ago

Closing this, because Chip is right. M1 is right, and the correct approach is not to downsample. We should just pass the contamination to the Mutect filtering engine and let it decide. A reasonable place to start is to simply filter variants with allele fraction less than the contamination, but we can later take it a step further and account for the population allele frequency. For example, if contamination is 0.01 and we have a exonic variant with allele fraction 0.01 (I'm thinking ahead to cfDNA here so assume the depth is really high) that's not in ExAC, we know that it's population allele frequency can't be much more than one in a million. And the chances of it being homozygous alt in the contaminant are really tiny. Therefore, we would not throw out this variant.