broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.68k stars 587 forks source link

Obsolete hard-coded default base quals for overlapping mates #4958

Closed davidbenjamin closed 5 years ago

davidbenjamin commented 6 years ago

Every overlapping read pair in HaplotypeCaller and Mutect2 goes through FragmentUtils.adjustQualsOfOverlappingPairedFragments, which has the following hard-coded logic:

if ( firstReadBase == secondReadBase ) {
   firstReadQuals[firstReadIndex] = Math.min(firstReadQuals[firstReadIndex], HALF_OF_DEFAULT_PCR_ERROR_QUAL);
   secondReadQuals[i] = Math.min(secondReadQuals[i], HALF_OF_DEFAULT_PCR_ERROR_QUAL);
} else {
   firstReadQuals[firstReadIndex] = 0;
   secondReadQuals[i] = 0;
}

This makes sense -- we modify the quals to reflect that sequencing errors are independent but PCR errors are not. However, HALF_OF_DEFAULT_PCR_ERROR_QUAL is 20, so if we start out with high-quality UMI reads we basically kill the quality and with it our SNV sensitivity. Note that in the very important application of cfDNA basically every base is an overlap.

What's worse is that Mutect2 later throws out one of the reads, so that we could start out with two BQ = 60 reads and end up with one BQ = 20 read!

@fleharty @yfarjoun This is a problem. I could just make the PCR error quality an adjustable parameter, but can you think of anything else?

yfarjoun commented 6 years ago

first of all, this looks like a nice easter-egg to find! congrats!

I agree with you that this is a real problem. I don't understand the logic, why HALF of PCR_ERROR_QUAL? if that's really 20, then it's way too low!!

I also object to the second part (when the bases disagree.) imagine that one base is A@Q2 and the other is T@Q60...why would you put both bases to Q0 in that case?

I think that when the bases agree they should essentially add up, but no more than "PCR_ERROR" (but not lose q-score if they are already higher then PCR_ERROR). If they differ, the bigger one should lose the points that the lower one has.....though that doesn't take into account the PCR error which needs to be thought out more carefully. We should take some time to figure out the model that allows for PCR error and then derive the posterior posteriors from that....

davidbenjamin commented 6 years ago

I agree with you that this is a real problem. I don't understand the logic, why HALF of PCR_ERROR_QUAL? if that's really 20, then it's way too low!!

The intent of this code is that downstream code will not do anything special to avoid over-counting overlapping mates, so that assigning half of the PCR qual effectively gives the full PCR qual for the fragment. Of course, this assumption is wrong in the case of M2.

I also object to the second part (when the bases disagree.) imagine that one base is A@Q2 and the other is T@Q60...why would you put both bases to Q0 in that case?

Good point.

We should take some time to figure out the model that allows for PCR error and then derive the posterior posteriors from that...

Are you suggesting something like there are binary indicators for PCR error, read 1 sequencing error, read 2 sequencing error, with priors given by the PCR and base qualities, and we want the posteriors of these indicators given that the bases agree / disagree?

yfarjoun commented 6 years ago

Yes. We should expect that PCR error shouldn't affect the base-quality, so two high quality, disagreeing bases are an indication of a PCR error, while one low-quality base, and one high quality base that have differing qualities looks more like a sequencing error. We might be able to obtain a data-driven model for that using the overlapping bases themselves (over monomorphic sites).

The only problem is that this is only true when the reads haven't been processed by Consensus calling....but if we have a good model for consensus calling within haplotype caller we could avoid doing that upfront and simply deal with everything within haplotype caller. That would be ideal!

magicDGS commented 6 years ago

Time ago, I added a method to fix the qualities of overlapping mates in the same way as samtools because I needed it for other reasons:

https://github.com/broadinstitute/gatk/blob/8d929ed89984137fc00009b322074f7a0908d2a5/src/main/java/org/broadinstitute/hellbender/utils/pileup/ReadPileup.java#L310-L349

Maybe you can have a look to it as a starting point. Although I am not sure about which one is the logic for the agreement between bases (sum of qualities) or disagreement (SAMTOOLS_OVERLAP_LOW_CONFIDENCE factor), I think that what they are doing is more reasonable the FragmentUtils.adjustQualsOfOverlappingPairedFragments.

gmagoon commented 5 years ago

As a user of HaplotypeCaller, I've found that the GQs for homozygous genotypes can be significantly overestimated (by as much as a factor of two, on phred scale) in the presence of overlapping read pairs. The fragment gets double-counted (once for each read) in the genotype likelihood calculation, as HaplotypeCaller doesn't seem to apply anything like the filterOverlappingReads step used in Mutect2. I haven't been able to figure out the rationale behind the FragmentUtils.adjustQualsOfOverlappingPairedFragments approach, at least in the context of the genotype likelihood calculation (maybe it helps somehow for the preceding read likelihood calculation). The approach of @magicDGS / samtools makes much more sense to me, as does the filterOverlappingReads mentioned by @davidbenjamin apparently used in Mutect2, and the separate consideration of overlapping reads when calculating genotype likelihoods apparently used in UnifiedGenotyper.