broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.68k stars 588 forks source link

HaplotypeCaller homozygous call GQ is overreported in the presence of overlapping read pairs #5436

Open gmagoon opened 5 years ago

gmagoon commented 5 years ago

It seems that the genotype likelihood calculation used by HaplotypeCaller ends up double-counting fragment evidence when the variant is covered by both reads in a read pair. For a homozygous call, this leads to overestimated PL for the heterozygous genotype and ultimately a GQ that is too high for the given read data evidence. For example, for a biallelic variant in a diploid genome covered by a single read pair with the ALT allele, we'd typically expect PL like (x,3,0) (with x some number greater than 3, based on mapping quality, base quality, etc.) and GQ of 3 [assuming uninformative prior such that PL=GP]. But HaplotypeCaller reports PL like (y,6,0) and GQ of 6. So, in certain situations (especially with low coverage depth and/or short fragment size), it appears that the phred-scaled GQ for homozygous calls could be overreported by up to a factor of two.

I understand that some base quality heuristics are applied for any overlapping region of read pairs, prior to the read likelihood calculation (cf. #4958). But these adjustments do not seem to address the double-counting issue for the genotype likelihood calculation.

This may very well be a known issue, and I understand that there may not be an easy fix, but I haven't noticed it explicitly mentioned in the issue tracker (aside from my own comments on #4958). So I figured it was worth describing in a new entry.

davidbenjamin commented 5 years ago

@gmagoon You're right to raise the issue. The tricky thing about it is that overlapping mates' bases are independent evidence as far as sequencing error is concerned, but are not independent as far as PCR error is concerned, so there are reasons to use fragments as the fundamental unit, but also reasons to use reads. This is why a quick fix is probably not going to be satisfactory. A few of us are thinking about something principled. Suggestions are welcome, of course!

gmagoon commented 5 years ago

Thanks @davidbenjamin.

For what it is worth, here are some of my thoughts. From my perspective, it might helpful to separate the discussion into read likelihood calculation and the genotype likelihood calculation. As I understand it, sequencing error is directly relevant to the pairHMM read likelihood calculation. I guess the location where PCR error is most relevant (read likelihood vs genotype likelihood) would depend on whether the read likelihood is computing a "fragment likelihood" (the likely DNA sequence of the fragment being sequenced) or a "haplotype likelihood" (the likely source haplotype for the fragment being sequenced). In any case, as I see it, this particular issue could be interpreted as a shortcoming in only the HaplotypeCaller genotype likelihood calculation, and it would essentially be an issue of double counting. So I'm not sure that a quick fix for this issue is necessarily off the table...Maybe something like one of the following could be used just before or during the genotype likelihood calculation:

As I see it, #4958, which seems to be more related to read likelihood calculation, is where a more involved solution, with more fundamental changes, might be warranted. From my perspective (not being especially familiar with the pairHMM model), an ideal solution would transition the pairHMM from read likelihood to a "fragment likelihood" or "haplotype likelihood" when information from read pairs is available, even if they aren't overlapping. The idea would be that a modified pairHMM model could produce a single fragment (or haplotype) likelihood for a given read pair. Such an approach would unify the issues of "merging" read pairs and phasing in a read-pair aware manner (and potentially also modeling PCR errors). In principle, a "fragment likelihood"-type approach could even incorporate info from corresponding PCR duplicates to improve the results when sequencing error rates are high. This sort of approach could also flow into the genotype likelihood calculation by providing a single, merged fragment likelihood to consider rather than a separate read likelihood for each read. UPDATE: On further thought, it probably wasn't a good idea for me to use the terms fragment likelihood or haplotype likelihood to distinguish the proposed approach, since the read likelihood is effectively already calculating those. Probably a better term would be "read set likelihood", where the read set would consist of paired reads and potentially also any corresponding PCR duplicate reads.

davidbenjamin commented 5 years ago

@gmagoon Between PRs #5794 (merged) PR #5831 (open) this will be handled fairly well in Mutect2 and will be added to a growing list of changes that are probably worth bringing into HaplotypeCaller.

gmagoon commented 5 years ago

Sounds great, thanks for your work on this and for the update, @davidbenjamin !

freeseek commented 5 years ago

I also have the same problem here. I don't really care about the GQ field, but I do care about the AD fields. I get over-counting due to overlapping reads. @davidbenjamin what options should I use with Mutect2 to avoid this issue?

davidbenjamin commented 5 years ago

@freeseek No option currently, but after #5831 goes in and Mutect2 groups paired reads together I intend to refactor our annotation engine to account for this. That way we can have both read-based annotations and pair-based annotations.