broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

HaplotypeCaller phasing only has 90% sensitivity to adjacent SNPs #3368

Open ldgauthier opened 7 years ago

ldgauthier commented 7 years ago

Emma, a summer student in the MacArthur lab, did an analysis of the HaplotypeCaller phasing from GATK3.4 (which I believe is unchanged since then) using the gnomAD exomes and genomes.

The goal of that feature is to provide enough information to change the representation of adjacent SNPs to MNPs for more accurate functional annotation. However, only 90% of adjacent SNPs have phasing information. Analysts would prefer 100% of adjacent SNPs to have phasing information with a quality estimate.
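To make the intended use concrete, here is a minimal sketch of how a downstream tool might collapse adjacent SNPs into MNPs using the phasing annotations. It assumes that two calls are on the same haplotype when they share both `PID` and `PGT`. The `Snp` class and `merge_adjacent_phased_snps` function are hypothetical names for illustration, not part of GATK.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Snp:
    """Hypothetical minimal het call for one sample; not a GATK type."""
    chrom: str
    pos: int
    ref: str
    alt: str
    pid: Optional[str]  # phase set ID, None when the VCF shows "."
    pgt: Optional[str]  # phased genotype such as "0|1", None when "."


def merge_adjacent_phased_snps(snps: List[Snp]) -> List[Snp]:
    """Merge directly adjacent calls that share PID and PGT into one MNP.

    Assumption: same PID plus same PGT means the alt alleles are in cis.
    Calls without phasing information are left untouched.
    """
    merged: List[Snp] = []
    for snp in sorted(snps, key=lambda s: (s.chrom, s.pos)):
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev.chrom == snp.chrom
            and prev.pos + len(prev.ref) == snp.pos  # directly adjacent
            and snp.pid is not None
            and snp.pgt is not None
            and prev.pid == snp.pid
            and prev.pgt == snp.pgt
        ):
            # Extend the previous record so runs of 3+ SNPs also collapse.
            merged[-1] = Snp(prev.chrom, prev.pos, prev.ref + snp.ref,
                             prev.alt + snp.alt, prev.pid, prev.pgt)
        else:
            merged.append(snp)
    return merged
```

With this approach, an adjacent pair that lacks `PID`/`PGT` cannot be merged safely, which is why the missing 10% matters for annotation.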

As a side note, most of the gnomAD exomes are 75bp reads and the maximum graph assembly kmer size is 65bp, so it's interesting that there is some phasing information output for SNPs as far as 310 bases apart (not on the graph, but that's what Emma said), especially considering that HaplotypeCaller is not mate-aware. Without digging into the data, I'm guessing these are cases where there's another het SNP in between.

epiercehoffman commented 7 years ago

Thanks Laura for adding the ticket! Quick correction: the largest distance I observed between a pair of heterozygous variants in the same phase group (same PID) was 219 bp. That was in a phase group with multiple heterozygous variants.
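For anyone who wants to repeat this kind of measurement, here is a rough sketch of computing the span of each phase group. It assumes the het calls for one sample on one contig have already been reduced to `(pos, pid)` pairs. The function name and the positions in the usage lines are made up for illustration and are not the gnomAD data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def span_per_phase_group(
    calls: Iterable[Tuple[int, Optional[str]]]
) -> Dict[str, int]:
    """Return the distance in bp between the first and last call per PID.

    Calls whose PID is None or "." (unphased) are skipped.
    """
    positions = defaultdict(list)
    for pos, pid in calls:
        if pid is not None and pid != ".":
            positions[pid].append(pos)
    return {pid: max(p) - min(p) for pid, p in positions.items()}


# Invented example: one phase group with three het calls, one unphased call.
calls = [(1000, "1000_A_G"), (1040, "1000_A_G"), (1150, "1000_A_G"), (5000, ".")]
spans = span_per_phase_group(calls)
largest = max(spans.values())
```

A group with an intermediate het variant can span further than a single read-length window, because each neighboring pair only needs to be linked locally.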

ldgauthier commented 7 years ago

Oops, thanks for correcting and clarifying @epiercehoffman !

sooheelee commented 6 years ago

I have a user on the forum asking whether being unable to phase MNPs is intended. https://gatkforums.broadinstitute.org/gatk/discussion/11122/pgt-and-pid-is-a-dot#latest

The PGT:PID shows up as .:. for these trailing SNPs on the same reads as upstream same-phased SNPs.
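To check for this pattern in a VCF, a small sketch like the following pulls `PGT` and `PID` out of the FORMAT and sample columns of a single record. It relies only on the colon-separated layout of those columns. The function name and the example strings are hypothetical.

```python
from typing import Optional, Tuple


def phasing_fields(format_col: str, sample_col: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (PGT, PID) for one sample, with None for "." or absent keys."""
    keys = format_col.split(":")
    values = sample_col.split(":")
    # zip stops at the shorter list, so dropped trailing fields are tolerated.
    fields = dict(zip(keys, values))
    pgt = fields.get("PGT", ".")
    pid = fields.get("PID", ".")
    return (None if pgt == "." else pgt, None if pid == "." else pid)


# Invented example columns: one phased call and one showing ".:.".
phased = phasing_fields("GT:AD:DP:GQ:PGT:PID:PL", "0/1:10,8:18:99:0|1:100_A_G:255,0,255")
unphased = phasing_fields("GT:AD:DP:GQ:PGT:PID:PL", "0/1:10,8:18:99:.:.:255,0,255")
```

A trailing SNP is unphased in this sense when both returned values are `None`.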

ldgauthier commented 6 years ago

It's hard to say without seeing all of the data. I answered on the forum.

nh13 commented 5 years ago

@ldgauthier any updates on a solution? We have an example of a clinically reportable variant that matches #5824.

ldgauthier commented 5 years ago

Our long term solution is a rather large modification to the graph assembly code: https://github.com/broadinstitute/gatk/issues/5828

That will likely take a couple of months, but we fully expect a dramatic improvement in phasing. Since we're working on that, spending time on a quick fix would only make the long-term fix take longer.

nh13 commented 5 years ago

Thanks! I’ll keep watching. Please let me know if you have a version I can try later this year.