broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

HaplotypeCaller consult for Laura #4561

Closed: sooheelee closed this issue 4 years ago

sooheelee commented 6 years ago

For @ldgauthier, upon her return from the Montreal workshop. We would like to know if this is expected behavior from HaplotypeCaller.

A researcher has uploaded read-level data to our FTP site. I've recapitulated their results with GATK4 in a dsde-docs issue ticket at https://github.com/broadinstitute/dsde-docs/issues/3008. The data may be private, so please follow up in the dsde-docs repo.

ldgauthier commented 6 years ago

We were still making changes to the assembly in versions 3.2 and 3.3 (for example, https://github.com/broadinstitute/gsa-unstable/pull/582). Nothing is popping out at me as being the breaking change after 3.3, though.

I think the problem is that there are too many haplotypes in that region. There are at least 8 plausible variants, which makes for ~256 (2^8) haplotypes. We pick the "best" 128 to evaluate likelihoods against. Here it seems that what we're choosing as the best doesn't include the SNP. But actually it's not even in the graph.

[image]

(The 280 ref vs 211 split is the het SNP at 89,100,730, so the missing variant should be split out of the big reference string above, but it's not.)
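As a rough illustration of the arithmetic above: 8 biallelic sites give 2^8 = 256 ref/alt combinations, and capping at 128 can drop every haplotype carrying one particular variant. This is only a toy sketch, not GATK's code. The per-site weights, the `toy_score` ranking and the choice of site 3 as the poorly supported SNP are all hypothetical, and it models only the "best 128" hypothesis, not the case where the variant never makes it into the graph.

```python
from itertools import product

N_SITES = 8            # "at least 8 plausible variants" in the region
MAX_HAPLOTYPES = 128   # cap on haplotypes kept for likelihood evaluation


def enumerate_haplotypes(n_sites):
    """Every ref (0) / alt (1) combination across n_sites biallelic sites."""
    return list(product((0, 1), repeat=n_sites))


def keep_best(haplotypes, score, limit):
    """Keep the `limit` highest-scoring haplotypes (toy ranking)."""
    return sorted(haplotypes, key=score, reverse=True)[:limit]


# Hypothetical per-site support; site 3 stands in for a poorly supported SNP.
SITE_WEIGHTS = [1, 1, 1, -10, 1, 1, 1, 1]


def toy_score(haplotype):
    """Sum the weights of the sites where the haplotype carries the alt allele."""
    return sum(w for w, allele in zip(SITE_WEIGHTS, haplotype) if allele)


all_haplotypes = enumerate_haplotypes(N_SITES)
kept = keep_best(all_haplotypes, toy_score, MAX_HAPLOTYPES)

print(len(all_haplotypes))           # 256
print(len(kept))                     # 128
print(any(h[3] == 1 for h in kept))  # False: no kept haplotype carries site 3
```

With these made-up weights, exactly half of the 256 combinations lack the site-3 alt allele and all of them outrank the other half, so the 128 survivors never include that variant.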

The raw graph has the variant on a dangling head (I highlighted the base in the middle path in the figure of the raw_readthreading_graph), but it must not be merged back in properly.

[image]

I wonder if that PR above was the one that changed things. Maybe @vruano will take a look?

chandrans commented 6 years ago

Thank you Laura and Valentin for looking into this. @ldgauthier @vruano

vruano commented 6 years ago

@ldgauthier do you still have the full image for the raw graph around? Is it possible for you to post it without making it blow up the screen? (I guess there might be a markdown option to choose the display size.)

One thing that stops dangling heads/tails from being merged is furcation from the point where they merge into the reference path. For example, if the middle chain containing the SNP and the right chain merge with each other first, before merging into the left/red reference chain, that would prevent the merging of either of the two non-reference branches.
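To make that scenario concrete, here is a toy sketch of the rule as described above. It is an assumed simplification, not the actual GATK read-threading code. The function `can_merge_dangling_head`, the edge-list representation and the vertex names are all hypothetical. A dangling head is treated as recoverable only if its chain runs straight into the reference path without another branch joining it on the way.

```python
from collections import defaultdict


def can_merge_dangling_head(edges, ref_vertices, source):
    """Toy check: walk forward from `source` and return True only if the
    chain reaches a reference vertex without any furcation on the way."""
    successors = defaultdict(list)
    in_degree = defaultdict(int)
    for src, dst in edges:
        successors[src].append(dst)
        in_degree[dst] += 1

    vertex = source
    visited = set()
    while vertex not in ref_vertices:
        if vertex in visited:               # cycle: give up
            return False
        visited.add(vertex)
        if in_degree[vertex] > 1:           # another branch joins here first
            return False
        if len(successors[vertex]) != 1:    # dead end or outgoing fork
            return False
        vertex = successors[vertex][0]
    return True


REF = {"R1", "R2", "R3"}
ref_edges = [("R1", "R2"), ("R2", "R3")]

# A single dangling head A -> B that joins the reference directly at R2.
simple = ref_edges + [("A", "B"), ("B", "R2")]

# Middle chain (M*) and right chain (X*) meet at J before J joins the reference.
furcated = ref_edges + [
    ("M1", "M2"), ("M2", "J"),
    ("X1", "X2"), ("X2", "J"),
    ("J", "R2"),
]

print(can_merge_dangling_head(simple, REF, "A"))     # True
print(can_merge_dangling_head(furcated, REF, "M1"))  # False
print(can_merge_dangling_head(furcated, REF, "X1"))  # False
```

In the second graph neither non-reference branch is recovered, which mirrors the case where the middle and right chains merge with each other before reaching the reference chain.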

ldgauthier commented 6 years ago

Yep, that looks like exactly what happens.

[image]

(If that's not enough context, I can email you the .dot files -- GitHub won't let me attach them.)

I've never understood what it means when a whole branch has weight 1/1 with a dashed arrow, like the rightmost path. Is that just to show that it will get pruned eventually?