broadinstitute / gatk-protected

Obsolete/Legacy GATK repository -- go to https://github.com/broadinstitute/gatk instead
BSD 3-Clause "New" or "Revised" License
33 stars 20 forks source link

Speed up Mutect with small haplotypes #909

Closed davidbenjamin closed 7 years ago

davidbenjamin commented 7 years ago

It might be sufficient, especially for SNP calling, to run PairHMM over s small number of bases, say 20 or so, surrounding a variant. This might make sense for HaplotypeCaller as well. . .

ldgauthier commented 7 years ago

I think @yfarjoun and I discussed this a long time ago. In theory it would speed up the really simple cases. I think the main blocker in implementing it would be the complexity of the existing code, though I don't know how much you cleaned up the GATK4 version. We would also want to make sure that we only implement this optimization if that SNP is the only SNP on the haplotype. In cases where the haplotype has multiple SNPs and the phasing is poor, this could artificially inflate the likelihoods. Although we've seen that the graph traversal frequently breaks phasing then generating haplotypes anyway, so maybe I overestimate our current likelihood accuracy.

Anyway, take my advice with a grain of salt. It's just some musings from a bored and somewhat sleep-deprived mom with a sleeping baby on her lap.

davidbenjamin commented 7 years ago

I think the main blocker in implementing it would be the complexity of the existing code, though I don't know how much you cleaned up the GATK4 version.

We refactored all the engine stuff shared with HaplotypeCaller to be very distinct from the somatic genotyping logic, so the only complexity would be in local assembly and PairHMM. Which could be significant, of course.

We would also want to make sure that we only implement this optimization if that SNP is the only SNP on the haplotype. . .Although we've seen that the graph traversal frequently breaks phasing

The specific case I had in mind is when you have a bubble or something more complex in the graph, followed by a stretch of reference (i.e. all haplotypes have nothing going on here), followed (or not) by more activity. It seems reasonable in that case to chop each active area into its own haplotype(s), which is equivalent to pinning the ref-only area to be ref-only in PairHMM. I believe but could be wrong that in a case like this our assembly would not respect phasing between the two active areas anyway, so we lose nothing.

By the way, I should clarify that the idea is not to truncate the ActiveRegion, but rather to break it into a few small haplotypes semi-intelligently after building the whole deBruijn graph.

It could well be that my optimism is ill-founded. Nonetheless, having a quick-and-dirty mode would be very useful for the following purposes where you need to run M2 a lot and don't need perfection:

ldgauthier commented 7 years ago

"Quick and dirty" would be useful for testing changes, but the PoN and training sets shouldn't be recreated very often so there's less savings.

I hate to leverage the fact that we break phasing to optimize things because I dream of a future where HaplotypeCaller actually calls haplotypes (as you've already added an issue for).

droazen commented 7 years ago

Issue moved to broadinstitute/gatk #2945 via ZenHub