Closed davidbenjamin closed 7 years ago
I think @yfarjoun and I discussed this a long time ago. In theory it would speed up the really simple cases. I think the main blocker in implementing it would be the complexity of the existing code, though I don't know how much you cleaned up the GATK4 version. We would also want to make sure that we only implement this optimization if that SNP is the only SNP on the haplotype. In cases where the haplotype has multiple SNPs and the phasing is poor, this could artificially inflate the likelihoods. Although we've seen that the graph traversal frequently breaks phasing then generating haplotypes anyway, so maybe I overestimate our current likelihood accuracy.
Anyway, take my advice with a grain of salt. It's just some musings from a bored and somewhat sleep-deprived mom with a sleeping baby on her lap.
I think the main blocker in implementing it would be the complexity of the existing code, though I don't know how much you cleaned up the GATK4 version.
We refactored all the engine stuff shared with HaplotypeCaller
to be very distinct from the somatic genotyping logic, so the only complexity would be in local assembly and PairHMM. Which could be significant, of course.
We would also want to make sure that we only implement this optimization if that SNP is the only SNP on the haplotype. . .Although we've seen that the graph traversal frequently breaks phasing
The specific case I had in mind is when you have a bubble or something more complex in the graph, followed by a stretch of reference (i.e. all haplotypes have nothing going on here), followed (or not) by more activity. It seems reasonable in that case to chop each active area into its own haplotype(s), which is equivalent to pinning the ref-only area to be ref-only in PairHMM. I believe but could be wrong that in a case like this our assembly would not respect phasing between the two active areas anyway, so we lose nothing.
By the way, I should clarify that the idea is not to truncate the ActiveRegion
, but rather to break it into a few small haplotypes semi-intelligently after building the whole deBruijn graph.
It could well be that my optimism is ill-founded. Nonetheless, having a quick-and-dirty mode would be very useful for the following purposes where you need to run M2 a lot and don't need perfection:
"Quick and dirty" would be useful for testing changes, but the PoN and training sets shouldn't be recreated very often so there's less savings.
I hate to leverage the fact that we break phasing to optimize things because I dream of a future where HaplotypeCaller actually calls haplotypes (as you've already added an issue for).
Issue moved to broadinstitute/gatk #2945 via ZenHub
It might be sufficient, especially for SNP calling, to run
PairHMM
over s small number of bases, say 20 or so, surrounding a variant. This might make sense forHaplotypeCaller
as well. . .