broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.69k stars 588 forks source link

Phased variants do not have the same phase set identifier #6845

Closed nh13 closed 3 years ago

nh13 commented 4 years ago

I have four phased variants in close proximity that have the following pattern:

chrA 10 ... GT:PS 0|1:1
chrA 20 ... GT:PS 0|1:2
chrA 30 ... GT:PS 0|1:1
chrA 40 ... GT:PS 0|1:2

These four variants are wholly contained in a single set of reads. There are of course other reads that partially span them.

The first variant is a deletion, while the remaining three are SNVs. Examining the reads, there are two haplotypes since:

  1. Alternate for the 1st and 3rd read
  2. Alternate for the 2nd and 4th read

I would have expected them all to have the same phase set (PS) value.

I have a test case I can share privately (let me know a good email to send it to confidentially).

nh13 commented 4 years ago

Expected this:

chrA 10 ... GT:PS 0|1:1
chrA 20 ... GT:PS 1|0:1
chrA 30 ... GT:PS 0|1:1
chrA 40 ... GT:PS 1|0:1
nh13 commented 4 years ago

Version: 4.1.8.1

Command:

gatk \
            --java-options "-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx4g" \
            --spark-runner LOCAL HaplotypeCaller \
            -ERC GVCF \
            -I input.bam \
            -R ref.fasta \
            -O out.g.vcf \
            --bam-output out.bam \
            --assembly-region-padding 1000 \
            --max-assembly-region-size 1000 \
            --smith-waterman JAVA \
            --linked-de-bruijn-graph \
            -L <100bp region>
droazen commented 4 years ago

@cwhelan Since you're looking at the phasing code in HC at the moment, would you have any insight into this issue?

nh13 commented 4 years ago

Shared with @cwhelan via email. Thanks for looking at this!

cwhelan commented 4 years ago

Debugging update: the issue in this case seems to be that HaplotypeCaller constructs an additional haplotype containing a fifth variant a little bit downstream of the four called variants. The fifth variant has low read support and is not called, but matches the reference base at all of the four called positions and so is out of phase with both of the possible phase sets suggested by the four called variants. This causes the current phasing algorithm to give up, as it requires all of the haplotypes containing alternate alleles at a locus to agree with one of the phase sets it creates.

A possible solution I am exploring is to exclude haplotypes that don't have any called non-ref alleles from the set of calledHaplotypes passed to AssemblyBasedCallerUtils.phaseCalls by HaplotypeCallerGenotypingEngine.

cwhelan commented 3 years ago

This should be fixed by https://github.com/broadinstitute/gatk/pull/7019

nh13 commented 3 years ago

Thanks @cwhelan !!!