broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.69k stars 588 forks source link

GATK fails to genotype insertion immediately following a deletion at hg19:chr1:68896832 #6538

Open tfenne opened 4 years ago

tfenne commented 4 years ago

Bug Report

Affected tool(s) or class(es)

HaplotypeCaller

Affected version(s)

Description

I have a sample that has a slightly complicated event in it that is getting miscalled. The easiest way to see it is probably with an IGV screenshot:

variant

BWA aligns the reads with a 7bp deletion followed by 2 mismatches, though am inclined to think of it as a 9bp deletion coupled with a 2bp insertion (or a swap of 9bp of reference for 2bp of novel sequence). The original alignments are in the top half of the IGV view. The bottom is the assembly BAM from running the HaplotypeCaller. From what I see the assembly is getting it right.

But the problem is that the event extraction/genotyping goes wrong. I've run it two ways. If I run to generate a called VCF directly using:

gatk HaplotypeCaller \
  --input snippet.bam \
  --output snippet.vcf \
  -R hg19/hg19.fa \
  --bam-output assembly.bam \
  -L chr1:68896800-68896900 \
  --ploidy 2 \
  --min-pruning 2 \
  --min-dangling-branch-length 2 \
  --pcr-indel-model CONSERVATIVE  

Then I get only a single variant reported in the region (the 9bp deletion):

#CHROM  POS       ID  REF         ALT  QUAL     FILTER  INFO                                                                                                                                                         FORMAT          test-sample
chr1    68896832  .   CTTTAGTTTT  C    1597.60  .       AC=1;AF=0.500;AN=2;BaseQRankSum=0.000;DP=122;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=14.52;ReadPosRankSum=1.341;SOR=0.350  GT:AD:DP:GQ:PL  0/1:67,43:110:99:1605,0,2683

If i run to generate a gvcf then things get more interesting:

gatk HaplotypeCaller \
  --input snippet.bam \
  --output snippet.g.vcf \
  -R hg19/hg19.fa \
  -ERC GVCF \
  --bam-output assembly.bam \
  -L chr1:68896800-68896900 \
  --ploidy 2 \
  --min-pruning 2 \
  --min-dangling-branch-length 2 \
  --pcr-indel-model CONSERVATIVE  

yields:

#CHROM  POS       ID  REF         ALT              QUAL     FILTER  INFO             FORMAT              test-sample
chr1    68896800  .   G           <NON_REF>        .        .       END=68896831     GT:DP:GQ:MIN_DP:PL  0/0:118:99:107:0,120,1800
chr1    68896832  .   CTTTAGTTTT  C,<NON_REF>      1597.60  .       BaseQRankSum...  GT:AD:DP:GQ:PL:SB   0/1:67,43,0:110:99:1605,0,2683,1807,2813,4620:67,0,43,0
chr1    68896841  .   T           *,TCC,<NON_REF>  344.02   .       DP=110;Exces...  GT:GQ:PL            ./.:99:0,0,0,0,0,0,0,0,0,0
chr1    68896842  .   G           <NON_REF>        .        .       END=68896899     GT:DP:GQ:MIN_DP:PL  0/0:71:99:41:0,99,1485

I.e. a record is emitted for the insertion but the genotype is ./. with a quality of 99, 0s for all the PLs and the other per-sample annotations we'd expect on a variant record missing.

I suspect the problem has something to do with the fact that the deletion and insertion are in cis and the insertion's anchor base is within the deletion. I'm not even sure how one would represent this as a pair of variants. I think ideally this would be emitted as a single variant with REF=CTTTAGTTTT and ALT=CCC.

Steps to reproduce

Use the command lines above with the files attached in deletion-then-insertion-files.zip

Expected behavior

The resulting VCF (and gVCF) should emit one or more variants with called genotypes that indicate a 9bp deletion followed by a 2bp insertion in cis.

Actual behavior

The 9bp deletion is emitted and a) in the gVCF the 2bp insertion is emitted without a genotype or other information that would allow genotyping, b) if a VCF is called directly the 2bp insertion is absent entirely.

nh13 commented 7 months ago

When I ran this with 4.5.0.0, I get:

chr1    68896800    .   G   <NON_REF>   .   .   END=68896831    GT:DP:GQ:MIN_DP:PL  0/0:118:99:107:0,120,1800
chr1    68896832    .   CTTTAGTTTT  C,<NON_REF> 1597.60 .   BaseQRankSum=0.000;DP=122;ExcessHet=0.0000;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQandDP=439200,122;ReadPosRankSum=1.341   GT:AD:DP:GQ:PGT:PID:PL:PS:SB    0|1:67,43,0:110:99:0|1:68896832_CTTTAGTTTT_C:1605,0,2683,1807,2813,4620:68896832:67,0,43,0
chr1    68896841    .   T   TCC,<NON_REF>   1596.60 .   DP=110;ExcessHet=0.0000;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQandDP=396000,110;ReadPosRankSum=1.566  GT:AD:DP:GQ:PGT:PID:PL:PS:SB    0|1:67,43,0:110:99:0|1:68896832_CTTTAGTTTT_C:1604,0,2684,1806,2813,4619:68896832:67,0,43,0
chr1    68896842    .   G   <NON_REF>   .   .   END=68896899    GT:DP:GQ:MIN_DP:PL  0/0:71:99:41:0,99,1485
chr1    68896900    .   T   <NON_REF>   .   .   END=68896900    GT:DP:GQ:MIN_DP:PL  0/0:41:78:41:0,78,1170

And the 2bp insertion has genotype 0|1 (phased) as we'd like.