broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.69k stars 588 forks source link

GATK Haplotypecaller is calling Heterozygous variant, but bamout doesn't reflect the read frequencies in gvcf file #8238

Open kvn95ss opened 1 year ago

kvn95ss commented 1 year ago

Bug Report

Affected tool(s) or class(es)

GATK Haplotypecaller

Affected version(s)

(4.3.0.0 and 4.2.6.1)

Description

We used Haplotypecaller in GVCF mode (initially in 4.2.6.1, then again with 4.3.0.0) for one of our human samples. These samples were joint-genotyped with ~2.7k exomes. The exact command used was -

gatk HaplotypeCaller -R "$ref_hg38" -I input.bam -L twist.bed -ERC GVCF -ip 50 -O test_latest.gvcf.gz -bamout test_bamout_latest.bam --tmp-dir /mnt/exome/tmp/

Below is a screenshot of the bamout file (top track) and the recalibrated BAM file (bottom track).

image

As shown in screenshot, there are only 2 reads supporting the alternate allele, however the gvcf.gz file has the below entry for the variant -

chr5 176530208 . T C,<NON_REF> 2717.64 . BaseQRankSum=0.975;DP=179;ExcessHet=0.0000;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQandDP=644400,179;ReadPosRankSum=10.927 GT:AD:DP:GQ:PGT:PID:PL:PS:SB 0|1:68,102,0:170:99:0|1:176530208_T_C:2725,0,1279,2928,1584,4512:176530208:38,30,37,65

This variant is called as a heterozygous variant 0|1 with read frequency of 68,102,0, which would mean 68 reads supporting the ref allele, 102 reads supporting the alt C allele and 0 reads supporting the allele. After joint-genotyping, the variant was classified as LOW_VQSLOD.

Steps to reproduce

Let us know how to share the BAM file subset and let us know if the error is reproducible.

Expected behavior

Ideally, if the variant had been called, its read frequency should have been represented more accurately.

Actual behavior

The read frequencies are not matching up.

kvn95ss commented 1 year ago

Hello team,

Has anyone looked into this? We can share the bamout files and / or the data to recreate the problem from your end. Do let us know how to proceed.

vruano commented 1 year ago

Is there any other variants up or downstream within a read length? (try to zoom out a bit)

Also when looking into a bamout it helps to color reads and haplotypes by their best haplotype id (HC tag I think).

kvn95ss commented 1 year ago

Hay, thanks for the response.

There are two homozygous variants downstream. We have colored the reads by HC tag, please see the image below.

image

vruano commented 1 year ago

I guess that the AD counts that you are observing match the two non gray color you see in the plot. You can get the Records that represent the haplotypes grouped away if you group "reads" by sample. "HC" is the artificial sample assigned to haplotype encoding read records. I bet that one of them has the C mutation with at least one of the two mutations downstream and the other is just the reference.

If that is the case then the question is why HaplotypeCaller did not reconstruct the haplotype with the downstream alternative only (no C mutation).

kvn95ss commented 1 year ago

Grouping the reads by sample doesn't show any change in reads. And your bet was right - there are two reads with C mutation which have both downstream variants. There are ~103 reads which contain the two downstream variant, which is close to the Alt read count in the called variant.

How do we proceed from here?

vruano commented 1 year ago

Sorry, perhaps I wasn't clear. The bamout not only output the read records realigned to the reference (thru their alignment to te best haplotype) but also the haplotype themselves as reads records. I'm not 100% sure of this is true for your run but these special records probably have a sample name "HC" and as read id something like "HCXXXXXXX" where XXXXXXX is a number.

My hope was that by grouping by sample the haplotypes would stand out and that we could then verify what haplotypes were reconstructed.

My suspicion is that haplotype with no C mutation but with downstream mutation has not be reconstructed.. You need to identify the complete list of reconstructed haplotypes to confirm that, either by grouping somehow haplotypes away from the actual reads in the bamout or perhaps looking into HaplotypeCaller's debugging output.

If that is true, what is happening is that reads that contain the downstream mutation would artifactually have support for the C mutation even if the have a ref base for that position or if they don't even overlap that position.

So if this is confirmed, the following step would be to figure out why this is happening (not reconstructing that obvious haplotype) and fix the issue. Hopefully someone in the GATK developer team can look into this as you may well have hit an interesting edge case that needs to be ironed out.

droazen commented 1 year ago

Thanks @vruano ! :)

kvn95ss commented 1 year ago

Hay,

I sliced out this region with PrintReads at loci chr5:176529708-176530708 as the artifact was at loci chr5:176530208.

Interestingly, HaplotypeCaller didn't call the artefact when the bamout was given as input, it only calls the artifact when calling with the whole exome data. I have attached the bamout and haplotypecaller output for the same. Sadly can't share the entire file as its patient data. Do let us know if it is of any help.

bamout_and_haplotypecaller_output.zip