broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.69k stars 589 forks source link

Inconsistent Handling of Artificial Reads in Allele Depth (AD) Calculation #8959

Closed mrymkdnz closed 2 months ago

mrymkdnz commented 2 months ago

version: 4.2.0.0 command:

    "gatk --java-options '{java_opts}' HaplotypeCaller {extra} "
    "-L {snakemake.input.interval_list} "
    "-R {snakemake.input.ref} {bams} "
    "-A StrandBiasBySample "
    "-bamout {snakemake.output.bam} "
    "-O {snakemake.output.gvcf} {dbsnp} {log}"

I have encountered an issue with HaplotypeCaller's handling of artificial reads when calculating the Allele Depth (AD) in the VCF output. After examining the IGV images and VCF outputs for multiple variants, I noticed an inconsistency in how artificial reads are counted towards the AD.

Details:

For the variant at position chr15:93002203 with the reference allele G and alternate allele GA, the VCF output shows: AD=24,6

chr15 93002203 . G GA 73.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.000;DP=34;ExcessHet=3.0103;FS=21.417;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=2.45;ReadPosRankSum=1.481;SOR=1.028 GT:AD:DP:GQ:PL:SB 0/1:24,6:30:81:81,0,597:22,2,2,4

Upon reviewing the bamout IGV image, I confirmed that the artificial reads were not considered informative and were therefore not included in the AD calculation, which aligns with the expected behavior.

image

This behavior, where artificial reads are excluded from the AD calculation, is something I have observed across multiple variants, not just the example provided above.

However, a different behavior was observed with another variant at position chr1:31662674 with the reference allele G and alternate allele GGGC. The VCF output for this variant shows: AD=22,4

chr1 31662674 . G GGGC 52.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.692;DP=30;ExcessHet=3.0103;FS=41.746;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=2.02;ReadPosRankSum=-2.529;SOR=3.219 GT:AD:DP:GQ:PL:SB 0/1:22,4:26:60:60,0,911:0,22,4,0

In this case, the bamout IGV image shows only 2 insertions excluding the artificial reads, yet the AD includes 4, suggesting that the artificial reads were counted. This is contrary to the behavior observed with the first variant, where artificial reads were not counted.

image

Why are artificial reads being included in the AD calculation for some variants but not for others? This inconsistency can lead to confusion and potentially affect downstream analyses. I would appreciate any insights or guidance on this issue.

Thank you for your support.

gokalpcelik commented 2 months ago

Can you turn off downsampling in IGV and try checking your variant sites see if AD checks out with any of those parameters?

gokalpcelik commented 2 months ago

Additionally. You may wish to compare your variant depths and allelic balances to those captured in gnomAD 4.1 Variant page here Looking at the balances of heterozygous calls your sites don't seem to display any inconsistencies compared to those available in gnomAD sample set.

Artificial Haplotypes are produced by HaplotypeCaller are available only for debugging purposes. Bamout also includes those informative reads used to produce these haplotypes to provide additional debugging support. Soft clipped bases, realignment, overlapping pairs, duplicates and reads that are available to HaplotypeCaller due to interval restrictions all affect the way you observe these bamouts.

HaplotypeCaller by design is not 100% deterministic in performing its calls. But this does not mean it is inconsistent in making calls when provided all the available reads and reference for its function. That is why we don't recommend restricting HaplotypeCaller when making variant calls with short intervals.

mrymkdnz commented 2 months ago

image

I followed the suggestion to turn off the downsampling option, but there was no visible change in the results. Below are the igv images after disabling the downsampling:

Variant: chr15 93002203: image

Variant: chr1 31662674:

image

mrymkdnz commented 2 months ago

I also included the 15-93002203-G-GA variant to demonstrate support for the reverse. What confused me was the 1-31662674-G-GGGC variant.

image

IGV image shows only 2 insertions excluding the artificial reads, yet the AD includes 4, suggesting that the artificial reads were counted. This is contrary to the behavior observed with the first variant, where artificial reads were not counted.Why are artificial reads being included in the AD calculation for some variants but not for others?

gokalpcelik commented 2 months ago

Looking at the below image I can tell that artificial reads were still not included in the AD calculation. 2 of those reads span the 3bp insertion and 2 bottom reads seem to contain possible realigned portions within but not in the aligner mapped state therefore counting those reads (unless they are duplicates) could add up to 4 alt allele count in the realignment and PairHMM. So AD is the number of informative reads that are used in generating those Artificial Haplotypes that you see in the bamout. Artificial haplotypes are not used in the allele counting operation.

If a read is considered informative, it gets counted toward the AD and DP of the variant allele in the output record. If a read is considered uninformative, it is counted towards the DP, but not the AD. That way, the AD value reflects how many reads actually contributed support for a given allele at the site. We would not want to include uninformative reads in the AD value because we don’t have confidence in them.

I hope this explains the issue.

mrymkdnz commented 2 months ago

Alright, we agree that the issue is not related to the insertions seen in the artificial haplotypes. If I understood correctly, you are referring to the two insertions shown below. Doesn't it seem a bit strange that these four insertions are counted as a single variant when their sizes aren't even the same? Please correct me if I'm wrong; I'm just trying to understand.

image

gokalpcelik commented 2 months ago

When reads are mapped onto the genome short reads don't always find the best spot for indels. Sometimes reads are clipped at a position where a particular indel could have been mapped properly. Those regions you showed here are all homopolymer rich regions where assembly and variant calls are usually harder than other places. 3bp insertion within a GC rich region could easily be mapped wrong due to G and C nucleotide positioning and the way G/C nucleotide is handled by the sequencing instrument. Due to chemistry and optics reasons certain basecalls in GC rich regions may get convoluted with wrong nucleotide assertions such as 1 less G and one more C.

Reassembly, Realignment and PairHMM removes such artifacts by looking at basecalling metrics, mapping qualities, regional metrics etc. I cannot see it directly from the image however it is possible that some of those reads could have been soft/hard clipped due to such errors but yet they are still valid and usable by the assembly engine.

You may wish to read about the DepthPerAlleleBySample class and its documentation as it is the object class that calculates AD for variant contexts. DepthPerAlleleBySample

I hope this helps.

mrymkdnz commented 2 months ago

Thank you for the detailed explanations you have shared.