broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.69k stars 590 forks source link

Is Mutect2 reporting variant allele fraction (AF) after adding a 1-read "prior" to the supporting allele depth (AD) values? #8080

Open carbocation opened 1 year ago

carbocation commented 1 year ago

Documentation request

Mutect2 seems to compute allele fraction (AF) from the allele depth (AD) data after adding a prior (which seems to be that it adds 1 read to each of the alleles). However, as of at least GATK 4.2.5.0, Mutect2 produces a header that does not indicate that this is happening: (##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">).

Tools involved

Background

There is a question about the meaning of the AF field that seems to come up often in the GATK forum and it feels like having clear documentation around this would be helpful.

My impression is that Mutect2 might be using an AD "1-read-per-allele" prior and incorporating that into its reported AF. From the article on informative reads, once you're at the sample level (FORMAT field), both DP and AD appear to include only informative alleles. It is tempting to think that AF would be computed from them directly (e.g., AD_alt / DP, which is equivalent to AD_alt/[AD_alt+AD_ref] in the biallelic case since only informative reads are retained).

However, as noted in those linked forum posts, Mutect2 (in my case, version 4.2.5.0) does not produce AF values that can be computed from the AD values in that way. Rather, the AF value appears to incorporate a prior. 

I investigated this across a range of allele depths in real calls. Here are some examples. The format is: AlleleDepthRef,AlleleDepthAlt DP AF[provided by Mutect2] AF[if I calculate it myself]
0,1 1 0.667 1.000
23,4 27 0.170 0.148
39,125 164 0.758 0.762

The intuition here is that there is a huge discrepancy between the Mutect2 AF and the AF I calculate when AD (or DP) is small (first row), and the error gets smaller as DP increases. The formula that Mutect2 seems to use to compute AF is:

# Formula that Mutect2 seems to use to calculate AF of the alternative allele in a biallelic scenario
Mutect2_AF = (ADalt+1) / (ADalt+1 + ADref+1)

# Which is equivalent to:
Mutect2_AF = (ADalt + 1) / (DP + 2)
  1. Is my inference about a prior weight being added by Mutect2 prior to computing AF accurate?
  2. If so, is it intended behavior?
  3. If so, can the VCF header field be a bit more informative about this?
williambrandler commented 1 year ago

thanks for this explanation @carbocation. I noticed it too. Do you use your AF or mutect2's AF?

carbocation commented 12 months ago

The best explanation that I've seen comes from @davidbenjamin at this now-deprecated forum page:

There are differences in the actual variant allele frequency and the VAF provided by GATK.

Do you mean that AF != AD/DP? It’s true that this is not what Mutect2 emits for the AF, which is a probabilisitc estimate that accounts for uncertainty. To see what I mean, suppose there are 100 total reads, 90 of which are ref and 10 of which each have a 60% chance of supporting the alt allele. As far as AD is concerned this is 10 alt reads, but as far as probability is concerned this is an expected 6 alt reads.

davidbenjamin commented 12 months ago

It's more complicated than just adding a prior. Technically it's the mean posterior estimate of the cellular allele fraction within the variational Bayesian somatic likelihoods model, conditioned on the allele existing as a somatic variant. That is, "if this is real, here's a guess at the allele fraction."

This is inflated relative to the arithmetical AD/DP estimate (with is the read allele fraction, not inherently a biological quantity) because if it were real it's possible that it just randomly got a small draw of reads.

Personally, I don't think either Mutect2's AF or AD/DP are all that useful. I mean, FilterMutectCalls does a better job than any hard filter one could concoct using either definition of AF, and if you want to make some model like tumor phylogeny involving cellular fractions you would need to work directly with AD and DP, as the somatic clustering model within FilterMutectCalls does, albeit crudely.