Open carbocation opened 1 year ago
thanks for this explanation @carbocation. I noticed it too. Do you use your AF or mutect2's AF?
The best explanation that I've seen comes from @davidbenjamin at this now-deprecated forum page:
There are differences in the actual variant allele frequency and the VAF provided by GATK.
Do you mean that AF != AD/DP? It’s true that this is not what Mutect2 emits for the AF, which is a probabilisitc estimate that accounts for uncertainty. To see what I mean, suppose there are 100 total reads, 90 of which are ref and 10 of which each have a 60% chance of supporting the alt allele. As far as AD is concerned this is 10 alt reads, but as far as probability is concerned this is an expected 6 alt reads.
It's more complicated than just adding a prior. Technically it's the mean posterior estimate of the cellular allele fraction within the variational Bayesian somatic likelihoods model, conditioned on the allele existing as a somatic variant. That is, "if this is real, here's a guess at the allele fraction."
This is inflated relative to the arithmetical AD/DP estimate (with is the read allele fraction, not inherently a biological quantity) because if it were real it's possible that it just randomly got a small draw of reads.
Personally, I don't think either Mutect2's AF or AD/DP are all that useful. I mean, FilterMutectCalls does a better job than any hard filter one could concoct using either definition of AF, and if you want to make some model like tumor phylogeny involving cellular fractions you would need to work directly with AD and DP, as the somatic clustering model within FilterMutectCalls does, albeit crudely.
Documentation request
Mutect2 seems to compute allele fraction (AF) from the allele depth (AD) data after adding a prior (which seems to be that it adds 1 read to each of the alleles). However, as of at least GATK 4.2.5.0, Mutect2 produces a header that does not indicate that this is happening: (
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">
).Tools involved
Background
There is a question about the meaning of the AF field that seems to come up often in the GATK forum and it feels like having clear documentation around this would be helpful.
My impression is that Mutect2 might be using an AD "1-read-per-allele" prior and incorporating that into its reported AF. From the article on informative reads, once you're at the sample level (FORMAT field), both DP and AD appear to include only informative alleles. It is tempting to think that AF would be computed from them directly (e.g.,
AD_alt / DP
, which is equivalent toAD_alt/[AD_alt+AD_ref]
in the biallelic case since only informative reads are retained).However, as noted in those linked forum posts, Mutect2 (in my case, version 4.2.5.0) does not produce AF values that can be computed from the AD values in that way. Rather, the AF value appears to incorporate a prior.
The intuition here is that there is a huge discrepancy between the Mutect2 AF and the AF I calculate when AD (or DP) is small (first row), and the error gets smaller as DP increases. The formula that Mutect2 seems to use to compute AF is: