brentp opened this issue 8 years ago
Hmm, so I actually think the current max_aaf_all value of 0 is "correct" in so far as it has the desired value. The ExAC webpage shows this variant as a low quality site and appears to be deliberately reporting the AAF on the website as NA because the AAF cannot be reliably determined when the site is poor quality.
Taken to an extreme, if a variant is only called in 1 het person in ExAC, I certainly wouldn't want max_aaf_all to report AAF=1.0 (or any population AAF for that matter) because it would lead me to accidentally exclude a variant that could very well be only present in that single person if one were to do resequencing and determine everyone's actual genotype at that site.
Aye, that makes sense.
Aaron has been pushing for exac_num_homalts and exac_num_hets columns that would be informative for these cases.
We may also expose the FILTER field from the ExAC VCF.
Yeah, I’m not sure the best way to handle this for the max_aaf_all field since you don’t have num_hom_alts, num_hets values for any of the other databases (plus if you start making it so that people have to specify those additional columns, it defeats the simplification goal of max_aaf_all).
Actually more generally, it also brings up as a question how EVS and 1KG are reporting AAF when the site is only called in, for example, <80% of samples (that’s the cutoff ExAC appears to be using).
Adding the exac_num_homalts and exac_num_hets columns would be immensely useful for our group as well since we have faced similar situations mentioned above. Is there an approximate timeline for this?
Thanks a lot! Nima
I'll get it in for the next release. Would an aggregate column of the sum of num_homalts across all populations suffice or would it be better to have 1 column per population?
I would suggest taking ExAC’s approach and for a convenience variable such as max_aaf_all, always conservatively assume NA/Null for the allele frequency if there is high missingness among the sequenced samples.
So, report non-NULL for the proposed fields only if FILTER == PASS ?
Maybe not based on FILTER but based on the adjusted allele number? e.g. for the variant I originally noticed, ExAC’s webpage says “Warning! This variant is only covered in 12 individuals (adjusted allele number = 24).” — that’s what we want to ignore — the variants where there are just too few samples with a genotype to possibly obtain a reasonably accurate allele frequency.
OK. So we need a reasonable cutoff for AN_Adj (24 is the value for this case). Suggestions? Seems like we should put AN_Adj in the db as well.
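As a minimal sketch of such a rule: null out the AAF when too small a fraction of chromosomes were successfully genotyped. The ~80% cutoff mirrors what ExAC's site appears to use (mentioned earlier in the thread); the function name and exact threshold here are hypothetical.

```python
def reliable_aaf(ac_adj, an_adj, an_total, min_called_fraction=0.8):
    """Return AC_Adj / AN_Adj, or None when the fraction of successfully
    genotyped chromosomes (AN_Adj / an_total) falls below the cutoff.
    The 0.8 default mirrors the ~80% threshold ExAC's site appears to
    use; both the name and cutoff are assumptions for illustration."""
    if an_total == 0 or an_adj / an_total < min_called_fraction:
        return None
    return ac_adj / an_adj

# The variant from the thread: covered in only 12 individuals (AN_Adj = 24)
# out of roughly 60,706 ExAC samples (121,412 chromosomes):
print(reliable_aaf(1, 24, 121412))        # -> None
print(reliable_aaf(100, 110000, 121412))  # -> 0.000909...
```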
I’ve been contemplating this for a little while but never got very far, but technically there’s a way to estimate the confidence interval of a given allele frequency based on sample size… :) http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.408.7107&rep=rep1&type=pdf
Brent Werness (@BrentWerness) had previously suggested to me taking the 95% confidence interval, using "a fancy formula like method 4 on p.859 if you can"
I implemented eqn 4 from that paper as below. It seems to give reasonable results, though the bounds are tighter than I expected (see the doctests). The example from the mailing list, af_lims(0.001428, 24), gives 0 and 0.11 for the lower and upper bounds.
I'll have to have another look tomorrow to decide whether the 2nd arg is n_samples or n_samples * 2.
```python
from math import sqrt

def af_lims(af, n_samples, alpha=0.05):
    """
    Calculate the confidence limits (alpha, 1 - alpha) of an allele
    frequency (af) from n_samples.
    Uses eqn 4 from:
    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.408.7107&rep=rep1&type=pdf

    AF of 0.2 with varying samples:

    >>> af_lims(0.2, 5) #doctest: +ELLIPSIS
    (0.120663..., 0.39002...)
    >>> af_lims(0.2, 10) #doctest: +ELLIPSIS
    (0.1635..., 0.2908...)
    >>> af_lims(0.2, 100) #doctest: +ELLIPSIS
    (0.19727..., 0.208385...)
    >>> af_lims(0.2, 1000) #doctest: +ELLIPSIS
    (0.199766..., 0.200803...)

    AF of 0.002 with varying samples:

    >>> af_lims(0.002, 5) #doctest: +ELLIPSIS
    (0.0, 1.0)
    >>> af_lims(0.002, 10) #doctest: +ELLIPSIS
    (0.0, 1.0)
    >>> af_lims(0.002, 100) #doctest: +ELLIPSIS
    (0.0, 0.01714...)
    >>> af_lims(0.002, 1000) #doctest: +ELLIPSIS
    (0.001801..., 0.00314...)
    """
    p, n = af, n_samples
    z = 1 - alpha / 2
    q = 1 - p
    a = 2 * n * p + z ** 2
    denom = 2 * (n + z ** 2)
    try:
        bL = z / sqrt((z ** 2 - 2 - 1) / n + 4 * p * (n * q + 1))
        L = (a - 1 - bL) / denom
    except ValueError:
        L = 0.0
    try:
        bU = z / sqrt((z ** 2 - 2 - 1) / n + 4 * p * (n * q - 1))
        U = (a + 1 + bU) / denom
    except ValueError:
        U = 1.0
    return max(L, 0.0), min(U, 1.0)

print af_lims(0.001428, 24)

if __name__ == "__main__":
    import doctest
    doctest.testmod()
```
I believe you have a few transcription errors -- your z is wrong (it should be the 1.96 normal quantile, not 1 - alpha/2), and the term under the square root should be 1/n:
Here's my R code:
```r
af_lims <- function(p, n) {
  q <- 1 - p
  z <- 1.96
  a <- 2 * n * p + z^2
  denom <- 2 * (n + z^2)
  L <- (a - 1 - z * sqrt(z^2 - 2 - 1/n + 4 * p * (n * q + 1))) / denom
  U <- (a + 1 + z * sqrt(z^2 + 2 - 1/n + 4 * p * (n * q - 1))) / denom
  paste(L, U)
}

# examples from Table I in the paper (page 861)
af_lims(81/263, 263)
af_lims(1/29, 29)

# example from my original post
af_lims(0.001428, 24)
```
One thing Brent (Brent Werness) and I have talked about is that if you are going to go through the effort of implementing confidence intervals for allele frequency estimates, you should store both upper, "empirical," and lower values for the allele frequency.
Since I do Mendelian analysis, my main concern is being cautious and not accidentally excluding a variant that is actually rare but appears to be common because of sequencing issues. Therefore, I would always choose to use the lower limit of the confidence interval for my filtering.
Perhaps someone doing some other analysis (I can't think of one offhand, but who knows) might be concerned with not accidentally including a rare variant, in which case, they would want to use the upper limit of the confidence interval.
Other people might decide this is all too complex and use the empirical allele frequency (though as we can see here, that's a bad idea!) :)
The intervals look more sane after updating the code to be correct as you suggested: https://gist.github.com/brentp/172e213a87f3f6188891
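For readers following along, here is a sketch of the corrected function with the two fixes applied (z as the 1.96 normal quantile, and 1/n inside the square root, matching the R code above); the gist is the canonical version:

```python
from math import sqrt

def af_lims(p, n, z=1.96):
    """95% confidence limits for an allele frequency p observed over n
    trials, using eqn 4 (score interval with continuity correction)
    from the paper linked earlier, transcription errors fixed."""
    q = 1.0 - p
    a = 2 * n * p + z ** 2
    denom = 2 * (n + z ** 2)
    try:
        L = (a - 1 - z * sqrt(z ** 2 - 2 - 1.0 / n + 4 * p * (n * q + 1))) / denom
    except ValueError:  # negative sqrt argument at extreme p, n
        L = 0.0
    try:
        U = (a + 1 + z * sqrt(z ** 2 + 2 - 1.0 / n + 4 * p * (n * q - 1))) / denom
    except ValueError:
        U = 1.0
    return max(L, 0.0), min(U, 1.0)

# Table I example from the paper (81 successes in 263 trials):
print(af_lims(81.0 / 263, 263))  # -> approx (0.2535, 0.3682)
```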
We are thinking of making this available to call somehow, given that the user has AC and AN.
Other options are to include an aggregate column like max_aaf_all for min(lower CI) and one for max(upper CI) across ExAC,ESP,1KG
I like the aggregate option because it is easy/simple to replicate without mistakes over the long run and because now that you've gone through the effort of implementing this, I'd personally choose to use max_aaf_all_lowerCI over max_aaf_all! ;)
hi everyone, the columns for exac (exac_num_het, exac_num_hom_alt, exac_num_chroms) are now available in master. we have a few things to go in before the next release, but you can access them with `gemini update --devel`. let us know any feedback.
Cool! Are you eventually (future release) planning to do the confidence interval thing for each of the reference databases? (ExAC, 1KG, EVS, etc.)? If not, any idea if it's possible to replicate in SQL on the command line at run-time?
I had started a script to do this. For starters, we'd want to store the maximum of the lower CI across ExAC populations, agreed?
@jxchong I put the simple script here: https://gist.github.com/brentp/172e213a87f3f6188891
It's runnable as `python af_lim.py your.gemini.db`.
We'll have to figure out the details later, but it's a start. I thought it might be more useful to save the max of the max CI...
Ah cool. Yes, I agree on the max of the lower CI across all populations. I'm not sure about the max CI, as I can't think of an immediate use case myself, so I have no opinion.
note-to-self (and to anyone using the above gist): that calculation needs to be adjusted to use exac_num_chroms rather than 60K since the AF for some variants is estimated using only a small number of samples...
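A sketch of that adjustment, assuming the corrected eqn-4 interval and the new exac columns (the helper name is hypothetical, and whether n should be chromosomes or individuals is still the open question from earlier in the thread):

```python
from math import sqrt

def af_lims(p, n, z=1.96):
    # corrected eqn-4 score interval (see the R code earlier in the thread)
    q = 1.0 - p
    a = 2 * n * p + z ** 2
    denom = 2 * (n + z ** 2)
    try:
        L = (a - 1 - z * sqrt(z ** 2 - 2 - 1.0 / n + 4 * p * (n * q + 1))) / denom
    except ValueError:
        L = 0.0
    try:
        U = (a + 1 + z * sqrt(z ** 2 + 2 - 1.0 / n + 4 * p * (n * q - 1))) / denom
    except ValueError:
        U = 1.0
    return max(L, 0.0), min(U, 1.0)

def exac_lower_ci(num_het, num_hom_alt, num_chroms):
    """Lower confidence bound on the ExAC alt allele frequency, sized by
    the per-variant chromosome count (exac_num_chroms) rather than a
    fixed ~60K-sample assumption. Helper name is hypothetical."""
    if num_chroms == 0:
        return None
    af = (num_het + 2.0 * num_hom_alt) / num_chroms
    # n is taken as the chromosome count here; whether it should instead
    # be the number of individuals is the open question noted above.
    return af_lims(af, num_chroms)[0]

# A variant genotyped on only 24 chromosomes gets a lower bound near 0,
# while the same point estimate backed by 100,000 chromosomes does not:
print(exac_lower_ci(1, 0, 24))         # small, near 0
print(exac_lower_ci(1000, 0, 100000))  # close to the 0.01 point estimate
```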
I am closing this issue for now, since the original concern is resolved and the exac_num_het and exac_num_hom_alt columns have been added. I have opened #609 as a new issue for the allele frequency confidence intervals.
Hello Brent,
I am not sure if this was already discussed, so I apologize if I am repeating this point below:
Since the SNPs at a given locus are normalized in GEMINI, we have two entries (rows) for tri-allelic sites. And when you pull from ExAC's AC_Hom (after it has been tidied up), you do rightly get two entries for each tri-allelic SNP. However, the correspondence between each count and its pair of alternates is lost after the tidy operation.
For example,
GEMINI OUTPUT:
```
chrom  start     end       rs_ids                 ref  alt                           exac_num_het                                   exac_num_hom_alt
chr17  39346594  39346595  rs11283848,rs71155128  T    TGCTGTGGGTCCAGCTGCTGCCAGCCTA  6079,0,0,0,1,0,11,0,1,1,1,0,0,0,0,0,0,0,0,0,0  41827,0,0,0,0,0
```
ExAC.r0.3.sites.vep.tidy.vcf:
17 39346595 rs11283848 T TGCTGTGGGTCTAGCTGCTGCCAGCCTA 1.12247e+08 VQSRTrancheINDEL96.00to97.00 AC=108476,13,1,1,2,1;AC_AFR=6954,0,0,0,0,0;AC_AMR=8260,0,0,0,0,0;AC_Adj=89747,11,0,1,2,1;AC_EAS=6648,0,0,0,1,0;AC_FIN=4828,0,0,0,0,0;AC_Het=6079,0,0,0,1,0,11,0,1,1,1,0,0,0,0,0,0,0,0,0,0;AC_Hom=41827,0,0,0,0,0;AC_NFE=48218,11,0,1,1,0;AC_OTH=689,0,0,0,0,0;AC_SAS=14150,0,0,0,0,1;AF=0.905,0.0001085,8.343e-06,8.343e-06,1.669e-05,8.343e-06;AN=119856;AN_AFR=7172;AN_AMR=8534;AN_Adj=96420;AN_EAS=7174;AN_FIN=5264;AN_NFE=52878;AN_OTH=726;AN_SAS=14672;BaseQRankSum=0.95;ClippingRankSum=-0.207;DB;DP=1429039;FS=0.525;GQ_MEAN=121.53;GQ_STDDEV=258.41;Het_AFR=194,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;Het_AMR=264,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;Het_EAS=486,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;Het_FIN=412,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;Het_NFE=4191,0,0,0,0,0,11,0,1,1,0,0,0,0,0,0,0,0,0,0,0;Het_OTH=33,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;Het_SAS=499,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0;Hom_AFR=3380,0,0,0,0,0;Hom_AMR=3998,0,0,0,0,0;Hom_EAS=3081,0,0,0,0,0;Hom_FIN=2208,0,0,0,0,0;Hom_NFE=22007,0,0,0,0,0;Hom_OTH=328,0,0,0,0,0;Hom_SAS=6825,0,0,0,0,0;
ExAC original VCF (the original ExAC file has 6 alternates in that region, which determines the order the allele counts are represented in GEMINI's exac_num_het and exac_num_hom_alt columns):
17 39346595 rs11283848 T TGCTGTGGGTCCAGCTGCTGCCAGCCTA,TGCTGTGGGTCCAGCTGCTACCAGCCTA,TA,TGCTATGGGTCCAGCTGCTGCCAGCCTA,TGCTGTGGGTCCAGCTGCTGTCAGCCTA,TGCTGTGGGTCTAGCTGCTGCCAGCCTA 112247467.38VQSRTrancheINDEL96.00to97.00 AC=108476,13,1,1,2,1;AC_AFR=6954,0,0,0,0,0;AC_AMR=8260,0,0,0,0,0;AC_Adj=89747,11,0,1,2,1;AC_EAS=6648,0,0,0,1,0;AC_FIN=4828,0,0,0,0,0;AC_Het=6079,0,0,0,1,0,11,0,1,1,1,0,0,0,0,0,0,0,0,0,0;AC_Hom=41827,0,0,0,0,0;AC_NFE=48218,11,0,1,1,0;AC_OTH=689,0,0,0,0,0;AC_SAS=14150,0,0,0,0,1;AF=0.905,1.085e-04,8.343e-06,8.343e-06,1.669e-05,8.343e-06;AN=119856;AN_AFR=7172;AN_AMR=8534;AN_Adj=96420;AN_EAS=7174;AN_FIN=5264;AN_NFE=52878;AN_OTH=726;AN_SAS=14672;BaseQRankSum=0.950;ClippingRankSum=-2.070e-01;DB;DP=1429039;FS=0.525;GQ_MEAN=121.53;GQ_STDDEV=258.41;Het_AFR=194,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;Het_AMR=264,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;Het_EAS=486,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;Het_FIN=412,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;Het_NFE=4191,0,0,0,0,0,11,0,1,1,0,0,0,0,0,0,0,0,0,0,0;Het_OTH=33,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;Het_SAS=499,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0;Hom_AFR=3380,0,0,0,0,0;Hom_AMR=3998,0,0,0,0,0;Hom_EAS=3081,0,0,0,0,0;Hom_FIN=2208,0,0,0,0,0;Hom_NFE=22007,0,0,0,0,0;Hom_OTH=328,0,0,0,0,0;Hom_SAS=6825,0,0,0,0,0;
So, since the one-to-one correspondence with the alternate pattern is lost when the tidy function is used, the numbers in the exac_num_het and exac_num_hom_alt columns do not allow us to determine which combination of alternates each count belongs to.
I hope my explanation made sense and please let me know if you have any questions regarding this. Perhaps I am not understanding this correctly myself.
Kind regards, Nima Farzaneh
@nfarzaneh I'll have to think about this further when I have more time to fully understand the problem. I changed the formatting of your post a bit so I could read it. It seems this is a problem with the vt decompose method that we use to do the alternate splitting -- is that what you're thinking?
Yes, exactly. The alternates aren't being split properly. Sorry about the formatting. It looks like vt decompose just breaks the alternates down, but most of the information is contained in the last (INFO) column of ExAC. Each of those comma-delimited values needs to correspond to its respective alternate, which isn't currently the case in GEMINI's output for those two exac columns.
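For context on why a simple per-alternate split can't preserve everything: at a site with n alternate alleles, AC_Het carries one count per heterozygous genotype, i.e. the n ref/alt hets plus the C(n, 2) alt/alt hets, so an alt/alt count involves two alternates at once. A quick sanity check against the hexa-allelic site above (the exact ordering ExAC uses within AC_Het is an assumption here; only the count matters for the argument):

```python
from itertools import combinations

def het_genotypes(n_alts):
    """Enumerate the het genotypes at a site with n_alts alternate
    alleles: the ref/alt pairs, then the alt/alt pairs. Allele 0 is
    the reference; the within-list ordering is assumed, not verified
    against ExAC's documentation."""
    ref_alt = [(0, i) for i in range(1, n_alts + 1)]
    alt_alt = list(combinations(range(1, n_alts + 1), 2))
    return ref_alt + alt_alt

# The 6-alt site above carries 21 comma-separated AC_Het values:
print(len(het_genotypes(6)))  # -> 21
```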
Got it. Thanks for the explanation. I'll have to reopen this issue to track the problem.
What version of vt are you using?
Hello Aaron,
vt decompose v0.5.
Thanks, Nima
The ExAC tidy file I posted above is from the gemini/data/ folder and it shows the same pattern I mentioned above.
Hi @nfarzaneh it seems like you have an old version of the ExAC annotation. The md5sum should be: 93a4393680c0fb8f9c4e5ffbef755bae but you can remove it and get a new one manually with:
```
wget https://s3.amazonaws.com/gemini-annotations/ExAC.r0.3.sites.vep.tidy.vcf.gz
wget https://s3.amazonaws.com/gemini-annotations/ExAC.r0.3.sites.vep.tidy.vcf.gz.tbi
```
in your annotations directory. This will resolve the problem with the Hom_ fields, but the Het_ counts are still not working as expected in vt.
Excellent, thanks. I can confirm that the issue has been resolved. Just a follow-up question then: I have been updating with `gemini update --devel` in the past, before the vt processing became part of the update script. Does this mean I should occasionally purge my /gemini/data folder? Or could you add md5sum checks to make sure the user has the appropriate database versions in their directory?
Thanks a lot for your quick response and again much appreciated for such an amazing piece of software.
Nima
I'm not sure why the ExAC annotation hasn't been getting updated. We do plan to add md5sum checking in the future.
Note that there are still problems with the Het_ counts; we'll have to get around the problems with decomposition in order to address those.
Having the HET counts would definitely be an added bonus in the future. The HOM counts are quite useful too! Thanks for implementing.
see: https://groups.google.com/forum/#!topic/gemini-variation/Z3PhfoNo8Nk