arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License

add aaf_exac_all to the calculation of max_aaf_all #568

Open brentp opened 8 years ago

brentp commented 8 years ago

see: https://groups.google.com/forum/#!topic/gemini-variation/Z3PhfoNo8Nk

jxchong commented 8 years ago

Hmm, so I actually think the current max_aaf_all value of 0 is "correct" insofar as it has the desired value. The ExAC webpage shows this variant as a low-quality site and appears to deliberately report the AAF on the website as NA because the AAF cannot be reliably determined when the site is poor quality.

Taken to an extreme, if a variant is only called in 1 het person in ExAC, I certainly wouldn't want max_aaf_all to report AAF=1.0 (or any population AAF for that matter) because it would lead me to accidentally exclude a variant that could very well be only present in that single person if one were to do resequencing and determine everyone's actual genotype at that site.

brentp commented 8 years ago

Aye, that makes sense.

Aaron has been pushing for exac_num_homalts and exac_num_hets columns that would be informative for these cases.

We may also expose the FILTER field from the ExAC VCF.

jxchong commented 8 years ago

Yeah, I’m not sure the best way to handle this for the max_aaf_all field since you don’t have num_hom_alts, num_hets values for any of the other databases (plus if you start making it so that people have to specify those additional columns, it defeats the simplification goal of max_aaf_all).

Actually, more generally, it also raises the question of how EVS and 1KG report AAF when a site is only called in, for example, <80% of samples (that’s the cutoff ExAC appears to be using).

nfarzaneh commented 8 years ago

Adding the exac_num_homalts and exac_num_hets columns would be immensely useful for our group as well since we have faced similar situations mentioned above. Is there an approximate timeline for this?

Thanks a lot! Nima

brentp commented 8 years ago

I'll get it in for the next release. Would an aggregate column of the sum of num_homalts across all populations suffice or would it be better to have 1 column per population?

jxchong commented 8 years ago

I would suggest taking ExAC’s approach and for a convenience variable such as max_aaf_all, always conservatively assume NA/Null for the allele frequency if there is high missingness among the sequenced samples.

brentp commented 8 years ago

So, report non-NULL for the proposed fields only if FILTER == PASS ?

jxchong commented 8 years ago

Maybe not based on FILTER but based on the adjusted allele number?  e.g. for the variant I originally noticed, ExAC’s webpage says “Warning! This variant is only covered in 12 individuals (adjusted allele number = 24).”  — that’s what we want to ignore — the variants where there are just too few samples with a genotype to possibly obtain a reasonably accurate allele frequency.

http://exac.broadinstitute.org/variant/1-207726161-G-T

brentp commented 8 years ago

OK. So we need a reasonable cutoff for AN_Adj (24 is the value for this case). Suggestions? Seems like we should put it in the db as well.
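
For illustration, the gating could be as simple as the sketch below. min_an is a placeholder cutoff (the thread never settles on a value), and the argument names mirror ExAC's AC_Adj/AN_Adj INFO keys:

def masked_aaf(ac_adj, an_adj, min_an=50):
    # Return the adjusted allele frequency, or None when too few
    # chromosomes were genotyped to estimate it reliably.
    # min_an=50 is a hypothetical threshold, not a project decision.
    if an_adj is None or an_adj < min_an:
        return None
    return ac_adj / float(an_adj)

masked_aaf(1, 24)  # the 12-individual example above -> None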

jxchong commented 8 years ago

I’ve been contemplating this for a little while but never got very far; technically, there’s a way to estimate the confidence interval of a given allele frequency based on sample size… :) http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.408.7107&rep=rep1&type=pdf

jxchong commented 8 years ago

Brent Werness (@BrentWerness) had previously suggested to me taking the 95% confidence interval, using "a fancy formula like method 4 on p.859 if you can"

brentp commented 8 years ago

I implemented eqn 4 from that paper as below. Seems to give reasonable results, though the bounds are tighter than I expected (see doctests). The example from the mailing list gives a lower bound of 0: af_lims(0.001428, 24) gives 0, 0.11 for the lower and upper bounds.

I'll have to have another look tomorrow to decide whether the 2nd arg is n_samples or n_samples * 2.

from math import sqrt

def af_lims(af, n_samples, alpha=0.05):
    """
    calculate the confidence limits (alpha, 1 - alpha) of allele frequency
    (af) from n_samples
    Uses eqn 4 from:
    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.408.7107&rep=rep1&type=pdf

    AF of 0.2 with varying samples
    >>> af_lims(0.2, 5) #doctest: +ELLIPSIS
    (0.120663..., 0.39002...)

    >>> af_lims(0.2, 10) #doctest: +ELLIPSIS
    (0.1635..., 0.2908...)

    >>> af_lims(0.2, 100) #doctest: +ELLIPSIS
    (0.19727..., 0.208385...)

    >>> af_lims(0.2, 1000) #doctest: +ELLIPSIS
    (0.199766..., 0.200803...)

    AF of 0.002 with varying samples
    >>> af_lims(0.002, 5) #doctest: +ELLIPSIS
    (0.0, 1.0)

    >>> af_lims(0.002, 10) #doctest: +ELLIPSIS
    (0.0, 1.0)

    >>> af_lims(0.002, 100) #doctest: +ELLIPSIS
    (0.0, 0.01714...)

    >>> af_lims(0.002, 1000) #doctest: +ELLIPSIS
    (0.001801..., 0.00314...)

    """

    p, n = af, n_samples
    z = 1 - alpha / 2
    q = 1 - p

    a = 2*n*p + z **2
    denom = 2 * (n + z**2)

    try:
        bL = z / sqrt((z**2 - 2 - 1) / n + 4*p*(n*q+1))
        L = (a - 1 - bL) / denom
    except ValueError:
        L = 0.0

    try:
        bU = z / sqrt((z**2 - 2 - 1) / n + 4*p*(n*q-1))
        U = (a + 1 + bU) / denom
    except ValueError:
        U = 1.0
    return max(L, 0.0), min(U, 1.0)

print af_lims(0.001428, 24)

if __name__ == "__main__":
    import doctest
    doctest.testmod()

jxchong commented 8 years ago

I believe you have a few transcription errors -- your z is wrong (it should be the normal quantile, 1.96 for a 95% interval, not 1 - alpha/2), and the term should be 1/n, not (z^2 - 2 - 1)/n:

Here's my R code:

af_lims <- function (p,n) {
    q <- 1-p
    z <- 1.96

    a <- 2*n*p+z^2
    denom <- (2*(n+z^2))

    L <- (a -1 -z* sqrt(z^2 -2 -1/n + 4*p*(n*q+1))) / denom
    U <- (a +1 +z* sqrt(z^2 +2 -1/n + 4*p*(n*q-1))) / denom

    paste(L, U)
}

# example Table I in paper (page 861)
af_lims(81/263, 263)
af_lims(1/29, 29)

# example from my original post
af_lims(0.001428, 24)
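
For reference, the corrected formula translated back to Python would look roughly like this (a sketch following the R code above, with z = 1.96 for a 95% interval; the clamping to [0, 1] stands in for the paper's special cases at p = 0 and p = 1):

from math import sqrt

def af_lims(p, n, z=1.96):
    # continuity-corrected score interval (method 4 in the cited paper)
    q = 1.0 - p
    a = 2*n*p + z**2
    denom = 2.0 * (n + z**2)
    L = (a - 1 - z * sqrt(z**2 - 2 - 1.0/n + 4*p*(n*q + 1))) / denom
    U = (a + 1 + z * sqrt(z**2 + 2 - 1.0/n + 4*p*(n*q - 1))) / denom
    return max(L, 0.0), min(U, 1.0)

af_lims(81/263.0, 263)  # Table I example from the paper (page 861)
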
jxchong commented 8 years ago

One thing Brent (Brent Werness) and I have talked about is that if you are going to go through the effort of implementing confidence intervals for allele frequency estimates, you should store the upper, "empirical," and lower values for the allele frequency.

Since I do Mendelian analysis, my main concern is being cautious and not accidentally excluding a variant that is actually rare but appears to be common because of sequencing issues. Therefore, I would always choose to use the lower limit of the confidence interval for my filtering.

Perhaps someone doing some other analysis (I can't think of one offhand, but who knows) might be concerned with not accidentally including a rare variant, in which case, they would want to use the upper limit of the confidence interval.

Other people might decide this is all too complex and use the empirical allele frequency (though as we can see here, this is a bad idea!) :)

brentp commented 8 years ago

The intervals look more sane after I update the code to be correct as you suggest: https://gist.github.com/brentp/172e213a87f3f6188891

We are thinking of making this callable somehow, given that the user has AC and AN.

Another option is to include aggregate columns in the spirit of max_aaf_all: one for min(lower CI) and one for max(upper CI) across ExAC, ESP, and 1KG.

jxchong commented 8 years ago

I like the aggregate option because it is easy/simple to replicate without mistakes over the long run and because now that you've gone through the effort of implementing this, I'd personally choose to use max_aaf_all_lowerCI over max_aaf_all! ;)

brentp commented 8 years ago

Hi everyone, the columns for ExAC (exac_num_het, exac_num_hom_alt, exac_num_chroms) are now available in master. We have a few things to go in before the next release, but you can access them with gemini update --devel. Let us know any feedback.

jxchong commented 8 years ago

Cool! Are you eventually (future release) planning to do the confidence interval thing for each of the reference databases? (ExAC, 1KG, EVS, etc.)? If not, any idea if it's possible to replicate in SQL on the command line at run-time?

brentp commented 8 years ago

I had a script started to do this. For starters, we'd want to store the maximum of the lower CI across ExAC populations, agreed?

brentp commented 8 years ago

@jxchong I put the simple script here: https://gist.github.com/brentp/172e213a87f3f6188891

It's runnable as python af_lim.py your.gemini.db

We'll have to figure out the details later, but it's a start. I thought it might be more useful to save the max of the max CI...

jxchong commented 8 years ago

Ah cool. Yes, I agree: max of the lower CI across all populations. I'm not sure about the max of the upper CI, as I can't think of an immediate use case myself, so I have no opinion.
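
Putting the pieces together, the aggregate column could be derived along these lines (a sketch that assumes the corrected af_lims above; the per-population (allele count, chromosome count) pairs are illustrative, and the chromosome count should be the per-variant value such as exac_num_chroms, per the note-to-self below):

def max_lower_ci(pop_counts):
    # pop_counts: iterable of (allele_count, chrom_count) pairs, one
    # per population. Returns the largest lower CI bound across the
    # populations, or None if no population had genotyped chromosomes.
    lowers = []
    for ac, an in pop_counts:
        if an == 0:
            continue  # nothing genotyped; no estimate possible
        lo, _ = af_lims(ac / float(an), an)
        lowers.append(lo)
    return max(lowers) if lowers else None

The analogous max-of-upper-CI column would collect the second element of each pair instead.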

brentp commented 8 years ago

note-to-self (and to anyone using the above gist): that calculation needs to be adjusted to use exac_num_chroms rather than 60K since the AF for some variants is estimated using only a small number of samples...

brentp commented 8 years ago

I am closing this issue for now, as the original concern is resolved and the exac_num_het and exac_num_hom_alt columns have been added. I have opened #609 as a new issue for the allele frequency confidence intervals.

nfarzaneh commented 8 years ago

Hello Brent,

I am not sure if this was already discussed, so I apologize if I am repeating this point below:

Since the SNPs at a given locus are normalized in GEMINI, we have 2 entries (rows) for tri-allelic sites. When you pull ExAC's AC_Hom (after it has been tidied up), you do rightly get 2 entries for each tri-allelic SNP. However, after the tidy operation we have lost which value corresponds to which combination of alternate alleles.

For example,

GEMINI OUTPUT:

chrom   start   end rs_ids  ref alt exac_num_het    exac_num_hom_alt
chr17   39346594    39346595    rs11283848,rs71155128   T   TGCTGTGGGTCCAGCTGCTGCCAGCCTA    6079,0,0,0,1,0,11,0,1,1,1,0,0,0,0,0,0,0,0,0,0   41827,0,0,0,0,0

ExAC.r0.3.sites.vep.tidy.vcf:

17  39346595    rs11283848  T   TGCTGTGGGTCTAGCTGCTGCCAGCCTA    1.12247e+08 VQSRTrancheINDEL96.00to97.00    AC=108476,13,1,1,2,1;AC_AFR=6954,0,0,0,0,0;AC_AMR=8260,0,0,0,0,0;AC_Adj=89747,11,0,1,2,1;AC_EAS=6648,0,0,0,1,0;AC_FIN=4828,0,0,0,0,0;AC_Het=6079,0,0,0,1,0,11,0,1,1,1,0,0,0,0,0,0,0,0,0,0;AC_Hom=41827,0,0,0,0,0;AC_NFE=48218,11,0,1,1,0;AC_OTH=689,0,0,0,0,0;AC_SAS=14150,0,0,0,0,1;AF=0.905,0.0001085,8.343e-06,8.343e-06,1.669e-05,8.343e-06;AN=119856;AN_AFR=7172;AN_AMR=8534;AN_Adj=96420;AN_EAS=7174;AN_FIN=5264;AN_NFE=52878;AN_OTH=726;AN_SAS=14672;BaseQRankSum=0.95;ClippingRankSum=-0.207;DB;DP=1429039;FS=0.525;GQ_MEAN=121.53;GQ_STDDEV=258.41;Het_AFR=194,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;Het_AMR=264,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;Het_EAS=486,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;Het_FIN=412,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;Het_NFE=4191,0,0,0,0,0,11,0,1,1,0,0,0,0,0,0,0,0,0,0,0;Het_OTH=33,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;Het_SAS=499,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0;Hom_AFR=3380,0,0,0,0,0;Hom_AMR=3998,0,0,0,0,0;Hom_EAS=3081,0,0,0,0,0;Hom_FIN=2208,0,0,0,0,0;Hom_NFE=22007,0,0,0,0,0;Hom_OTH=328,0,0,0,0,0;Hom_SAS=6825,0,0,0,0,0;

ExAC original VCF (the original ExAC file has 6 alternate alleles at that position, which determines the order in which the allele counts are represented in GEMINI's exac_num_het and exac_num_hom_alt columns):

17  39346595    rs11283848  T   TGCTGTGGGTCCAGCTGCTGCCAGCCTA,TGCTGTGGGTCCAGCTGCTACCAGCCTA,TA,TGCTATGGGTCCAGCTGCTGCCAGCCTA,TGCTGTGGGTCCAGCTGCTGTCAGCCTA,TGCTGTGGGTCTAGCTGCTGCCAGCCTA 112247467.38 VQSRTrancheINDEL96.00to97.00    AC=108476,13,1,1,2,1;AC_AFR=6954,0,0,0,0,0;AC_AMR=8260,0,0,0,0,0;AC_Adj=89747,11,0,1,2,1;AC_EAS=6648,0,0,0,1,0;AC_FIN=4828,0,0,0,0,0;AC_Het=6079,0,0,0,1,0,11,0,1,1,1,0,0,0,0,0,0,0,0,0,0;AC_Hom=41827,0,0,0,0,0;AC_NFE=48218,11,0,1,1,0;AC_OTH=689,0,0,0,0,0;AC_SAS=14150,0,0,0,0,1;AF=0.905,1.085e-04,8.343e-06,8.343e-06,1.669e-05,8.343e-06;AN=119856;AN_AFR=7172;AN_AMR=8534;AN_Adj=96420;AN_EAS=7174;AN_FIN=5264;AN_NFE=52878;AN_OTH=726;AN_SAS=14672;BaseQRankSum=0.950;ClippingRankSum=-2.070e-01;DB;DP=1429039;FS=0.525;GQ_MEAN=121.53;GQ_STDDEV=258.41;Het_AFR=194,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;Het_AMR=264,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;Het_EAS=486,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;Het_FIN=412,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;Het_NFE=4191,0,0,0,0,0,11,0,1,1,0,0,0,0,0,0,0,0,0,0,0;Het_OTH=33,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;Het_SAS=499,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0;Hom_AFR=3380,0,0,0,0,0;Hom_AMR=3998,0,0,0,0,0;Hom_EAS=3081,0,0,0,0,0;Hom_FIN=2208,0,0,0,0,0;Hom_NFE=22007,0,0,0,0,0;Hom_OTH=328,0,0,0,0,0;Hom_SAS=6825,0,0,0,0,0;

So, since the one-to-one correspondence with the alternate alleles gets lost when the tidy function is used, the numbers in the exac_num_het and exac_num_hom_alt columns do not let us determine which combination of alternates each value belongs to.
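
(For concreteness: in the record above, AC_Hom has 6 entries, one per ALT, so it can be split per allele, while AC_Het has 21 = 7*6/2 entries, one per unordered pair drawn from REF plus the 6 ALTs, so a per-allele split necessarily loses the pairing. A quick check along these lines, with illustrative names, distinguishes the two cases:)

def splittable_per_allele(info_value, n_alts):
    # True if a comma-separated INFO value has exactly one entry per
    # ALT allele and so survives per-allele decomposition; False for
    # combinatorial fields like AC_Het, whose length grows with the
    # number of allele pairs rather than the number of ALTs.
    return len(info_value.split(",")) == n_alts

splittable_per_allele("41827,0,0,0,0,0", 6)  # AC_Hom -> True
splittable_per_allele("6079,0,0,0,1,0,11,0,1,1,1,0,0,0,0,0,0,0,0,0,0", 6)  # AC_Het -> False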

I hope my explanation made sense and please let me know if you have any questions regarding this. Perhaps I am not understanding this correctly myself.

Kind regards, Nima Farzaneh

brentp commented 8 years ago

@nfarzaneh I'll have to think about this further when I have more time to fully understand the problem. I changed the formatting of your post a bit so I could read it. It seems this is a problem with the vt decompose method that we use to do the alternate splitting; is that what you're thinking?

nfarzaneh commented 8 years ago

Yes, exactly. The alternates aren't being split properly. Sorry about the formatting. It looks like vt decompose just breaks the alternates down, but most of the information is contained in the last column of the ExAC file. Each of those comma-delimited values needs to correspond to its respective alternate, which isn't currently the case in GEMINI's output for those two exac columns.

brentp commented 8 years ago

Got it. Thanks for the explanation. I'll have to reopen this issue to track the problem.

arq5x commented 8 years ago

What version of vt are you using?

nfarzaneh commented 8 years ago

Hello Aaron,

vt decompose v0.5.

Thanks, Nima

nfarzaneh commented 8 years ago

The ExAC tidy file I posted is from the gemini/data/ folder, and it shows the same pattern I mentioned above.

brentp commented 8 years ago

Hi @nfarzaneh, it seems like you have an old version of the ExAC annotation. The md5sum should be 93a4393680c0fb8f9c4e5ffbef755bae; you can remove the old file and fetch a new one manually with:

wget https://s3.amazonaws.com/gemini-annotations/ExAC.r0.3.sites.vep.tidy.vcf.gz
wget https://s3.amazonaws.com/gemini-annotations/ExAC.r0.3.sites.vep.tidy.vcf.gz.tbi

in your annotations directory. This will resolve the problem with the Hom_ fields, but the Het_ counts are still not working as expected in vt.

nfarzaneh commented 8 years ago

Excellent, thanks. I can confirm that the issue has been resolved. Just a follow-up question, then: I have been updating with "gemini update --devel" in the past, before the vt step became part of the update script.

Does this mean that I should occasionally purge my /gemini/data folder? Or is there a way you can add md5sum checks to make sure that the user has the appropriate database version in their directory?

Thanks a lot for your quick response and again much appreciated for such an amazing piece of software.

Nima

brentp commented 8 years ago

I'm not sure why the ExAC file hasn't been getting updated. We do plan to add md5sum checking in the future. Note that there are still problems with the Het_ counts; we'll have to get around the problems with decomposition in order to address those.
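
Such a check could be as simple as the sketch below (the expected hash is the one quoted earlier in this thread; the function name and chunked reading are illustrative, not gemini's actual implementation):

import hashlib

EXPECTED_EXAC_MD5 = "93a4393680c0fb8f9c4e5ffbef755bae"

def md5_matches(path, expected=EXPECTED_EXAC_MD5, chunk=1 << 20):
    # Hash the annotation file in chunks to avoid loading it into
    # memory, then compare against the published checksum.
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            h.update(block)
    return h.hexdigest() == expected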

nfarzaneh commented 8 years ago

Having the HET counts would definitely be an added bonus in the future. The HOM counts are quite useful too! Thanks for implementing.