StatGenPRD / HIBAGImputation

Scripts for facilitating execution of HIBAG to infer classical HLA genotypes for subjects based on SNV genotypes using public domain reference datasets
0 stars 3 forks source link

Need to confirm Rsq calculation #2

Open andrew-slater opened 9 years ago

andrew-slater commented 9 years ago

Based on minimac formula, seems Rsq ranges from 0-1 if the "Counts" in the numerator are the doses divided by 2 to scale from 0-1 (0 for homozygote Al2, 0.5 for het, and 1 for homozygote Al1) and p is the allele frequency (Freq1 which is the mean of the doses divided by 2). This is the formula currently (45cb8a544af85c13fbf88c57df00b8adb1e93f1d) used:

    #Calculate rsq - need to confirm this formula
    xbar <- colMeans(dose[,-c(1,2)]/2)
    x2bar <- colMeans((dose[,-c(1,2)]/2)^2)
    #Rounding to 5 to match minimac output
    info.df$Rsq = round((x2bar - xbar^2)/(xbar*(1-xbar)),5)
    #Replacing NaNs due to division by 0 when xbar is 0 with 0
    info.df$Rsq[is.na(info.df$Rsq)] = 0

With these conditions, when the genotypes are in perfect Hardy-Weinberg equilibrium, Rsq is 0.5. As they deviate towards an excess of hets, Rsq decreases towards 0. As they deviate towards an excess of homozygotes, Rsq increases towards 1.

Note, the statement as of the 31 January 2013 instance of the minimac wiki page referenced above, "poorly imputed genotype counts will shrink towards their expectations based on population allele frequencies alone," is somewhat confusing. It implies perfect Hardy-Weinberg equilibrium (Rsq 0.5) indicates a poorly imputed variant and deviations from this (Rsq greater than or less than 0.5) indicate higher quality imputation. The statement in Li et al 2009, "imputed allele counts for poorly imputed markers show less variability than expected based on allele frequency," is clearer. It implies that imputation quality decreases as the doses deviate from Hardy-Weinberg equilibrium (Rsq 0.5) towards an excess of hets (Rsq < 0.5). This aligns with the commonly used exclusion criteria of Rsq < 0.3.

andrew-slater commented 9 years ago

Upon further investigation, we observe Rsq values in minimac output are ~double what are produced by this formula and can be > 1. We can approximate these values using raw dose (not scaled) as counts and multiplying the denominator by 2 (9e1bff516f39250bf8a767f820e32418f75f7832):

  #Calculate rsq - need to confirm this formula
  xbar <- colMeans(dose[,-c(1,2)])
  x2bar <- colMeans((dose[,-c(1,2)])^2)
  #Rounding to 5 to match minimac output
  info.df$Rsq = round((x2bar - xbar^2)/(2*(xbar/2)*(1-xbar/2)),5)
  #Replacing NaNs due to division by 0 when xbar is 0 with 0
  info.df$Rsq[is.na(info.df$Rsq)] = 0

This formula does not give the exact same values as minimac since minimac calculates at the haplotype level, not genotype (per Toby).

The same relationship with Hardy-Weinberg equilibrium stands: Rsq 1 is perfect equilibrium. As doses deviate towards excess of hets, Rsq decreases towards 0. As doses deviate towards an excess of homozygotes, Rsq increases towards 2.

sks2772 commented 7 years ago

@andrew-slater Hi, I hope you are able to help with this. I am new to imputation and programming. I was trying to understand how the accuracy of the genetic imputation process is assessed. I understand that r2 metric is available in the .info file and can be used for this. As you have said in your first post above, I have seen people use the r2 threshold of 0.3, but I was unable to get a reference for this. Can you suggest any? Many thanks Sam

andrew-slater commented 7 years ago

Sorry, I don't know of a reference off the top of my head. You could try reaching out to the minimac3 group. Although, as with any quality metric, the appropriate threshold will vary depending on your data and what you intend to do with it.

sks2772 commented 7 years ago

Hi Andrew Thank you for taking time to respond to my email. Kind regards

On Mon, Jun 26, 2017 at 6:23 PM, Andrew Slater notifications@github.com wrote:

Sorry, I don't know of a reference off the top of my head. You could try reaching out to the minimac3 group https://groups.google.com/forum/embed/?place=forum/minimac4-help#!forum/minimac4-help. Although, as with any quality metric, the appropriate threshold will vary depending on your data and what you intend to do with it.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/StatGenPRD/HIBAGImputation/issues/2#issuecomment-311125605, or mute the thread https://github.com/notifications/unsubscribe-auth/AcVeYIOXp7VZUL4oVQI_i6PV1rbChyNWks5sH-ksgaJpZM4FaWL8 .