StoreyLab / popkin

R package to estimate kinship and FST from SNP data
19 stars 0 forks source link

Inflated heritability estimates bug #3

Closed mkelcb closed 2 years ago

mkelcb commented 2 years ago

Hi I've been comparing the performance of kinship matrices produced by PopKin vs those produced by traditional approaches (eg K=GGt/m, where G is the genotype matrix, and m is the number of SNPs).

Briefly, I simulate phenotypes with h2 set at a given %, and then I estimate h2 with LDAK/GCTA. If I use standard Kinship matrices, h2 will be estimated unbiased, ie close to the true h2 I set for the simulation. But if I generate the kinship matrix with PopKin, then h2 will be always be greater than what it should be, especially at lower true h2s.

I can give you a longer reproducible code, but I was wondering if this was perhaps an already known bug?

Martin

alexviiia commented 2 years ago

I'm closing the issue because heritability estimation is not a function of this package, which provides an unbiased kinship estimator and there is no bug to fix as far as that is concerned.

That being said, I am interested in the problem of heritability estimation, and my preliminary results are that existing approaches (GCTA in particular) are actually biased estimators of heritability, and the use of popkin ameliorates one of the sources of bias though there are others (this problem is too hard to be solved just with a good kinship estimate). I've discovered that even simulating traits with a desired heritability can be tricky in practice, so it's possible to simulate with a biased heritability and then use a biased estimator that recovers the same biased heritability. So yes, I am interested in your reproducible example if you don't mind, as I would like to know how you're simulating the trait. Although the issue is closed, we can continue the conversation here or elsewhere. Thanks!

mkelcb commented 2 years ago

The fact that it is not possible to recover a known h2 from a PopKin-derived kinship matrix indicates that there is something wrong with PopKin, even if h2 estimation is not part of your package. So please consider not dismissing this report out of hand. I was very impressed by the method and your papers and would like to use it in a project, but I cannot if this issue is not fixed.

Here is a code snippet to reproduce the bug. It simulates genotypes and phenotypes with a given h2, and then tries to recover h2 via either PopKin or standard kinship matrices. I used GCTA, but you could use LDAK or simple HE regression, the result will always be the same: inflated h2 estimates with PopKin.

# experiment params
outLoc="<WHERE YOUR OUTPUT GOES>"
gctaoloc = "<LOCATION OF GCTA EXECUTABLE: /gcta_v1.94.0Beta_windows_x86_64>"
library(genio)
library(popkin)
n=1000
numSNPs=1000
MAFs = runif(numSNPs,0.01,0.5) 
numTests=20
h2= 0.25 # 0.25  # 0.5 # 0.75
set.seed(42)

# simulate pheno with given h2
simPheno_1VC = function(X, h2){
  n = nrow(X)
  p = ncol(X)
  beta = rnorm(p, mean = 0, sd = 1) 
  GV = X%*%beta # GV = sum(X * Beta) # the breeding values are simply the linear combination of the alleles and the effects
  scale = as.numeric( sqrt( h2 / var(GV) ) )
  g = GV * scale # scale the pure genetic values 
  noise = rnorm(n, mean = 0, sd = sqrt(1-h2))
  y = g + noise
  return(y)
}

# estimate h2 via GCTA
estimate_h2 = function(K, famData, outLoc ){
  genio::write_grm(outLoc,K, fam = famData)
  cmd=paste0(gctaoloc," --reml-est-fix --reml-alg 1  --reml --grm ",outLoc," --pheno ",outLoc,".phen --out ",outLoc)
  system(cmd) 
  h2 = readLines(file(paste0(outLoc,".hsq"), "r"), 5)[5] # read in the 5th line which contains the h2 estimate
  h2 = as.numeric(strsplit(h2, "\t")[[1]] [2] ) # clean it
  return(h2)
}

# run experiment
h2_standard =c()
h2_popkin=c()
for (i in 1:numTests) {
# simulate genotype
  X_all = NULL
for (j in 1:numSNPs) {
  X_1 =rbinom(n, 2, MAFs[j]) 
  X_all = cbind(X_all,X_1)
}
# simulate pheno
y = simPheno_1VC(X_all, h2)

# generate Kinship matrices, standard and PopKin way
Xz = scale(X_all)
K_standard = Xz %*% t(Xz) / ncol(X_all)   
K_PopKin = popkin::popkin(X_all, loci_on_cols = TRUE)

# write out common data for GCTA
iddata= paste0("indi_",1:nrow(X_all)) 
famData = cbind.data.frame(iddata,iddata)
colnames(famData) = c("fam","id")
famData$pheno = scale(y)[,1]
write.table(famData,paste0(outLoc,".phen"),col.names = F,row.names = F, quote=F)

# GCTA: estimate h2 via Standard kinship matrix
h2_standard = c(h2_standard,estimate_h2(K_standard, famData, outLoc) ) 

# GCTA: estimate h2 via popkin
h2_popkin = c(h2_popkin, estimate_h2(K_PopKin, famData, outLoc) )
}

#                   true h2:   0.25       0.5        0.75
mean(h2_standard, na.rm = T) # 0.238839,  0.5229306, 0.7556879
mean(h2_popkin, na.rm = T)   # 0.4037516, 0.6750684, 0.8721976 -> very inflated!
alexviiia commented 2 years ago

Hi Martin,

I am sorry I haven't had time to respond further. I have been working on this heritability estimation problem with a student and we have some preliminary results, but there's no preprint yet or something else I'm able to share. There is related work below that you can read, not directly testing heritability estimation, but which contains relevant theoretical and empirical results.

In brief, the issue with your simulation is that you are using the sample variance var(GV) to normalize your trait, and this results in a bias, so the resulting heritability is not as specified. The heritability is a model parameter, not a statistic, so the rescaling you are using only works if using the model parameters, and not its sample estimates. In particular, the ancestral allele frequencies must also be known and used (in your code they are the MAFs, though the name is also incorrect as MAFs normally refer to sample estimates).

I wish I had time to present a more detailed proof of biases for your specific simulation. However, the key elements of such a proof are publicly available: key bias calculations involving sample variance and covariances ("second moments") using sample allele frequency estimates (Var(GV) is closely related to that category) were published in Ochoa and Storey (2021) below, and those results were used to develop a new trait simulation framework described in Yao and Ochoa (2022) and available in the R package simtrait. My code simulates traits while specifying the heritability correctly, either by using the true ancestral allele frequencies of the simulation (recommended) or by correcting for sample biases using a good estimate of the mean kinship value. The reasoning for the simulation algorithm is described in the simtrait vignette.

Ochoa and Storey (2021): https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1009241

Yao and Ochoa (2022): https://www.biorxiv.org/content/10.1101/2022.03.25.485885v1

simtrait: https://cran.r-project.org/web/packages/simtrait/index.html

mkelcb commented 2 years ago

Following your advice, I re-run the experiments with generating the phenotype using the ancestral allele frequencies (via your "sim_trait" function):

# this replaces the part of the above
#y = simPheno_1VC(X_all, h2)
library(simtrait)
obj <- sim_trait(
    X = X_all,
    m_causal = ncol(X_all),
    herit = h2,
    p_anc = MAFs,
    loci_on_cols =T
  )
y = obj$trait
alexviiia commented 1 year ago

Hi Martin,

Sorry for my delayed reply, and for the essay/review I'm making you read ;), though I think I have useful tips finally:

I don't see an issue with your use of my sim_trait function in particular, so I think the explanation lies elsewhere, having to do with sample sizes and relatedness structure (see below). However, I just caught one subtle but highly consequential issue that definite should be fixed first, right away:

These other criticisms of the evaluation are also probably relevant to this troubleshooting, but regardless I'm sure knowledgeable reviewers would ask for these changes anyway:

Other notes/questions:

alexviiia commented 1 year ago

Another minor comment, I just noticed your "standard" estimate is not exactly the standard estimate used by other people in another subtle way, and you might want to consider including the true standard estimator for coherence with previous work, including GCTA's default method:

mkelcb commented 1 year ago

Thank you so much for getting back to me, and I also enjoyed the longer, detailed explanations (it is always better to answer too long than too briefly).

I am happy to report that the issue was as you suspected: I needed to multiply the Kinship matrix from PopKin by 2. After that, the difference in estimated h2s between yours and GCTA formatted kinship matrices became <1%. Even for such small sample/SNP numbers. (My actual studies are much larger, I only used fake data and 1000 indis for demonstration purposes).

I was actually aware of the distinction between coefs of kinship vs relatedness, and that their usual relationship is r =2*Ф. And I am sure that in the small print both you and GCTA mention this somewhere, however, despite being a postdoc statistical geneticist for a few years now, I was still caught out by this! So I suspect that a great many people who would want to use your method would make the same mistake too. Perhaps you could add some warning in obvious places to alert the naïve user that if they expect to take your kinship matrices to other packages, and by far the GCTA format is the most popular, then they would need to apply the 2 * K_PopKin correction.