Closed mkelcb closed 2 years ago
I'm closing the issue because heritability estimation is not a function of this package, which provides an unbiased kinship estimator and there is no bug to fix as far as that is concerned.
That being said, I am interested in the problem of heritability estimation, and my preliminary results are that existing approaches (GCTA in particular) are actually biased estimators of heritability, and the use of popkin ameliorates one of the sources of bias though there are others (this problem is too hard to be solved just with a good kinship estimate). I've discovered that even simulating traits with a desired heritability can be tricky in practice, so it's possible to simulate with a biased heritability and then use a biased estimator that recovers the same biased heritability. So yes, I am interested in your reproducible example if you don't mind, as I would like to know how you're simulating the trait. Although the issue is closed, we can continue the conversation here or elsewhere. Thanks!
The fact that it is not possible to recover a known h2 from a PopKin-derived kinship matrix indicates that there is something wrong with PopKin, even if h2 estimation is not part of your package. So please consider not dismissing this report out of hand. I was very impressed by the method and your papers and would like to use it in a project, but I cannot if this issue is not fixed.
Here is a code snippet to reproduce the bug. It simulates genotypes and phenotypes with a given h2, and then tries to recover h2 via either PopKin or standard kinship matrices. I used GCTA, but you could use LDAK or simple HE regression, the result will always be the same: inflated h2 estimates with PopKin.
# experiment params
outLoc="<WHERE YOUR OUTPUT GOES>"
gctaoloc = "<LOCATION OF GCTA EXECUTABLE: /gcta_v1.94.0Beta_windows_x86_64>"
library(genio)
library(popkin)
n=1000
numSNPs=1000
MAFs = runif(numSNPs,0.01,0.5)
numTests=20
h2= 0.25 # 0.25 # 0.5 # 0.75
set.seed(42)
# simulate pheno with given h2
simPheno_1VC = function(X, h2){
n = nrow(X)
p = ncol(X)
beta = rnorm(p, mean = 0, sd = 1)
GV = X%*%beta # GV = sum(X * Beta) # the breeding values are simply the linear combination of the alleles and the effects
scale = as.numeric( sqrt( h2 / var(GV) ) )
g = GV * scale # scale the pure genetic values
noise = rnorm(n, mean = 0, sd = sqrt(1-h2))
y = g + noise
return(y)
}
# estimate h2 via GCTA
estimate_h2 = function(K, famData, outLoc ){
genio::write_grm(outLoc,K, fam = famData)
cmd=paste0(gctaoloc," --reml-est-fix --reml-alg 1 --reml --grm ",outLoc," --pheno ",outLoc,".phen --out ",outLoc)
system(cmd)
h2 = readLines(file(paste0(outLoc,".hsq"), "r"), 5)[5] # read in the 5th line which contains the h2 estimate
h2 = as.numeric(strsplit(h2, "\t")[[1]] [2] ) # clean it
return(h2)
}
# run experiment
h2_standard =c()
h2_popkin=c()
for (i in 1:numTests) {
# simulate genotype
X_all = NULL
for (j in 1:numSNPs) {
X_1 =rbinom(n, 2, MAFs[j])
X_all = cbind(X_all,X_1)
}
# simulate pheno
y = simPheno_1VC(X_all, h2)
# generate Kinship matrices, standard and PopKin way
Xz = scale(X_all)
K_standard = Xz %*% t(Xz) / ncol(X_all)
K_PopKin = popkin::popkin(X_all, loci_on_cols = TRUE)
# write out common data for GCTA
iddata= paste0("indi_",1:nrow(X_all))
famData = cbind.data.frame(iddata,iddata)
colnames(famData) = c("fam","id")
famData$pheno = scale(y)[,1]
write.table(famData,paste0(outLoc,".phen"),col.names = F,row.names = F, quote=F)
# GCTA: estimate h2 via Standard kinship matrix
h2_standard = c(h2_standard,estimate_h2(K_standard, famData, outLoc) )
# GCTA: estimate h2 via popkin
h2_popkin = c(h2_popkin, estimate_h2(K_PopKin, famData, outLoc) )
}
# true h2: 0.25 0.5 0.75
mean(h2_standard, na.rm = T) # 0.238839, 0.5229306, 0.7556879
mean(h2_popkin, na.rm = T) # 0.4037516, 0.6750684, 0.8721976 -> very inflated!
Hi Martin,
I am sorry I haven't had time to respond further. I have been working on this heritability estimation problem with a student and we have some preliminary results, but there's no preprint yet or something else I'm able to share. There is related work below that you can read, not directly testing heritability estimation, but which contains relevant theoretical and empirical results.
In brief, the issue with your simulation is that you are using the sample variance var(GV)
to normalize your trait, and this results in a bias, so the resulting heritability is not as specified. The heritability is a model parameter, not a statistic, so the rescaling you are using only works if using the model parameters, and not its sample estimates. In particular, the ancestral allele frequencies must also be known and used (in your code they are the MAFs
, though the name is also incorrect as MAFs normally refer to sample estimates).
I wish I had time to present a more detailed proof of biases for your specific simulation. However, the key elements of such a proof are publicly available: key bias calculations involving sample variance and covariances ("second moments") using sample allele frequency estimates (Var(GV)
is closely related to that category) were published in Ochoa and Storey (2021) below, and those results were used to develop a new trait simulation framework described in Yao and Ochoa (2022) and available in the R package simtrait
. My code simulates traits while specifying the heritability correctly, either by using the true ancestral allele frequencies of the simulation (recommended) or by correcting for sample biases using a good estimate of the mean kinship value. The reasoning for the simulation algorithm is described in the simtrait
vignette.
Ochoa and Storey (2021): https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1009241
Yao and Ochoa (2022): https://www.biorxiv.org/content/10.1101/2022.03.25.485885v1
simtrait: https://cran.r-project.org/web/packages/simtrait/index.html
Following your advice, I re-run the experiments with generating the phenotype using the ancestral allele frequencies (via your "sim_trait" function):
# this replaces the part of the above
#y = simPheno_1VC(X_all, h2)
library(simtrait)
obj <- sim_trait(
X = X_all,
m_causal = ncol(X_all),
herit = h2,
p_anc = MAFs,
loci_on_cols =T
)
y = obj$trait
Hi Martin,
Sorry for my delayed reply, and for the essay/review I'm making you read ;), though I think I have useful tips finally:
I don't see an issue with your use of my sim_trait
function in particular, so I think the explanation lies elsewhere, having to do with sample sizes and relatedness structure (see below). However, I just caught one subtle but highly consequential issue that definite should be fixed first, right away:
2 * K_PopKin
! This is a huge source of bias in this case, because passing a kinship matrix that's half as small results in heritability estimates that are twice as big when h^2 is small (the effect is dampened as h^2 approaches 1 because GCTA bounds estimates to a maximum of 1, but certainly explains most of the upward bias you're seeing).These other criticisms of the evaluation are also probably relevant to this troubleshooting, but regardless I'm sure knowledgeable reviewers would ask for these changes anyway:
K = 1/2 * I
; GCTA usually defines kinship as 2*K
for me, though), and in that case it is not possible to infer the heritablity (because the non-genetic covariance is also a multiple of the identity matrix, the trait covariance V = sigma^2 * ( 2 * K * h^2 + (1-h^2)* I )
is actually identical for all heritability values). Because you're using estimated kinship matrices, they are not exactly the identity matrix solely due to estimation noise, but again the effect is that we're overfitting to noise entirely. Anyway, heritability estimation requires relatedness to be non-trivial, and in my tests it performs way better (lower estimation variance) when there are close relatives in the simulated data (think about how pre-genomics, and still, the gold standard of heritability estimation was using twin and sibling studies!). I know GCTA's argument was to estimate heritability in a population setting, excluding close relatives, but still there must be distant relatives (who are genome-wide more related to each other than to other people), and you don't get that at all simulating HWE data as you've done.
Other notes/questions:
gcta
too, but I use default parameters, in particular I don't use --reml-est-fix --reml-alg 1
. Perhaps that explains some of the differences in results too. Anyway, as I'm not familiar with these options, do you care to explain what they do and why you chose to use such non-default parameters? If there is a difference in conclusion, can you explain why you think the non-default version is more accurate? If not, why not use defaults, which is presumably what the vast majority of researchers use in practice?Another minor comment, I just noticed your "standard" estimate is not exactly the standard estimate used by other people in another subtle way, and you might want to consider including the true standard estimator for coherence with previous work, including GCTA's default method:
i
using sqrt( 2 * p_i_est * ( 1 - p_i_est ) )
, where p_i_est = mean( x_i ) / 2
is an estimate of the allele frequency. This is not the sample standard deviation estimator, instead it is motivated by statistical genetics models.scale
to standarize the genotype matrix, then computed the cross product. The issue is that scale
uses the sample standard deviation to normalize the genotypes.Thank you so much for getting back to me, and I also enjoyed the longer, detailed explanations (it is always better to answer too long than too briefly).
I am happy to report that the issue was as you suspected: I needed to multiply the Kinship matrix from PopKin by 2. After that, the difference in estimated h2s between yours and GCTA formatted kinship matrices became <1%. Even for such small sample/SNP numbers. (My actual studies are much larger, I only used fake data and 1000 indis for demonstration purposes).
I was actually aware of the distinction between coefs of kinship vs relatedness, and that their usual relationship is r =2*Ф
. And I am sure that in the small print both you and GCTA mention this somewhere, however, despite being a postdoc statistical geneticist for a few years now, I was still caught out by this! So I suspect that a great many people who would want to use your method would make the same mistake too. Perhaps you could add some warning in obvious places to alert the naïve user that if they expect to take your kinship matrices to other packages, and by far the GCTA format is the most popular, then they would need to apply the 2 * K_PopKin
correction.
Hi I've been comparing the performance of kinship matrices produced by PopKin vs those produced by traditional approaches (eg K=GGt/m, where G is the genotype matrix, and m is the number of SNPs).
Briefly, I simulate phenotypes with h2 set at a given %, and then I estimate h2 with LDAK/GCTA. If I use standard Kinship matrices, h2 will be estimated unbiased, ie close to the true h2 I set for the simulation. But if I generate the kinship matrix with PopKin, then h2 will be always be greater than what it should be, especially at lower true h2s.
I can give you a longer reproducible code, but I was wondering if this was perhaps an already known bug?
Martin