Inflated heritability estimates bug

Hi, I've been comparing the performance of kinship matrices produced by PopKin with those produced by traditional approaches (e.g. K = GGt/m, where G is the genotype matrix and m is the number of SNPs).

Briefly, I simulate phenotypes with h2 set to a given value, and then I estimate h2 with LDAK/GCTA. If I use standard kinship matrices, the h2 estimate is unbiased, i.e. close to the true h2 I set for the simulation. But if I generate the kinship matrix with PopKin, then the estimated h2 is always greater than it should be, especially at lower true h2 values.

I can give you a longer reproducible example, but I was wondering whether this is perhaps an already known bug?

Martin

I'm closing the issue because heritability estimation is not a function of this package, which provides an unbiased kinship estimator and there is no bug to fix as far as that is concerned.

That being said, I am interested in the problem of heritability estimation, and my preliminary results are that existing approaches (GCTA in particular) are actually biased estimators of heritability, and the use of popkin ameliorates one of the sources of bias though there are others (this problem is too hard to be solved just with a good kinship estimate). I've discovered that even simulating traits with a desired heritability can be tricky in practice, so it's possible to simulate with a biased heritability and then use a biased estimator that recovers the same biased heritability. So yes, I am interested in your reproducible example if you don't mind, as I would like to know how you're simulating the trait. Although the issue is closed, we can continue the conversation here or elsewhere. Thanks!

The fact that it is not possible to recover a known h2 from a PopKin-derived kinship matrix indicates that there is something wrong with PopKin, even if h2 estimation is not part of your package. So please consider not dismissing this report out of hand. I was very impressed by the method and your papers and would like to use it in a project, but I cannot if this issue is not fixed.

Here is a code snippet to reproduce the bug. It simulates genotypes and phenotypes with a given h2, and then tries to recover h2 via either PopKin or standard kinship matrices. I used GCTA, but you could use LDAK or simple HE regression, the result will always be the same: inflated h2 estimates with PopKin.

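# experiment params
outLoc = "<WHERE YOUR OUTPUT GOES>"
gctaoloc = "<LOCATION OF GCTA EXECUTABLE: /gcta_v1.94.0Beta_windows_x86_64>"
library(genio)
library(popkin)
set.seed(42) # set the seed before any random draw, so that MAFs are reproducible too
n = 1000
numSNPs = 1000
MAFs = runif(numSNPs, 0.01, 0.5) # allele frequencies used to draw the genotypes
numTests = 20
h2 = 0.25 # also run with 0.5 and 0.75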

# simulate pheno with given h2
simPheno_1VC = function(X, h2){
  n = nrow(X)
  p = ncol(X)
  beta = rnorm(p, mean = 0, sd = 1) 
  GV = X%*%beta # GV = sum(X * Beta) # the breeding values are simply the linear combination of the alleles and the effects
  scale = as.numeric( sqrt( h2 / var(GV) ) )
  g = GV * scale # scale the pure genetic values 
  noise = rnorm(n, mean = 0, sd = sqrt(1-h2))
  y = g + noise
  return(y)
}

# estimate h2 via GCTA
estimate_h2 = function(K, famData, outLoc){
  genio::write_grm(outLoc, K, fam = famData)
  cmd = paste0(gctaoloc, " --reml-est-fix --reml-alg 1  --reml --grm ", outLoc, " --pheno ", outLoc, ".phen --out ", outLoc)
  system(cmd)
  # read the 5th line, which contains the h2 estimate
  # (passing the path directly avoids leaving an open file connection behind)
  h2_line = readLines(paste0(outLoc, ".hsq"), n = 5)[5]
  h2 = as.numeric(strsplit(h2_line, "\t")[[1]][2]) # keep the second tab-separated field
  return(h2)
}

# run experiment
h2_standard = c()
h2_popkin = c()
for (i in 1:numTests) {
  # simulate genotype
  X_all = NULL
  for (j in 1:numSNPs) {
    X_1 = rbinom(n, 2, MAFs[j])
    X_all = cbind(X_all, X_1)
  }
  # simulate pheno
  y = simPheno_1VC(X_all, h2)

  # generate Kinship matrices, standard and PopKin way
  Xz = scale(X_all)
  K_standard = Xz %*% t(Xz) / ncol(X_all)
  K_PopKin = popkin::popkin(X_all, loci_on_cols = TRUE)

  # write out common data for GCTA
  iddata = paste0("indi_", 1:nrow(X_all))
  famData = cbind.data.frame(iddata, iddata)
  colnames(famData) = c("fam", "id")
  famData$pheno = scale(y)[,1]
  write.table(famData, paste0(outLoc, ".phen"), col.names = F, row.names = F, quote = F)

  # GCTA: estimate h2 via Standard kinship matrix
  h2_standard = c(h2_standard, estimate_h2(K_standard, famData, outLoc))

  # GCTA: estimate h2 via popkin
  h2_popkin = c(h2_popkin, estimate_h2(K_PopKin, famData, outLoc))
}

#                   true h2:   0.25       0.5        0.75
mean(h2_standard, na.rm = T) # 0.238839,  0.5229306, 0.7556879
mean(h2_popkin, na.rm = T)   # 0.4037516, 0.6750684, 0.8721976 -> very inflated!

Hi Martin,

I am sorry I haven't had time to respond further. I have been working on this heritability estimation problem with a student and we have some preliminary results, but there's no preprint yet or something else I'm able to share. There is related work below that you can read, not directly testing heritability estimation, but which contains relevant theoretical and empirical results.

In brief, the issue with your simulation is that you are using the sample variance var(GV) to normalize your trait, and this results in a bias, so the resulting heritability is not as specified. The heritability is a model parameter, not a statistic, so the rescaling you are using only works if using the model parameters, and not its sample estimates. In particular, the ancestral allele frequencies must also be known and used (in your code they are the MAFs, though the name is also incorrect as MAFs normally refer to sample estimates).

I wish I had time to present a more detailed proof of biases for your specific simulation. However, the key elements of such a proof are publicly available: key bias calculations involving sample variance and covariances ("second moments") using sample allele frequency estimates (Var(GV) is closely related to that category) were published in Ochoa and Storey (2021) below, and those results were used to develop a new trait simulation framework described in Yao and Ochoa (2022) and available in the R package simtrait. My code simulates traits while specifying the heritability correctly, either by using the true ancestral allele frequencies of the simulation (recommended) or by correcting for sample biases using a good estimate of the mean kinship value. The reasoning for the simulation algorithm is described in the simtrait vignette.

Ochoa and Storey (2021): https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1009241

Yao and Ochoa (2022): https://www.biorxiv.org/content/10.1101/2022.03.25.485885v1

simtrait: https://cran.r-project.org/web/packages/simtrait/index.html
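
The difference between normalizing by a sample statistic and by a model parameter can be sketched in a few lines of base R. This is only an illustration and not simtrait's actual implementation: it assumes independent loci and unrelated individuals drawn in HWE from known ancestral frequencies, under which the model variance of the genetic value is sum(2 * p * (1 - p) * beta^2). All variable names here are made up for the example.

```r
set.seed(1)
n <- 500; m <- 200; h2 <- 0.25
p_anc <- runif(m, 0.05, 0.5)                      # true (ancestral) allele frequencies
X <- sapply(p_anc, function(p) rbinom(n, 2, p))   # individuals on rows, loci on columns
beta <- rnorm(m)
GV <- drop(X %*% beta)                            # unscaled genetic values

# model (parameter) variance under HWE with independent loci (assumption)
var_model <- sum(2 * p_anc * (1 - p_anc) * beta^2)
# sample variance of the realized genetic values, i.e. what var(GV) computes
var_sample <- var(GV)

# scaling by the parameter fixes h2 in the model;
# scaling by the statistic forces the realized genetic variance to equal h2 exactly
g_model  <- GV * sqrt(h2 / var_model)
g_sample <- GV * sqrt(h2 / var_sample)
c(model_scaled = var(g_model), sample_scaled = var(g_sample))
```

With unstructured genotypes like these, the two scalings differ only by sampling noise, so any systematic gap would have to come from relatedness or population structure, which this sketch does not simulate.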

Following your advice, I re-ran the experiments, this time generating the phenotype from the ancestral allele frequencies (via your "sim_trait" function):

# this replaces the phenotype simulation step in the loop above
#y = simPheno_1VC(X_all, h2)
library(simtrait)
obj <- sim_trait(
    X = X_all,
    m_causal = ncol(X_all),
    herit = h2,
    p_anc = MAFs,
    loci_on_cols =T
  )
y = obj$trait

I can confirm that this did not help: the PopKin heritability estimates are still inflated by about the same amount as when I simulated the phenotype using the sample genotype variance.

Hi Martin,

Sorry for my delayed reply, and for the essay/review I'm making you read ;), though I think I have useful tips finally:

I don't see an issue with your use of my sim_trait function in particular, so I think the explanation lies elsewhere, having to do with sample sizes and relatedness structure (see below). However, I just caught one subtle but highly consequential issue that definitely should be fixed first, right away:

There are two conventions in how kinship matrices are defined, the difference being a factor of 2! My popkin estimator assumes one convention, in which kinship values are IBD probabilities, and the consequence is that the self-kinship of a totally outbred individual is 1/2, the kinship of siblings is 1/4, etc (max kinship is 1 for completely inbred individuals and their twins). GCTA uses the other convention, in which totally outbred individuals have a self-kinship value of 1, siblings of 1/2, etc (max kinship is 2!). (Read enough papers and you'll see plenty of authors in each of these two camps.) Because of that, you must pass the popkin estimate to GCTA as 2 * K_PopKin! This is a huge source of bias in this case, because passing a kinship matrix that's half as large results in heritability estimates that are about twice as big when h^2 is small (the effect is dampened as h^2 approaches 1 because GCTA bounds estimates to a maximum of 1, but it certainly explains most of the upward bias you're seeing).
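
A back-of-the-envelope check of the size of this effect, under assumptions that are mine and not taken from the GCTA documentation: suppose the trait covariance is V = sigma^2 * (h^2 * 2K + (1 - h^2) * I), with K in the popkin convention, and suppose REML fits V = s_g * A + s_e * I without noise and reports s_g / (s_g + s_e). If A = K is passed instead of A = 2K, the fit absorbs the missing factor as s_g = 2 * h^2 * sigma^2 while s_e = (1 - h^2) * sigma^2 is unchanged, so the reported value becomes 2 * h^2 / (1 + h^2). The function name below is hypothetical.

```r
# Expected reported heritability when a kinship-scale matrix K is passed
# where a relatedness-scale matrix 2K is expected (idealized, noise-free fit)
h2_if_unscaled <- function(h2) 2 * h2 / (1 + h2)

h2_true <- c(0.25, 0.5, 0.75)
round(h2_if_unscaled(h2_true), 3)
# 0.400 0.667 0.857
```

These values are close to the popkin means reported in the reproduction script earlier in the thread (0.404, 0.675, 0.872), which is consistent with the factor of 2 being the main cause. In that script the corresponding change would be to call estimate_h2(2 * K_PopKin, famData, outLoc).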

These other criticisms of the evaluation are also probably relevant to this troubleshooting, but regardless I'm sure knowledgeable reviewers would ask for these changes anyway:

- The sample sizes (1000 individuals, 1000 SNPs) are super unrealistically small. Nobody estimates heritability with these methods with fewer than half a million or so SNPs or fewer than 5000 or so individuals. Because heritability is a statistical inference problem, having a large enough sample size is critical to getting correct estimates. In my larger evaluations I typically see much smaller, subtler differences between kinship estimators, and they're only significant after 100 replicates.
  - In addition, my popkin estimator is also unbiased only asymptotically (when the number of SNPs is very large), and its accuracy in particular is dependent on estimating the minimum kinship value well, which will be extremely noisy when there are only 1000 SNPs (I never went that low in my own simulations, I always use at least 100,000 SNPs). I'm starting to think this may be the next top reason you're seeing biased estimates for my estimator (after the factor of 2 issue), and that if you instead simulated at least 100,000 SNPs the bias due to finite SNPs would be negligible.
- Your simulated genotypes are unstructured (binomial data, all in HWE). This is an ill-defined inference problem, because here the true kinship matrix is a multiple of the identity matrix (in my scaling convention, K = 1/2 * I; GCTA usually defines kinship as 2*K for me, though), and in that case it is not possible to infer the heritability (because the non-genetic covariance is also a multiple of the identity matrix, the trait covariance V = sigma^2 * ( 2 * K * h^2 + (1-h^2) * I ) is actually identical for all heritability values). Because you're using estimated kinship matrices, they are not exactly the identity matrix solely due to estimation noise, but again the effect is that we're overfitting to noise entirely. Anyway, heritability estimation requires relatedness to be non-trivial, and in my tests it performs way better (lower estimation variance) when there are close relatives in the simulated data (think about how pre-genomics, and still, the gold standard of heritability estimation was using twin and sibling studies!). I know GCTA's argument was to estimate heritability in a population setting, excluding close relatives, but still there must be distant relatives (who are genome-wide more related to each other than to other people), and you don't get that at all simulating HWE data as you've done.
  - The standard kinship estimator also happens to be unbiased in this specific case of unstructured data (and only in this case), which is another reason this problem is sort of the most boring version of the evaluation: this is the only case where the standard estimator is expected to perform well and it is a very unrealistic case (real data never looks like this).
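
The non-identifiability argument for unstructured data can be checked numerically. The sketch below plugs the true kinship of unrelated, outbred individuals (K = 1/2 * I in the popkin convention) into the trait covariance formula given above and compares two very different heritabilities. The helper name trait_cov is made up for this example.

```r
n <- 5
I_n <- diag(n)
K <- 0.5 * I_n      # true kinship of unrelated, outbred individuals
sigma2 <- 1

# trait covariance V = sigma^2 * (2 * K * h^2 + (1 - h^2) * I)
trait_cov <- function(h2) sigma2 * (2 * K * h2 + (1 - h2) * I_n)

V_low  <- trait_cov(0.1)
V_high <- trait_cov(0.9)
all.equal(V_low, V_high)   # TRUE: both reduce to sigma2 * I
```

Since the likelihood depends on h^2 only through V, data generated this way carry no information about h^2 when the true K is used. Any signal an estimator picks up then comes from estimation noise in the kinship matrix.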

Other notes/questions:

In my evaluations I use gcta too, but I use default parameters, in particular I don't use --reml-est-fix --reml-alg 1. Perhaps that explains some of the differences in results too. Anyway, as I'm not familiar with these options, do you care to explain what they do and why you chose to use such non-default parameters? If there is a difference in conclusion, can you explain why you think the non-default version is more accurate? If not, why not use defaults, which is presumably what the vast majority of researchers use in practice?

Another minor comment, I just noticed your "standard" estimate is not exactly the standard estimate used by other people in another subtle way, and you might want to consider including the true standard estimator for coherence with previous work, including GCTA's default method:

Standard 1: The usual formula normalizes genotypes at each SNP i using sqrt( 2 * p_i_est * ( 1 - p_i_est ) ), where p_i_est = mean( x_i ) / 2 is an estimate of the allele frequency. This is not the sample standard deviation estimator, instead it is motivated by statistical genetics models.
Standard 2: You used scale to standardize the genotype matrix, then computed the cross product. The issue is that scale uses the sample standard deviation to normalize the genotypes.
Both "standard" estimators will be biased compared to popkin, though I'd expect a small difference between Standard 1 and Standard 2 (perhaps the difference becomes negligible with a large enough sample size, I don't know because I didn't study your version).
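
The two "standard" variants described above can be written side by side in base R. This is a rough sketch on small simulated HWE genotypes. The names K_std1 and K_std2 are made up here, and it is not a reimplementation of GCTA's own GRM code.

```r
set.seed(1)
n <- 200; m <- 500
p <- runif(m, 0.05, 0.5)
X <- sapply(p, function(pj) rbinom(n, 2, pj))   # individuals on rows, loci on columns
X <- X[, apply(X, 2, var) > 0, drop = FALSE]     # drop monomorphic loci, if any
m <- ncol(X)

# Standard 1: center by 2 * p_est, scale by sqrt(2 * p_est * (1 - p_est))
p_est <- colMeans(X) / 2
Z1 <- sweep(X, 2, 2 * p_est, "-")
Z1 <- sweep(Z1, 2, sqrt(2 * p_est * (1 - p_est)), "/")
K_std1 <- tcrossprod(Z1) / m

# Standard 2: center and scale each locus by its sample standard deviation
Z2 <- scale(X)
K_std2 <- tcrossprod(Z2) / m

# both are on the relatedness scale: diagonal near 1, not near 1/2
c(std1 = mean(diag(K_std1)), std2 = mean(diag(K_std2)))
# agreement of the off-diagonal entries between the two variants
cor(K_std1[upper.tri(K_std1)], K_std2[upper.tri(K_std2)])
```

Note that the mean diagonal of the scale-based variant is exactly (n - 1) / n by construction, whereas the allele-frequency-based variant is only approximately 1.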

Thank you so much for getting back to me, and I also enjoyed the longer, detailed explanations (it is always better to answer too long than too briefly).

I am happy to report that the issue was as you suspected: I needed to multiply the kinship matrix from PopKin by 2. After that, the difference in estimated h2 between your kinship matrices and the GCTA-formatted ones became <1%, even for such small sample/SNP numbers. (My actual studies are much larger; I only used fake data and 1000 individuals for demonstration purposes.)

I was actually aware of the distinction between coefficients of kinship and of relatedness, and that their usual relationship is r = 2*Ф. And I am sure that both you and GCTA mention this somewhere in the small print; however, despite having been a postdoc statistical geneticist for a few years now, I was still caught out by this! So I suspect that a great many people who want to use your method would make the same mistake. Perhaps you could add a warning in obvious places to alert the naïve user that if they expect to take your kinship matrices to other packages (and the GCTA format is by far the most popular), then they need to apply the 2 * K_PopKin correction.

StoreyLab / popkin

Inflated heritability estimates bug #3