ANGSD / angsd

Program for analysing NGS data.
228 stars 50 forks source link

SNP-based heritability #285

Open z0on opened 4 years ago

z0on commented 4 years ago

Hello - would anyone please suggest an ANGSD way to estimate SNP-based heritability of a trait? cheers Misha

nspope commented 4 years ago

Hi Misha, a common way to estimate heritability is via a simple mixed effects model with random coefficients per SNP, where

y_i = alpha + z_i' beta + epsilon_i

and y_i is trait value for ith individual, z_i is row vector of genotypes for ith individual (scaled and centered for each SNPs), beta is vector of (Gaussian) random coefficients per SNP with std deviation tau, and epsilon_i is iid (Gaussian) error with std deviation sigma. Heritability is then

h^2 = tau^2 / (tau^2 + sigma^2).

Because of the Gaussian assumption, this model simplifies (by integrating out beta) to

Y ~ multivariate_normal(alpha, tau^2 Z'Z/L + sigma^2 I)

where "I" is the identity matrix and "L" is the number of loci. Given the definition of z_i above, the matrix Z'Z/L is the genetic correlation matrix among individuals (and is the only way in which genotypes enter the model). This matrix can be estimated with ANGSD, by sampling random haplotypes or by using genotype posterior probabilities. Then, the only free parameters in the model are alpha, tau and sigma ... and it's just a generalized least squares problem. Fit it via REML using R, or scikit-learn, or whatever ...

There are more complex schemes that weight SNPs by LD and by allele frequency. Both can introduce bias. IIRC, with the Gaussian assumptions, this would just consist of rescaling rows/columns of Z. But it might take a little fiddling to figure this out with ANGSD output/input.

Nate

z0on commented 4 years ago

Hi Nate - thanks a lot for rapid response! I follow you about halfway and probably will be able to decipher the rest tomorrow. Meanwhile here is a question: with all that, would it be possible to add a matrix of geographical distances between individuals as a covariate?

Misha

On Tue, Feb 4, 2020 at 5:12 PM nspope notifications@github.com wrote:

Hi Misha, a common way to estimate heritability is via a simple mixed effects model with random coefficients per SNP, where

y_i = alpha + z_i' beta + epsilon_i

and y_i is trait value for ith individual, z_i is row vector of genotypes for ith individual (scaled and centered for each SNPs), beta is vector of (Gaussian) random coefficients per SNP with std deviation tau, and epsilon_i is iid (Gaussian) error with std deviation sigma. Heritability is then

h^2 = tau^2 / (tau^2 + sigma^2).

Because of the Gaussian assumption, this model simplifies (by integrating out beta) to

Y ~ multivariate_normal(alpha, tau^2 Z'Z/L + sigma^2 I)

where "I" is the identity matrix and "L" is the number of loci. Given the definition of z_i above, the matrix Z'Z/L is the genetic correlation matrix among individuals (and is the only way in which genotypes enter the model). This matrix can be estimated with ANGSD, by sampling random haplotypes or by using genotype posterior probabilities. Then, the only free parameters in the model are alpha, tau and sigma ... and it's just a generalized least squares problem. Fit it via REML using R, or scikit-learn, or whatever ...

There are more complex schemes that weight SNPs by LD and by allele frequency. Both can introduce bias. IIRC, with the Gaussian assumptions, this would just consist of rescaling rows/columns of Z. But it might take a little fiddling to figure this out with ANGSD output/input.

Nate

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/285?email_source=notifications&email_token=ABZUHGGIGLVL7DJYUKYV3WLRBHY4FA5CNFSM4KPLP53KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEKZRGXA#issuecomment-582161244, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZUHGESQO5B7ZFX6GCVQY3RBHY4FANCNFSM4KPLP53A .

z0on commented 4 years ago

Another quick question: so you would use random-read-resampling IBS (my favorite angsd thing by far) or covariance matrix, instead of relatedness? (Which we can get from ngsRelate easily)

iPhone. Apologies for brevity

On Feb 4, 2020, at 5:12 PM, nspope notifications@github.com wrote:

 Hi Misha, a common way to estimate heritability is via a simple mixed effects model with random coefficients per SNP, where

y_i = alpha + z_i' beta + epsilon_i

and y_i is trait value for ith individual, z_i is row vector of genotypes for ith individual (scaled and centered for each SNPs), beta is vector of (Gaussian) random coefficients per SNP with std deviation tau, and epsilon_i is iid (Gaussian) error with std deviation sigma. Heritability is then

h^2 = tau^2 / (tau^2 + sigma^2).

Because of the Gaussian assumption, this model simplifies (by integrating out beta) to

Y ~ multivariate_normal(alpha, tau^2 Z'Z/L + sigma^2 I)

where "I" is the identity matrix and "L" is the number of loci. Given the definition of z_i above, the matrix Z'Z/L is the genetic correlation matrix among individuals (and is the only way in which genotypes enter the model). This matrix can be estimated with ANGSD, by sampling random haplotypes or by using genotype posterior probabilities. Then, the only free parameters in the model are alpha, tau and sigma ... and it's just a generalized least squares problem. Fit it via REML using R, or scikit-learn, or whatever ...

There are more complex schemes that weight SNPs by LD and by allele frequency. Both can introduce bias. IIRC, with the Gaussian assumptions, this would just consist of rescaling rows/columns of Z. But it might take a little fiddling to figure this out with ANGSD output/input.

Nate

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

nspope commented 4 years ago

Yes ... I'm assuming you want to "account" for spatial dependence in trait when estimating heritability (or test that it is present)? For example, if some spatially-correlated, unmeasured environmental factor has an influence on phenotype but not on genotype?

If so, the errors \epsilon_i in my first equation (if you can call it that) would not be iid Gaussian but instead a spatially-correlated Gaussian. The distribution of the trait becomes Y ~ multivariate_normal(\alpha, \tau^2 Z'Z/L + \sigma^2 S), where S is a spatial correlation matrix. One possible spatial correlation structure is exp(-\gamma * distance_ij) where \gamma is the "range" of spatial dependence. Still easy to fit; just one more parameter.

nspope commented 4 years ago

To second question: yes, I would use IBS matrix or the correlation matrix derived from covariance matrix. There are a bunch of different relatedness coefficients split out by ngsRelate. I think one (some?) of them might be estimators of genetic correlation, and so reasonable to use, but I honestly can't remember

z0on commented 4 years ago

I think I got the first “equation” model, and the second one seems to be the variance structure of that model, yes? This looks familiar, I used to do these fancy variance models in MCMCglmm . But I am not sure how to fit second one directly. What is Y, the response variable, in that case?

iPhone. Apologies for brevity

On Feb 4, 2020, at 7:57 PM, nspope notifications@github.com wrote:

 Yes ... I'm assuming you want to "account" for spatial dependence in trait when estimating heritability (or test that it is present)? For example, if some spatially-correlated, unmeasured environmental factor has an influence on phenotype but not on genotype?

If so, the errors \epsilon_i in my first equation (if you can call it that) would not be iid Gaussian but instead a spatially-correlated Gaussian. The distribution of the trait becomes Y ~ multivariate_normal(\alpha, \tau^2 Z'Z/L + \sigma^2 S), where S is a spatial correlation matrix. One possible spatial correlation structure is exp(-\gamma * distance_ij). Still easy to fit; just one more parameter.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

z0on commented 4 years ago

got it. maybe it is better to add another variance term, just like tau but based on spatial proximity matrix, and leave the original error term (sigma) as is?

On Feb 4, 2020, at 7:57 PM, nspope notifications@github.com wrote:

Yes ... I'm assuming you want to "account" for spatial dependence in trait when estimating heritability (or test that it is present)? For example, if some spatially-correlated, unmeasured environmental factor has an influence on phenotype but not on genotype?

If so, the errors \epsilon_i in my first equation (if you can call it that) would not be iid Gaussian but instead a spatially-correlated Gaussian. The distribution of the trait becomes Y ~ multivariate_normal(\alpha, \tau^2 Z'Z/L + \sigma^2 S), where S is a spatial correlation matrix. One possible spatial correlation structure is exp(-\gamma * distance_ij). Still easy to fit; just one more parameter.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

z0on commented 4 years ago

i think i got it, Nate - in the second model we are still fitting individual trait values, only that the model has only the intercept and fancy covariance structure between those individuals. Yes?

On Feb 4, 2020, at 7:57 PM, nspope notifications@github.com wrote:

Yes ... I'm assuming you want to "account" for spatial dependence in trait when estimating heritability (or test that it is present)? For example, if some spatially-correlated, unmeasured environmental factor has an influence on phenotype but not on genotype?

If so, the errors \epsilon_i in my first equation (if you can call it that) would not be iid Gaussian but instead a spatially-correlated Gaussian. The distribution of the trait becomes Y ~ multivariate_normal(\alpha, \tau^2 Z'Z/L + \sigma^2 S), where S is a spatial correlation matrix. One possible spatial correlation structure is exp(-\gamma * distance_ij). Still easy to fit; just one more parameter.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.