ajaynadig / bhr

Suite of heritability and genetic correlation estimation tools for exome-sequencing data
MIT License
31 stars 6 forks source link

subset of SNPs expalin more heritability than all SNPs #4

Closed CharleyXia closed 1 year ago

CharleyXia commented 1 year ago

Hi, I'm playing around with the BipEX example. Apart from the PTV, missense, and synonymous subsets mentioned in the tutorial, I created a sumstats for all protein-coding SNVs combined. I estimated the BHR heritability (h2_BHR) for all protein-coding SNVs and found that h2_BHR expalined by all SNVs (0.0071 +/- 0.0027) is smaller than h2_BHR explained by PTVs alone (0.0179 +/- 0.003). And I got the same pattern in my own data. Although within the same trait, the difference in point estimates of all-SNV h2_BHR and PTV-SNV h2_BHR does not differ significantly, but the pattern across traits does suggest that in general all-SNV h2_BHR is smaller than PTV-SNV h2_BHR. That does not match the additive model, where the heritability expalined by a subset of SNVs should be <= the heritability expalined by all SNVs. Any comments and solutions for this?

my results

Trait  | MAF | all.h2 | all.se | ptv.h2 | ptv.se trait1 | <10^-5 | 0.001235574 | 0.001937 | 0.006583 | 0.002394098 trait2 | <10^-5 | 0.00156597 | 0.000944 | 0.003333 | 0.001383492 trait3 | <10-4 | 0.006017338 | 0.004822 | 0.000797 | 0.003437376

codes for BipEX example are provided below

bp_sumstats_all = wrangle_sumstats(bp_variantlevel_bipex, 14210, 14422, bp_variantlevel_bipex$consequence != "non_coding") bp_all_bhr <- BHR(bp_sumstats_all, annotations = list(baseline_model), num_blocks = 100, mode = "univariate") print(paste0("BP All PC SNVs Burden Heritability: ", round(bp_all_bhr$mixed_model$heritabilities[1,5] bp_scalingfactor, 4), " ", "(", round(bp_all_bhr$mixed_model$heritabilities[2,5] bp_scalingfactor, 4), ")")) "BP All PC SNVs Burden Heritability: 0.0071 (0.0027)"

CharleyXia commented 1 year ago

After reading the methdology of BHR paper again, I think it is due to the BHR method. I am not an expert in maths but it seems to me that the h2_BHR is a function of the product of (average buren effects estimated in a selected set of genes) times (average allele frequences). On condition of the allele frequencies and genes, the (average buren effects) estimated from PTVs alone is expected to be much greater than (average buren effects) estimated from all SNVs (espcially for rare alleles), thus causing the observed problem (i.e. non-additive heritaiblity). Is that correct?

ajaynadig commented 1 year ago

Hi Charley, Thanks for your question, which relates to a crucial point about burden heritability that we could probably make more clear in the documentation.

Your second message is correct--this phenomenon is because burden heritability is different from total heritability. For total heritability, things should be additive as you suggest. However, when we calculate burden heritability under the default burden weights in BHR, variability in effect size of included variants causes the burden heritability to be lower than the total heritability.

To make this concrete, suppose that you had N PTVs, and ran BHR. Because the PTVs have very similar effects, the burden heritability is likely very close to the total heritability. Now, suppose you additionally include N null SNPs. Then, the cumulative frequency has doubled, but the mean effect size has halved. Based on equation 13 from the BHR paper, this implies that the burden heritability will be one half of the total heritability. As you can see, this decrease in burden heritability will become stronger as more null alleles are included. This is what is happening when you are including many protein-coding SNPs with effect sizes lower than the PTVs.

This phenomenon is the rationale for running BHR separately on different variant types.

Let me know if that makes sense! Others have/will have the same question, so I want to make sure we have a good answer here.

CharleyXia commented 1 year ago

Hi ajaynadig,

Thanks! That makes perfect sense. According to tihs, can you also check whether I said below is correct?

  1. I shouldn't compre h2_burden explained by all SNPs to h2_burden expalined by a subset of SNPs (e.g. PTVs or missense) becasue the h2_burden explained by all SNPs can be smaller than h2_burden expalined by a subset of SNPs. Similary, I shouldn't compre h2_burden explained by all SNPs with 0<MAF<10^4 to h2_burden expalined by all SNPs with 0<MAF<10^5 becasue the h2_burden explained by all SNPs with MAF<10^4 can be smaller than h2_burden expalined by all SNPs with MAF<10^5 eventhough the latter is a subset of ther fomer due to the same reason.

  2. I can comapre h2_burden expained by a subset of SNPs (e.g. PTVs) against h2_burden expalined by another subset of SNPs (e.g. missense). By doing so, I can tell which subset expalined more heritability.

  3. Can I add h2_burden_PTVs + h2_burden_missense together? Does the sum of them means the total h2_burden expalined by (PTVs + missense)? Or I shouldn't, because h2_burden estiamtes are non addable? Similary, can I add h2_burden_SNPs_with_10^-5<MAF<10^-4 + h2_burden_SNPs_with_0<MAF<10^-5 together to get h2_burden expalined by SNPs with MAF < 10^-4?

Thanks for your time.

ajaynadig commented 1 year ago

Hi Charley, Glad the previous answer made sense--here are answers to your questions:

  1. Yes, comparing burden h2 of sets of genes to burden h2 of subsets of that set is not a good idea for the reason you mention. Rather, you may consider comparing disjoint sets of variants (e.g. comparing burden h2 of variants with 0 < MAF < 10^-5 with burden h2 of variants with 10^-5 < MAF < 10 ^ -4. This is the style of analysis that we did in the primary BHR manuscript.

  2. Yes, although the interpretation of the difference may have some caveats. For the comparison you mention, the PTV burden heritability is probably close to the PTV total heritability because PTVs have very similar effects. However, missense variants have variable effects, and so missense burden heritability is some fraction of missense total heritability. Therefore, a difference can arise from a difference in total heritability, or a difference in (burden heritability / total heritability). Precisely, the thing you can answer is "which subset has higher burden heritability", not "which subset has more heritability".

  3. Yes, you can definitely add burden heritabilities between disjoint sets of variants (PTV/missense or across different MAF bins), and the sum of them does represent the total burden heritability. This is the analysis we do in the main BHR manuscript to get the "total burden heritability". If you look at Figure 1C, the y axis represents the sum of burden heritabilities across both functional categories (missense/PTV) and MAF bins.

Let me know if you have additional questions!

CharleyXia commented 1 year ago

Thanks! It's very helpful! I think I am done here~! Thanks again.

ajaynadig commented 1 year ago

Sounds good, feel free to reopen if additional issues arise.