ajaynadig / bhr

Suite of heritability and genetic correlation estimation tools for exome-sequencing data
MIT License
31 stars 6 forks source link

Finding out driver gene and it's contribution through the aggregate function #9

Closed oalavijeh closed 1 year ago

oalavijeh commented 1 year ago

Firstly, thanks for the great tool! Once I got the hang of it it's very intuitive and the tutorials are super helpful.

I have run BHR for a trait of interest across three MAF bins for pLOF variation (MAF 0-0.01 across three bins as per your python script) using genebass summary stats.

Previously I have run SAIGE-GENE+ and discovered a single strongly associated gene with my phenotype.

BHR in aggregate mode says my liability adjusted h2 is 14%

Adding in the gene of interest as a fixed gene using the "fixed_gene" tag reduces that to 4%. In effect is this saying the gene makes a 10% contribution to the pLOF rare variant heritability of the phenotype?

In aggregate mode there doesn't seem to be another way of ascertaining individual genes contributions to the h2 is that correct?

Many thanks again!

ajaynadig commented 1 year ago

Hi oalavijeh,

Thanks for your kind words about our package.

Adding a "fixed_gene" does not remove the gene from heritability calculation--it only models that gene as a fixed effect rather than as a random effect. Additionally, 14% is a very high number, and is almost certainly inflated unless you are studying an exceptional trait. Here's what I suspect is happening:

  1. For some reason, when you run BHR without the fixed gene specified, BHR is identifying many spurious significant associations with its simple chi-squared test, which it then models as fixed effects
  2. When there are a large number of fixed effects, h2 is biased upwards (see our extended simulation figures from the manuscript), because fixed effects do not have the same controls for confounding as our random effects model
  3. When you specify the significant gene, BHR no longer includes those other spurious significant genes as fixed effects, decreasing the heritability estimate

There are a few ways you can confirm my assessment:

  1. How many "significant genes" are there in the model you run without setting "fixed_gene" to the SAIGE-GENE+ gene? If that number is very large, something like the above is probably going on.
  2. What is the burden heritability for synonymous variants in both of your analyses? I noticed that in your previous post here, you did not include synonymous variants when pulling summary statistics from genebass. In general, we strongly recommend performing BHR in synonymous variants alongside your analyses of pLoF and missense variants. This is a helpful negative control to identify cases where your estimates are inflated. If BHR outputs significant synonymous burden heritability for the analysis without the specified "fixed_gene", my assessment is probably correct.

All in all: I suspect that the true burden h2 number is 4%, and that 14% is an inflated estimate.

@danjweiner @lukejoconnor flagging this post, as it appears that the simple chi-sq test implemented in BHR is maybe causing more harm than help in the low-power setting. Perhaps we should consider some built in criteria where BHR decides fixed effects based on a combination of significance and effect size, rather than just significance?

oalavijeh commented 1 year ago

Dear Ajay,

Many thanks for the prompt response. I thought that was a high h2 (although our disease does have a prevalence of 0.1). In answer to your questions:

  1. I cannot see significant genes as an output when running the aggregate function but in the three MAF bins if running BHR individually there are 47,14 and 6 genes respectively from rarest to least rare bin (the same as your genebass.py script).
  2. The synonymous burden I will work out once I download the files. As alluded in my other post it has taken nearly 24hrs just to get the pLOF files let alone missense or synonymous for two phenotypes. I did add in the genomewide_correction tag as TRUE as there is evidence of rare gene enrichment in genebass for our phenotype (https://app.genebass.org/gene/undefined/phenotype/icd_first_occurrence-132036-both_sexes--?resultIndex=gene-manhattan&resultLayout=full&burdenSet=synonymous).

I have posted the first model output of these analyses (without the fixed effect) below if helpful:

usd_ptv_bhr1
$mixed_model
$mixed_model$heritabilities
      baseline_oe1_total5 baseline_oe2_total5 baseline_oe3_total5 baseline_oe4_total5        total
h2           0.0023128230        0.0011337391        0.0008576319        0.0006180671 0.0056753647
h2_se        0.0002989583        0.0002297831        0.0002155984        0.0001913720 0.0005185245

$mixed_model$fractions
               baseline_oe1_total5 baseline_oe2_total5 baseline_oe3_total5 baseline_oe4_total5
fraction_h2              0.4075197          0.19976498           0.1511148          0.10890351
fraction_h2_se           0.1168481          0.09910609           0.1031140          0.09091099

$mixed_model$fraction_burden_score
     baseline_oe1_total5 baseline_oe2_total5 baseline_oe3_total5 baseline_oe4_total5
[1,]           0.0957397           0.1965755           0.2485476           0.2578604

$mixed_model$enrichments
              baseline_oe1_total5 baseline_oe2_total5 baseline_oe3_total5 baseline_oe4_total5
enrichments              4.256538            1.016225           0.6079917           0.4223351
enrichment_se            1.220477            0.504163           0.4148663           0.3525589

$significant_genes
$significant_genes$number_significant_genes
[1] 47

$significant_genes$significant_genes
 [1] "ENSG00000164008" "ENSG00000116459" "ENSG00000162739" "ENSG00000213639" "ENSG00000119865" "ENSG00000115524"
 [7] "ENSG00000079246" "ENSG00000173585" "ENSG00000114124" "ENSG00000138663" "ENSG00000169836" "ENSG00000145592"
[13] "ENSG00000184349" "ENSG00000092421" "ENSG00000015479" "ENSG00000011426" "ENSG00000179869" "ENSG00000120162"
[19] "ENSG00000196116" "ENSG00000136810" "ENSG00000204001" "ENSG00000107485" "ENSG00000165424" "ENSG00000133895"
[25] "ENSG00000173120" "ENSG00000137509" "ENSG00000139197" "ENSG00000134532" "ENSG00000139174" "ENSG00000135108"
[31] "ENSG00000133083" "ENSG00000169926" "ENSG00000215252" "ENSG00000141002" "ENSG00000171928" "ENSG00000121067"
[37] "ENSG00000141232" "ENSG00000188629" "ENSG00000248099" "ENSG00000245848" "ENSG00000213921" "ENSG00000124466"
[43] "ENSG00000063176" "ENSG00000125844" "ENSG00000171858" "ENSG00000213923" "ENSG00000100226"

$significant_genes$fraction_burdenh2_significant
          [,1]
[1,] 0.7014219

$significant_genes$fraction_burdenh2_significant_se
           [,1]
[1,] 0.05766983

$significant_genes$sig_table
                           gene varexplained cumulative_frequency number_variants mixed_burden_h2 fraction_burdenh2
ENSG00000164008 ENSG00000164008 6.060234e-05         2.532912e-05              10     0.005675365       0.010678140
ENSG00000116459 ENSG00000116459 7.882765e-05         5.194400e-05              20     0.005675365       0.013889443
ENSG00000162739 ENSG00000162739 7.020673e-05         3.800232e-05              16     0.005675365       0.012370435
ENSG00000213639 ENSG00000213639 7.968154e-05         2.566903e-06               2     0.005675365       0.014039897
ENSG00000119865 ENSG00000119865 1.238283e-04         1.392976e-05               7     0.005675365       0.021818555
ENSG00000115524 ENSG00000115524 2.176031e-04         3.799310e-06               3     0.005675365       0.038341689
ENSG00000079246 ENSG00000079246 7.164058e-05         5.574870e-05              21     0.005675365       0.012623079
ENSG00000173585 ENSG00000173585 1.651197e-04         1.266300e-06               1     0.005675365       0.029094110
ENSG00000114124 ENSG00000114124 8.601691e-05         6.711975e-05              17     0.005675365       0.015156191
ENSG00000138663 ENSG00000138663 5.472801e-05         3.864810e-06               3     0.005675365       0.009643083
ENSG00000169836 ENSG00000169836 1.007589e-04         5.952185e-05              18     0.005675365       0.017753727
ENSG00000145592 ENSG00000145592 8.620719e-05         8.864535e-06               3     0.005675365       0.015189718
ENSG00000184349 ENSG00000184349 7.934649e-05         2.537903e-06               2     0.005675365       0.013980861
ENSG00000092421 ENSG00000092421 5.844791e-05         6.462572e-05              34     0.005675365       0.010298529
ENSG00000015479 ENSG00000015479 8.626547e-05         8.867535e-06               3     0.005675365       0.015199986
ENSG00000011426 ENSG00000011426 7.101548e-05         1.046199e-04              54     0.005675365       0.012512936
ENSG00000179869 ENSG00000179869 5.442680e-05         8.103410e-04             296     0.005675365       0.009590010
ENSG00000120162 ENSG00000120162 5.672512e-05         1.266342e-05               6     0.005675365       0.009994973
ENSG00000196116 ENSG00000196116 7.802981e-05         7.222316e-05              35     0.005675365       0.013748862
ENSG00000136810 ENSG00000136810 8.604805e-05         8.895965e-06               6     0.005675365       0.015161678
ENSG00000204001 ENSG00000204001 5.673811e-05         1.266363e-05               7     0.005675365       0.009997263
ENSG00000107485 ENSG00000107485 5.870283e-05         4.486709e-06               2     0.005675365       0.010343446
ENSG00000165424 ENSG00000165424 2.842958e-04         1.139682e-05               9     0.005675365       0.050092948
ENSG00000133895 ENSG00000133895 6.480644e-05         1.143472e-05               9     0.005675365       0.011418903
ENSG00000173120 ENSG00000173120 7.859110e-05         2.539503e-06               2     0.005675365       0.013847762
ENSG00000137509 ENSG00000137509 7.088531e-05         7.735786e-05              32     0.005675365       0.012490000
ENSG00000139197 ENSG00000139197 9.047580e-05         6.462647e-05              34     0.005675365       0.015941847
ENSG00000134532 ENSG00000134532 8.619208e-05         8.899861e-06               5     0.005675365       0.015187055
ENSG00000139174 ENSG00000139174 5.693653e-05         2.659385e-05              16     0.005675365       0.010032224
ENSG00000135108 ENSG00000135108 5.693781e-05         2.659446e-05              18     0.005675365       0.010032450
ENSG00000133083 ENSG00000133083 7.391311e-05         1.013128e-05               6     0.005675365       0.013023500
ENSG00000169926 ENSG00000169926 7.887118e-05         2.533203e-06               2     0.005675365       0.013897112
ENSG00000215252 ENSG00000215252 9.272461e-05         2.965800e-06               1     0.005675365       0.016338089
ENSG00000141002 ENSG00000141002 6.002687e-05         6.337289e-05              30     0.005675365       0.010576742
ENSG00000171928 ENSG00000171928 6.065693e-05         2.534019e-05               7     0.005675365       0.010687760
ENSG00000121067 ENSG00000121067 6.437543e-05         1.139801e-05               6     0.005675365       0.011342960
ENSG00000141232 ENSG00000141232 7.903725e-05         2.539203e-06               2     0.005675365       0.013926373
ENSG00000188629 ENSG00000188629 7.986945e-05         2.026158e-05              14     0.005675365       0.014073008
ENSG00000248099 ENSG00000248099 5.269888e-05         2.965414e-05              11     0.005675365       0.009285549
ENSG00000245848 ENSG00000245848 7.882546e-05         2.532603e-06               2     0.005675365       0.013889057
ENSG00000213921 ENSG00000213921 7.836252e-05         2.549003e-06               2     0.005675365       0.013807487
ENSG00000124466 ENSG00000124466 5.360315e-05         2.786687e-05              11     0.005675365       0.009444881
ENSG00000063176 ENSG00000063176 6.969098e-05         2.290005e-05              10     0.005675365       0.012279560
ENSG00000125844 ENSG00000125844 7.625787e-05         1.469968e-04              78     0.005675365       0.013436647
ENSG00000171858 ENSG00000171858 7.882166e-05         2.533103e-06               2     0.005675365       0.013888388
ENSG00000213923 ENSG00000213923 5.359699e-05         2.786480e-05              16     0.005675365       0.009443796
ENSG00000100226 ENSG00000100226 6.463727e-05         2.406884e-05              16     0.005675365       0.011389094

$qc
$qc$intercept
[1] 2.441873e-06

$qc$intercept_se
[1] 4.225219e-08

$qc$attenuation_ratio
[1] -0.03360636

$qc$attenuation_ratio_se
           [,1]
[1,] 0.03307114

$qc$lambda_gc
      50% 
0.8200545 

$qc$lambda_gc_se
[1] 0.03962933

$qc$mu_genome
[1] 0.001683624

$qc$mean_gammasq
[1] 2.701626e-06

$qc$mean_burdenscore
[1] 0.0001157124

$log
$log$mode
[1] "univariate"

$log$num_blocks
[1] 100

$log$genomewide_correction
[1] FALSE

$log$gwc_exclusion
NULL

$log$fixed_genes
NULL

$log$output_jackknife_h2
[1] FALSE

$log$output_jackknife_rg
[1] FALSE

$log$overdispersion
[1] FALSE

$log$all_models
[1] FALSE

$log$slope_correction
NULL

$log$num_null_conditions
[1] 5

$log$custom_weights
[1] FALSE

$log$null_stats
[1] FALSE

$log$intercept
[1] TRUE

$log$custom_variant_variances
[1] FALSE
ajaynadig commented 1 year ago

Hi oalavijeh,

  1. Based on the N for your trait of interest (~7.5k cases), I am skeptical that there are that many significant genes (unless your trait of interest has an exceptional genetic architecture). If you have separately run SAIGE-GENE+, I would recommend setting fixed_gene to the list of SAIGE-GENE+ significant genes for your main analyses.
  2. Great, once you have these sumstats you can see whether it is the case that you observe significant synonymous burden h2 and confirm that the high 14% estimate was due to inflation.

Dan should be able to respond to your query about runtime soon, but as a quick tip, Hail can be much faster when run in a cloud environment with access to many workers. We typically run these scripts on the google cloud platform. See these tutorials:

https://github.com/danking/hail-cloud-docs/blob/master/how-to-cloud.md https://github.com/danking/hail-cloud-docs/blob/master/how-to-cloud-with-hail.md

ajaynadig commented 1 year ago

Just noting here that after discussion with @lukejoconnor, we are going to try implementing a fisher's exact test rather than a chi squared test in BHR to identify significant genes in case/control designs, as this should be more stable at low N than the chi squared test. This should alleviate @oalavijeh issue of inflated burden h2 estimates due to spurious significant genes.

oalavijeh commented 1 year ago

That would be a great addition thank you! I will leave this open until the synonymous data analysed.

oalavijeh commented 1 year ago

@ajaynadig as a follow-up:

I want to know an individual genes contribution to the total BHR estimate. I row binded a number of significant gene significance tables and honed it down for my gene of interest giving me a table like so:

image

Would summing the column "fraction-burdenh2" give me this genes contribution to the total BHR estimate?

Many thanks

oalavijeh commented 1 year ago

Dear Ajay,

The synonymous contribution is 0.1% when liability adjusted and adjusted for known effects. So I guess it was a little inflated? If I don't used fixed_genes the liability adjusted inflation is 3%.

ajaynadig commented 1 year ago

Hi Omid, Apologies for the hassle in getting complex output from the aggregate functions. This is something we want to improve in the package in the medium term (@danjweiner a "significant genes table" output seems to desirable for the aggregate functions).

1) Summing fraction_burden h2 is not the correct approach to estimating this fraction. Rather, you should sum the var_explained, then sum the mixed_burden_h2, and then divide the var_explained sum by the mixed_burden_h2 sum.

2) Is the synonymous burden h2 significantly different from zero? (i.e. is h2_syn/h2_syn_SE > 1.96?) If the synonymous burden h2 is nonsignificant, that is good news.

oalavijeh commented 1 year ago

Brilliant many thanks for the help. Yes it's non significant which is good news. Look forward to the updates proposed!