ajaynadig / bhr

Suite of heritability and genetic correlation estimation tools for exome-sequencing data
MIT License

Error while running BHR #1

Closed · bibb closed this issue 1 year ago

bibb commented 1 year ago

Hello,

I'm trying to run BHR in a bivariate model. I have followed the instructions in the wiki and the examples, but when I run it on my data I get this error:

> BHR(mode = "bivariate", 
+     trait1_sumstats = sumstats_1,
+     trait2_sumstats = sumstats_2,
+       annotations = baseline)
Burden Heritability Regression
Daniel Weiner, Ajay Nadig, and Luke O'Connor, 2022
Running BHR at 2022-12-24 14:15:27
For trait 1, MAF ranges from 2e-05 to 0.01
MAF ranges over more than two orders of magnitude for trait 1. We recommend splitting into finer MAF bins. See documentation
For trait 1, MAF ranges from 2e-05 to 0.01
MAF ranges over more than two orders of magnitude for trait 2. We recommend splitting into finer MAF bins. See documentation
Error in UseMethod("inner_join") : 
  no applicable method for 'inner_join' applied to an object of class "factor"

I tried googling the error but couldn't find anything useful. I checked that all the data was loaded correctly and that it follows the same format as the one presented in the "preparing input files" section. I would appreciate it if you could help me sort this out.

Best,

Bernabe

ajaynadig commented 1 year ago

Hi Bernabe, Thank you for using BHR!

I think the immediate issue is probably that you need to wrap `baseline` in a list, i.e. replace `baseline` with `list(baseline)`. We structured the command this way so that users can easily add additional annotations. Does the command work when you try this?
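For reference, the call with the annotations wrapped in a list would look like this (using the same objects from your call above):

BHR(mode = "bivariate",
    trait1_sumstats = sumstats_1,
    trait2_sumstats = sumstats_2,
    annotations = list(baseline))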

Secondly, as noted in the command output, it seems like you are running BHR on variants from a wide allele frequency range. BHR can produce biased estimates when it is run on such a wide range. Additionally, we have found that at allele frequencies above 0.001 there is both very little signal and a substantial amount of bias, perhaps due to linkage disequilibrium.

Practically, based on the output you pasted, I would recommend that you divide your variants into two bins, 1e-5 - 1e-4 and 1e-4 - 1e-3, and run BHR separately on those two bins. I suspect that both bins will give you similar results, but that the rarer bin will have better powered estimates (e.g. lower SE).
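For instance, a minimal sketch of that split, assuming your summary-statistics data frames contain a per-variant AF column (adjust the column name if yours differs):

#split one trait's summary statistics into two allele-frequency bins
sumstats_1_bin1 <- subset(sumstats_1, AF >= 1e-5 & AF < 1e-4)
sumstats_1_bin2 <- subset(sumstats_1, AF >= 1e-4 & AF < 1e-3)
#repeat for sumstats_2, then run BHR separately on each bin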

Please feel free to follow up with additional questions--we are eager to troubleshoot BHR and make it user friendly.

bibb commented 1 year ago

Thank you so much for your quick answer, I appreciate it. I changed `baseline` to `list(baseline)` and it worked, and I also filtered the variants so that I now have AF < 0.001. With that, I no longer get the warning about the AFs being too different. Now I have a different problem; please see the output below:

> BHR(mode = "bivariate", 
+     trait1_sumstats = sumstats_1,
+     trait2_sumstats = sumstats_2,
+       annotations = list(baseline))
Burden Heritability Regression
Daniel Weiner, Ajay Nadig, and Luke O'Connor, 2022
Running BHR at 2022-12-26 10:09:19
For trait 1, MAF ranges from 2e-05 to 0.00099
...seems reasonable
For trait 1, MAF ranges from 2e-05 to 0.00099
...seems reasonable
Lambda GC:  74.9638785885296
Lambda GC seems very high. Please check input columns against documentation
14002 genes in the BHR regression, 9538 of which are significant fixed effects
A large fraction of genes are exome-wide significant. This may indicate that effect sizes and/or allele frequencies/variances have been miscomputed
Error in randomeffects_jackknife(sumstats_sub, merged_annotations, num_blocks,  : 
  argument "intercept" is missing, with no default

I'm not sure if the error is due to the large lambda GC or something else. Also, my analysis includes a lot of singleton sites, where I have an AC of 1 in cases and 0 in controls, so the betas are huge for them. Would you recommend restricting my analyses to non-singleton variants so that I have more reasonable betas? That would delete most of my variants, though... Thanks again!

ajaynadig commented 1 year ago

Hi Bernabe,

  1. The "intercept" error is due to a bug in the code; I have fixed the issue and pushed updated code, so the script should work if you re-download BHR. Thanks for pointing out this issue.
  2. The high lambda GC, and the fact that 9538/14002 genes are marked as significant, make me think that the betas are not in the format that BHR is expecting. How are you computing the betas? For an example of how to calculate betas in the correct format (per-allele) from case and control allele counts, see the example in the wiki; the relevant section is reproduced below. There should be no need to remove singletons--in fact, singletons are where the majority of the signal is for BHR.
#Compute sample prevalence, will be used to compute per-sd beta
prevalence = n_cases/(n_cases + n_controls)

#Compute variant MAF in cases, will be used to compute per-sd beta
table$AF_case = table$ac_case/(2*n_cases)

#Compute variant MAF overall, will be used to compute per-sd beta
table$AF = (table$ac_case + table$ac_ctrl)/(2*(n_cases + n_controls))

#Calculate per-sd betas
table$beta = (2 * (table$AF_case - table$AF) * prevalence)/sqrt(2 * table$AF * (1 - table$AF) * prevalence * (1 - prevalence))

#Calculate variant variances
table$twopq = 2*table$AF * (1 - table$AF)

#Convert betas from per-sd (i.e. sqrt(variance explained)) to per-allele (i.e. in units of phenotype) betas.
#Per-allele betas are the usual betas reported by an exome-wide association study.
table$beta_perallele = table$beta/sqrt(table$twopq)

"beta_perallele" is what should be in the beta column for BHR input.

Let me know if you have any additional issues, and thanks for bearing with us on these issues, as we just made major updates to the code based on manuscript revisions.

bibb commented 1 year ago

Thank you again. I downloaded the package again and now it's running without errors so far. Regarding the per-allele beta, I did the calculations as in your example. Here are some lines of the "master table" that I made before formatting things for BHR:

| gene | chr | varID | ac_cases | ac_controls | N | AF | AF_cases | Prevalence | beta_per-sd | twopq | beta_per-allele |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ENSG00000000419 | 20 | 20_49551767_T_C | 1 | 1 | 24854 | 4.0235e-05 | 0.000706215 | 0.0284864 | 0.025426 | 8.04668e-05 | 2.83446 |
| ENSG00000000419 | 20 | 20_49558650_C_T | 1 | 0 | 24854 | 2.01175e-05 | 0.000706215 | 0.0284864 | 0.0370435 | 4.02342e-05 | 5.84002 |
| ENSG00000000419 | 20 | 20_49575021_G_A | 8 | 103 | 24854 | 0.00223304 | 0.00564972 | 0.0284864 | 0.0175287 | 0.00445611 | 0.262586 |
| ENSG00000000457 | 1 | 1_169823529_T_G | 1 | 0 | 24854 | 2.01175e-05 | 0.000706215 | 0.0284864 | 0.0370435 | 4.02342e-05 | 5.84002 |
| ENSG00000000457 | 1 | 1_169831801_C_T | 1 | 0 | 24854 | 2.01175e-05 | 0.000706215 | 0.0284864 | 0.0370435 | 4.02342e-05 | 5.84002 |
| ENSG00000000457 | 1 | 1_169831884_C_T | 1 | 122 | 24854 | 0.00247445 | 0.000706215 | 0.0284864 | -0.00861882 | 0.00493665 | -0.122668 |
| ENSG00000000457 | 1 | 1_169845195_A_G | 2 | 110 | 24854 | 0.00225316 | 0.00141243 | 0.0284864 | -0.00429398 | 0.00449617 | -0.0640381 |
| ENSG00000000460 | 1 | 1_169677786_AAGGC_A | 1 | 0 | 24854 | 2.01175e-05 | 0.000706215 | 0.0284864 | 0.0370435 | 4.02342e-05 | 5.84002 |

The beta_per-allele values are big when I have a single count in cases and none or very few in controls. I should mention that the sample size is 708 cases and 24,146 controls for this phenotype. Please let me know if there are any mistakes in my calculations. BHR has been running for a while now; I'll let you know once it finishes.

bibb commented 1 year ago

I had a problem in the data and I fixed it: I was filtering out all variants with a count of 0 in cases, and that's why I was getting massive lambda GC values. Now they seem better, but still high. According to the results, I got a genetic correlation (`$rg$rg_mixed`) between my two traits of 0.9605055, with an SE (`$rg$rg_mixed_se`) of 0.2156036. Now I wanted to convert it to the liability scale using your method, but it seems that it's only applicable to a single trait in the Bpex example, not to the bivariate results.

> output_name <- BHR(mode = "bivariate", 
+     trait1_sumstats = sumstats_1,
+     trait2_sumstats = sumstats_2,
+       annotations = list(baseline))
Burden Heritability Regression
Daniel Weiner, Ajay Nadig, and Luke O'Connor, 2022
Running BHR at 2022-12-26 13:42:14
For trait 1, MAF ranges from 7.2e-05 to 0.00093
...seems reasonable
For trait 1, MAF ranges from 7.2e-05 to 0.00093
...seems reasonable
Lambda GC:  3.20422812894794
Lambda GC seems very high. Please check input columns against documentation
13722 genes in the BHR regression, 17 of which are significant fixed effects
...seems reasonable
Lambda GC:  1.07937058701794
...seems reasonable
13722 genes in the BHR regression, 17 of which are significant fixed effects
...seems reasonable
BHR finished at 2022-12-26 13:42:30

Also, I got a list of 21 significant genes from `$rg$sig_genes`. Does this mean that those genes are the ones that contribute significantly to the genetic correlation between the two traits?

Thanks again for your help!

Bernabe

ajaynadig commented 1 year ago

Hi Bernabe, Glad to hear that the function ran with those changes!

Re: liability scale, the liability-scale genetic correlation is equal to the observed-scale genetic correlation, so there is no need to do any such conversion. Intuitively, this is because the liability-scale conversion factors "cancel out" when computing genetic correlation. A more formal demonstration can be found in the supplement of Bulik-Sullivan, Finucane et al., 2015, Nature Genetics, section 1.4, "Ascertained Studies of Liability Threshold Traits".
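As a rough sketch of why the factors cancel (writing $c_i$ for trait $i$'s observed-to-liability conversion factor), under the standard conversion the heritabilities and the genetic covariance scale as

$$
h^2_{\text{liab},i} = c_i\, h^2_{\text{obs},i}, \qquad
\rho_{g,\text{liab}} = \sqrt{c_1 c_2}\;\rho_{g,\text{obs}},
$$

so the factors drop out of the ratio:

$$
r_{g,\text{liab}}
= \frac{\rho_{g,\text{liab}}}{\sqrt{h^2_{\text{liab},1}\, h^2_{\text{liab},2}}}
= \frac{\sqrt{c_1 c_2}\;\rho_{g,\text{obs}}}{\sqrt{c_1 h^2_{\text{obs},1}\, c_2 h^2_{\text{obs},2}}}
= \frac{\rho_{g,\text{obs}}}{\sqrt{h^2_{\text{obs},1}\, h^2_{\text{obs},2}}}
= r_{g,\text{obs}}.
$$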

Re: "sig_genes", those 21 genes are the union of the significant genes for trait 1 and the significant genes for trait 2, which are then modelled as fixed effects in computing the genetic correlation. So, these are the "univariate" significant genes for each trait, not genes that contribute significantly to the genetic correlation. So, for example, if trait 1 had 10 significant genes, trait 2 had 0 significant genes, and trait 1 and trait 2 had a genetic correlation of 0, BHR would still model those 10 genes as "significant" fixed effects.

Lastly, that lambda GC warning is probably a result of the low sample size and may not be a major cause for alarm, now that the estimate is more realistic than it was previously. We may update that warning in the future to be less strict.

Hope that helps, and let me know if you have any other questions--this is very helpful feedback as we fine-tune the package.

bibb commented 1 year ago

Thanks again for your reply. Now I realize that I was trying to get a genetic correlation on the liability scale, which is pointless, because it is just a correlation, not a heritability value per se. Now I'm trying to interpret my results: an rg value of 0.96 seems like a high correlation for the two traits I tested, but I'm not sure if I can call it significant. For instance, with a Pearson's correlation we get the coefficient and a p-value. Is it possible to obtain a p-value for the genetic correlation here?

Re: significant genes: got it. For my traits I see that the significant genes are the same:

$trait1$significant_genes$significant_genes
 [1] "ENSG00000188641" "ENSG00000094963" "ENSG00000180228" "ENSG00000115415"
 [5] "ENSG00000177879" "ENSG00000157764" "ENSG00000253729" "ENSG00000070756"
 [9] "ENSG00000087053" "ENSG00000189182" "ENSG00000092445" "ENSG00000140479"
[13] "ENSG00000167476" "ENSG00000104904" "ENSG00000188305" "ENSG00000090006"
[17] "ENSG00000128408"
$trait2$significant_genes$significant_genes
 [1] "ENSG00000188641" "ENSG00000094963" "ENSG00000180228" "ENSG00000115415"
 [5] "ENSG00000177879" "ENSG00000157764" "ENSG00000253729" "ENSG00000070756"
 [9] "ENSG00000087053" "ENSG00000189182" "ENSG00000092445" "ENSG00000140479"
[13] "ENSG00000167476" "ENSG00000104904" "ENSG00000188305" "ENSG00000090006"
[17] "ENSG00000128408"

But then for the regression the list is bigger:

$rg$sig_genes
 [1] "ENSG00000070756" "ENSG00000087053" "ENSG00000090006" "ENSG00000092445"
 [5] "ENSG00000094963" "ENSG00000104904" "ENSG00000115415" "ENSG00000128408"
 [9] "ENSG00000140479" "ENSG00000146839" "ENSG00000157764" "ENSG00000167476"
[13] "ENSG00000177879" "ENSG00000180228" "ENSG00000188305" "ENSG00000188641"
[17] "ENSG00000189182" "ENSG00000223960" "ENSG00000225243" "ENSG00000253729"
[21] "ENSG00000257700"

Here I was expecting to see the same list of 17 genes, but I now have 21. Also, in `$trait1$significant_genes$sig_table` and `$trait2$significant_genes$sig_table` I see that the column number_variants is 0 for all the listed genes. Is this some kind of error? I still have values for the rest of the columns.

I hope this makes sense; otherwise please let me know.

ajaynadig commented 1 year ago

Hi Bernabe,

> Now I'm trying to interpret my results: an rg value of 0.96 seems like a high correlation for the two traits I tested, but I'm not sure if I can call it significant. For instance, with a Pearson's correlation we get the coefficient and a p-value. Is it possible to obtain a p-value for the genetic correlation here?

Without knowing more about your scientific question, I am not sure whether 0.96 is a plausible number. That being said:

  1. We have observed that burden genetic correlations are often higher than common-variant genetic correlations; see Figure 5 of our manuscript.
  2. You previously mentioned that your estimate was 0.96, with an SE of 0.22. This means that the 95% confidence interval is quite wide, including values as low as 0.53. These wide error bars are likely due to low sample size.
  3. The package does not currently compute p-values; this is something we could explore in the future. However, you can get a basic p-value by using the estimate and SE to calculate a z-score:

pnorm(0.96/0.22, lower.tail = FALSE)
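For example, with the estimates you reported above (note that the one-liner above is a one-sided p-value; doubling it gives the usual two-sided version):

#using the rg and SE values from your earlier message
rg    <- 0.9605055
rg_se <- 0.2156036
z <- rg / rg_se
#one-sided p-value for rg > 0
pnorm(z, lower.tail = FALSE)
#two-sided p-value
2 * pnorm(abs(z), lower.tail = FALSE)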

> Also, in `$trait1$significant_genes$sig_table` and `$trait2$significant_genes$sig_table` I see that the column number_variants is 0 for all the listed genes. Is this some kind of error? I still have values for the rest of the columns.

This is a new feature, and the 0s you were observing were due to a bug. I have now corrected this; you should be able to see the number of variants if you re-download the package. Thanks for pointing this out!

> Here I was expecting to see the same list of 17 genes, but I now have 21.

I am less sure what is happening here, as the script uses the same list of significant genes for individual trait heritability and trait pair rg. This type of thing may happen when the sample size of one dataset is much larger than the other, so that there are "significant" genes for one trait that have no variants in the other trait dataset. I have made a small change that may fix this issue--could you re-download the package and give it a try?

If that doesn't work, for the four genes that are in the rg significant genes list but not the individual trait lists, can you please check the counts of variants in those genes for the individual traits?

Thanks!

bibb commented 1 year ago

Hi, thank you for the latest modifications. Now I'm able to see the number of variants per gene in the list of significant genes. Thank you also for the comments about interpreting the results. I think I'll move cautiously with my current results and see if I'm able to replicate the regression using independent datasets.

Bernabe

ajaynadig commented 1 year ago

Great! Thank you for the help in identifying bugs in the package.

bibb commented 1 year ago

Hi,

I have noticed something related to the exome-wide significant associations found by BHR. I have a dataset where BHR finds ~2500 significant genes and calculates a very high lambda (lambda GC = 18). I noticed that those genes actually have an unadjusted burden p < 0.05 when I run a Fisher's exact test on the collapsed allele counts. So I wonder whether the software is really applying exome-wide significance, which to my understanding is p ~ 2.1x10-6 after a Bonferroni correction for 23,000 human genes, or whether it is treating a nominal p < 0.05, without any correction, as exome-wide significant. Thanks.

ajaynadig commented 1 year ago

Hi Bernabe, A lambda GC of 18 strongly suggests that the summary statistics are misspecified--is this dataset processed similarly to the previous datasets you analyzed, which had much smaller lambda GC? I recommend double checking that the betas are defined correctly against the documentation, and that there is no gross confounding (e.g. due to ancestry) in your dataset. You may consider using a package like SAIGE to calculate variant-wise effect sizes, which should address both of these issues.

The significance test implemented in BHR is a chi-squared test with a Bonferroni threshold (e.g. p < 0.05/num_genes). I will note that we do not claim that this is the optimal approach for identifying significant genes; it is just the simplest approach. If you have a better method for identifying significant genes, you can pre-specify a list of significant genes using the fixed_genes flag; see Documentation.
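To illustrate the kind of threshold being applied, here is a rough sketch (this is not BHR's internal code; the burden effect, SE, and gene count below are made-up illustrative numbers):

num_genes <- 14002        #e.g. the number of genes in your earlier BHR regression output
burden_beta <- 0.12       #hypothetical per-gene burden effect estimate
burden_se <- 0.02         #hypothetical standard error
chisq_stat <- (burden_beta / burden_se)^2                 #1-df chi-squared statistic
p_gene <- pchisq(chisq_stat, df = 1, lower.tail = FALSE)  #per-gene p-value
p_gene < 0.05 / num_genes                                 #TRUE if exome-wide significant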

Hope that helps!

ajaynadig commented 1 year ago

Hi Bernabe, Since it's been a while since I last wrote, I'm going to close this issue for now. Please feel free to comment again if you have any other Qs.