Unreasonable output for genetic correlation and warnings.

Asukayj commented 1 year ago

Dear BHR team,

I am now trying to estimate the genetic correlations across traits and encounter some problems regarding the output. I want to estimate the genetic correlations between 100 phenotype pairs for pLoFs and here is the code.

d_sum_stat<-fread('d_step2.regenie',header=TRUE)
for (i in 1:100){
  v_sum_stat<-fread(file=sprintf('v_step2_p%i.regenie',i),header=TRUE)
  sumstats_1<-make_sum_stat_input(v_sum_stat,gene_plof,'v',set_lof_del_missense)
  sumstats_1<-sumstats_1[which(sumstats_1$AF<1e-04),]
  sumstats_2<-make_sum_stat_input(d_sum_stat,gene_plof,'d',set_lof_del_missense)
  sumstats_2<-sumstats_2[which(sumstats_2$AF<1e-04),]

  vd_cor<-BHR(mode = "bivariate",
                      trait1_sumstats = sumstats_1,
                      trait2_sumstats = sumstats_2,
                      annotations = list(baseline))
  cor[i]<-vd_cor$rg$rg_mixed
  se[i]<-vd_cor$rg$rg_mixed_se
}
warnings()
result_full<-cbind(cor,se)

Below is the result (first 30 pairs for example).

cor se
0.14163704064036142 1.0181431429891998
0.5938065480805605 1.3387596506891408
-0.823315573993914 1.849941797056604
NA NA
1.5804725225475114 3.215072771329553
-2.4679767145625466 4.790055611280559
0.7945857014366327 1.4334269509774815
-1.2955636066645622 2.917031758391116
NA NA
0.35441703483640163 0.7421794593638683
-0.15582348965590792 0.7253752959836084
0.596469560197222 1.5777629086592375
0.3580207754150855 0.9875436325078346
-1.0219255708606194 2.046317020740693
0.5405050865951606 1.7802310981855307
NA NA
-0.9178890504381046 1.3054846807609333
-0.6796574626872679 1.037019672371673
0.46399758935919044 1.0507475822063104
NA NA
0.4215996368948839 NA
0.3989282484392754 1.4819945351492592
-0.8974638655095383 1.4983005159584262
-2.3295040862418985 NA
-0.6193311961987855 1.2184403787586593
0.5420860464082468 1.3761187818170735
NA NA
4.278315151287011 NA
1.2776415259120846 2.3294776842734968

There seems to be so many large 'SE's and NAs. All of the traits are quantitative and the sample size for my variant-level summary statistics is about 25k. I have included all the singleton sites for estimating.

Below is the warning information

... ...
Burden Heritability Regression
Daniel Weiner, Ajay Nadig, and Luke O'Connor, 2023
Running BHR at 2023-03-12 05:22:51
For trait 1, MAF ranges from 1.5e-05 to 9.9e-05
...seems reasonable
For trait 2, MAF ranges from 1.6e-05 to 9.9e-05
...seems reasonable
Lambda GC:  0.3346
Lambda GC seems low. Please check input columns against documentation
15534 genes in the BHR regression, 0 of which are significant fixed effects
...seems reasonable
Lambda GC:  0.7899
...seems reasonable
15534 genes in the BHR regression, 0 of which are significant fixed effects
...seems reasonable
BHR finished at 2023-03-12 05:23:11
Burden Heritability Regression
Daniel Weiner, Ajay Nadig, and Luke O'Connor, 2023
Running BHR at 2023-03-12 05:23:19
For trait 1, MAF ranges from 1.5e-05 to 1e-04
...seems reasonable
For trait 2, MAF ranges from 1.6e-05 to 9.9e-05
...seems reasonable
Lambda GC:  0.9617
...seems reasonable
15523 genes in the BHR regression, 1 of which are significant fixed effects
...seems reasonable
Lambda GC:  0.7899
...seems reasonable
15523 genes in the BHR regression, 1 of which are significant fixed effects
...seems reasonable
BHR finished at 2023-03-12 05:23:38
Burden Heritability Regression
Daniel Weiner, Ajay Nadig, and Luke O'Connor, 2023
Running BHR at 2023-03-12 05:23:46
For trait 1, MAF ranges from 1.5e-05 to 9.9e-05
...seems reasonable
For trait 2, MAF ranges from 1.6e-05 to 9.9e-05
...seems reasonable
Lambda GC:  0.1107
Lambda GC seems low. Please check input columns against documentation
15534 genes in the BHR regression, 0 of which are significant fixed effects
...seems reasonable
Lambda GC:  0.7899
...seems reasonable
15534 genes in the BHR regression, 0 of which are significant fixed effects
...seems reasonable
BHR finished at 2023-03-12 05:24:06
Burden Heritability Regression
Daniel Weiner, Ajay Nadig, and Luke O'Connor, 2023
Running BHR at 2023-03-12 05:24:12
For trait 1, MAF ranges from 1.5e-05 to 9.9e-05
...seems reasonable
For trait 2, MAF ranges from 1.6e-05 to 9.9e-05
...seems reasonable
Lambda GC:  0.1142
Lambda GC seems low. Please check input columns against documentation
15534 genes in the BHR regression, 0 of which are significant fixed effects
...seems reasonable
Lambda GC:  0.7899
...seems reasonable
15534 genes in the BHR regression, 0 of which are significant fixed effects
...seems reasonable
BHR finished at 2023-03-12 05:24:32

There were 50 or more warnings (use warnings() to see the first 50)
Warning messages:
1: In sqrt(h2_trait1_random * h2_trait2_random) : NaNs produced
2: In sqrt(h2_trait1_jackknife * h2_trait2_jackknife) : NaNs produced
3: In sqrt(h2_trait1_random * h2_trait2_random) : NaNs produced
4: In sqrt(h2_trait1_jackknife * h2_trait2_jackknife) : NaNs produced
5: In sqrt(h2_trait1_random * h2_trait2_random) : NaNs produced
6: In sqrt(h2_trait1_jackknife * h2_trait2_jackknife) : NaNs produced
7: In sqrt(h2_trait1_random * h2_trait2_random) : NaNs produced
8: In sqrt(h2_trait1_jackknife * h2_trait2_jackknife) : NaNs produced
9: In sqrt(h2_trait1_jackknife * h2_trait2_jackknife) : NaNs produced
10: In sqrt(h2_trait1_jackknife * h2_trait2_jackknife) : NaNs produced
11: In sqrt(h2_trait1_random * h2_trait2_random) : NaNs produced
12: In sqrt(h2_trait1_jackknife * h2_trait2_jackknife) : NaNs produced
13: In sqrt(trait1_h2_mixed * trait2_h2_mixed) : NaNs produced
14: In sqrt((t(u) %*% u) + h11) : NaNs produced
15: In sqrt((t(u) %*% u) + h11) : NaNs produced
16: In sqrt((t(u) %*% u) + h11) : NaNs produced
17: In sqrt((t(u) %*% u) + h11) : NaNs produced
18: In sqrt((t(u) %*% u) + h11) : NaNs produced
19: In sqrt(h2_trait1_jackknife * h2_trait2_jackknife) : NaNs produced
20: In sqrt(h2_trait1_jackknife * h2_trait2_jackknife) : NaNs produced
21: In sqrt(h2_trait1_random * h2_trait2_random) : NaNs produced
22: In sqrt(h2_trait1_jackknife * h2_trait2_jackknife) : NaNs produced
23: In sqrt(trait1_h2_mixed * trait2_h2_mixed) : NaNs produced
24: In sqrt((t(u) %*% u) + h11) : NaNs produced
25: In sqrt((t(u) %*% u) + h11) : NaNs produced
26: In sqrt((t(u) %*% u) + h11) : NaNs produced
27: In sqrt((t(u) %*% u) + h11) : NaNs produced
28: In sqrt((t(u) %*% u) + h11) : NaNs produced
29: In sqrt(h2_trait1_random * h2_trait2_random) : NaNs produced
30: In sqrt(h2_trait1_jackknife * h2_trait2_jackknife) : NaNs produced
31: In sqrt(h2_trait1_random * h2_trait2_random) : NaNs produced
32: In sqrt(h2_trait1_jackknife * h2_trait2_jackknife) : NaNs produced
33: In sqrt(h2_trait1_random * h2_trait2_random) : NaNs produced
34: In sqrt(h2_trait1_jackknife * h2_trait2_jackknife) : NaNs produced
35: In sqrt(h2_trait1_jackknife * h2_trait2_jackknife) : NaNs produced
36: In sqrt(h2_trait1_random * h2_trait2_random) : NaNs produced
37: In sqrt(h2_trait1_jackknife * h2_trait2_jackknife) : NaNs produced
38: In sqrt(h2_trait1_random * h2_trait2_random) : NaNs produced
39: In sqrt(h2_trait1_jackknife * h2_trait2_jackknife) : NaNs produced
40: In sqrt(trait1_h2_mixed * trait2_h2_mixed) : NaNs produced
41: In sqrt((t(u) %*% u) + h11) : NaNs produced
42: In sqrt((t(u) %*% u) + h11) : NaNs produced
43: In sqrt((t(u) %*% u) + h11) : NaNs produced
44: In sqrt((t(u) %*% u) + h11) : NaNs produced
45: In sqrt((t(u) %*% u) + h11) : NaNs produced
46: In sqrt(h2_trait1_random * h2_trait2_random) : NaNs produced
47: In sqrt(h2_trait1_jackknife * h2_trait2_jackknife) : NaNs produced
48: In sqrt(h2_trait1_random * h2_trait2_random) : NaNs produced
49: In sqrt(h2_trait1_jackknife * h2_trait2_jackknife) : NaNs produced
50: In sqrt(h2_trait1_random * h2_trait2_random) : NaNs produced

I have checked my input files and not found the solutions yet. Let me know if there may be some issues I have not noticed. Thank you in advance!

Best, Yijun

ajaynadig commented 1 year ago

Hi Yijun, This issue is a little hard to troubleshoot just based on the output--would you be able to share two of the summary statistic files so that I can reproduce and diagnose the issue? My email is anadig [at] broadinstitute [dot] org.

Best, Ajay

Asukayj commented 1 year ago

Thanks so much and I have just sent the files to this address!

CharleyXia commented 1 year ago

Any update on this closed thread? I found the same problem.

ajaynadig commented 1 year ago

Hi Charley--I replied to @Asukayj yesterday, see below for my message:

I think I've diagnosed the issue. Bear with me, it is a little technical:

In a regression-based estimator like BHR, colinearity can cause signal to "leak" from one term into another. In the case of BHR, we noticed that the intercepts from the pLoF models were often greater than the synonymous models (see Supp Fig 8, first set of bar plots from left), suggesting that this was happening. To address this, we added "null burden statistics" to the regression; this is described in the methods paragraph that begins "Second, we incorporate null burden statistics that effectively fix the BHR intercept and ameliorate bias in its slope...", and in Supplementary Figure 8 from our published manuscript, which shows the synonymous and pLoF intercepts getting closer as we include more and more sets of null burden statistics in the regression.

It was less obvious how to implement this for genetic correlation, so the current rg method does not use null burden statistics for heritability or covariance estimation. At the time, it was unclear how much this would matter in practice, and it didn't seem to affect our results very much. However, this appears to make a big difference for the traits you are analyzing.

The non-significant heritability from the rg function is likely the cause of the odd rg estimates; in the setting of small/nonsignificant h2, rg estimates are generally very unstable.

I will discuss this issue with my collaborators, but I believe that a quick fix is to simply use null burden statistics for heritability but not covariance estimation within the rg function. To that end, I have changed the code on github to include an addition flag "use_null_conditions_rg". You can use this flag as follows:

bhr_cor<-BHR(mode = "bivariate", trait1_sumstats = d_sumstats, trait2_sumstats = v_sumstats, annotations = list(baseline_ori), num_null_conditions = 5, use_null_conditions_rg = TRUE)

At a quick glance, it looks like the BHR rg estimates are now more reasonable for the sumstats you provided (although unfortunately still a bit underpowered, which is in line with our analyses in this N range; see bipolar + schizophrenia analyses from our paper). Can you download the updated BHR functions and let me know if this works for you? Thanks!

CharleyXia commented 1 year ago

Thanks, ajaynadig! It makes prefect sense! This happens when heritability is samll and non significiant. I will try the new codes. Thanks again for the help.

ajaynadig commented 1 year ago

Closing this issue for now, feel free to reopen if other issues arise.

ajaynadig / bhr

Unreasonable output for genetic correlation and warnings. #6