ajaynadig / bhr

Suite of heritability and genetic correlation estimation tools for exome-sequencing data
MIT License

The role of 'Baseline-BHR' input/gene set annotation #5

Closed Asukayj closed 1 year ago

Asukayj commented 1 year ago

Hi,

I am doing a quick run of BHR on my analysis to evaluate certain types of burden heritability and get some intuition. However, I am a little confused about the role of the required 'baseline file' (in the gene annotation part). Specifically, from your Wiki page, I assume it is a gene annotation file that controls for frequency-dependent architecture and eliminates bias in the estimation, but I am not clear on how 'quintiles of the observed/expected loss-of-function distribution' (the case in your example) can achieve such a result. In other words, why can't I just input my genes of interest as a set instead of going through the gene annotation process used for your baseline file? Let me know if I should refer to some part of your paper.

Practically, could you give me some hints on how to prepare my baseline file if ① I want to estimate the burden heritability or genetic correlation on my own data
② I want to do 'aggregate BHR analysis' (should I change the baseline file or just use the previous file? If I should change it, how should I prepare the new one?)

I appreciate your work and help!

danjweiner commented 1 year ago

Hi there,

Thanks for your interest in BHR and your questions!

A plausible concern is that genes under selective constraint 1) have fewer alleles and thus smaller burden scores, and 2) have larger mean effect sizes. This scenario would reduce the slope of a BHR regression, thus creating downward bias in a heritability estimate. To address this concern, we add a set of covariate annotations to the BHR regression called the baseline model. The baseline model is an annotation set for gene constraint. There are many ways to estimate the selective constraint of a gene; in BHR, we use the observed/expected pLoF ratio from gnomAD. We found that dividing genes into 5 groups ("observed/expected quintiles") based on constraint was sufficient to control for this bias. For more details, see "Independence assumption and selection-related bias" in the methods section of the manuscript.
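
For intuition only, here is a minimal Python sketch of how "observed/expected quintile" annotations could be built. It is not taken from the BHR codebase and does not reproduce the format of the provided baseline file. The gene names, o/e values, and the helpers `assign_quintiles` and `one_hot` are all hypothetical.

```python
# Toy sketch: bin genes into quintiles of pLoF observed/expected (o/e)
# and turn the bins into 0/1 annotation columns.
# All names and values are hypothetical, not BHR's own.

def assign_quintiles(oe_by_gene):
    """Map each gene to a bin 1..5 by the rank of its o/e value.

    Bin 1 holds the lowest o/e values (most constrained genes),
    bin 5 the highest. Ties are broken by gene name so the result
    is deterministic.
    """
    ranked = sorted(oe_by_gene, key=lambda gene: (oe_by_gene[gene], gene))
    n_genes = len(ranked)
    return {gene: (rank * 5) // n_genes + 1 for rank, gene in enumerate(ranked)}


def one_hot(bins, n_bins=5):
    """Convert bin labels into one indicator column per bin."""
    return {
        gene: [1 if label == k else 0 for k in range(1, n_bins + 1)]
        for gene, label in bins.items()
    }


# Hypothetical o/e values for ten genes.
toy_oe = {
    "GENE_A": 0.05, "GENE_B": 0.12, "GENE_C": 0.30, "GENE_D": 0.41,
    "GENE_E": 0.55, "GENE_F": 0.63, "GENE_G": 0.78, "GENE_H": 0.90,
    "GENE_I": 1.10, "GENE_J": 1.35,
}

bins = assign_quintiles(toy_oe)
annotations = one_hot(bins)

for gene in sorted(annotations):
    print(gene, bins[gene], annotations[gene])
```

Each gene ends up in exactly one of the five constraint groups, which is what lets the regression treat each group separately.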

If I understand your question, you also want to analyze a gene set. Please refer to this section of the Wiki for details and let us know if you have additional questions.

Finally, you also asked about preparing your own baseline file for running BHR. You are certainly welcome to create your own, but we suspect that most users won't need to. We've provided one for your use here. This same baseline file is appropriate for all standard BHR analyses you mention (univariate, genetic correlation, aggregation).

I hope that has answered your questions, but please let us know if you have additional ones.

Best, Team BHR

Asukayj commented 1 year ago

Thanks so much for your detailed explanation! I reviewed the "Independence assumption and selection-related bias" part of the paper again, and your clear explanation greatly helped me understand your method and idea.

Thank you again and I will continue to play with the BHR!

Cheers, Yijun

ajaynadig commented 1 year ago

Great, feel free to reopen if additional issues arise.

hoangthienan95 commented 12 months ago

Thanks for the great explanation of the role of the baseline model. I have some questions on the topic:

1) Have you compared running BHR with a baseline built from quintiles of other selective constraint metrics, like pLI?

2) Also, if I'm only analyzing missense or synonymous variants, should I still use the pLoF o/e quintiles as the baseline, or use missense/synonymous o/e accordingly? I guess a related question would be: is it common to have traits whose selective constraint differs between pLoF/missense/synonymous by enough to switch quintiles?

danjweiner commented 12 months ago

Hi -- thanks for your follow-up questions!

1) The baseline model is a regression covariate that attempts to control for the effect of selection: genes under selective constraint have smaller burden scores and larger mean effect sizes, creating downward bias in a heritability estimate. Accordingly, the baseline model should be derived from the most direct estimate of selection. We found that LoF o/e provided this. As you suggested, we evaluated other constraint estimators like LOEUF and found that heritability estimates were attenuated in those models, suggesting that LOEUF did not capture constraint as well as LoF o/e.

2) The LoF baseline model should be used for all BHR runs, as it provides the best estimate of gene-level selection. In contrast, assuming there is no selection against synonymous variants, a baseline model derived from synonymous o/e would not be informative about gene-level selection.
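
As a rough illustration of the downward bias described in point 1, here is a noise-free toy in Python. It is a sketch under strong assumptions and not the BHR estimator. It assumes each gene's expected squared burden statistic equals `tau * w`, where `w` is the burden score and `tau` is a per-group effect-size parameter, and that heritability is the sum of `tau * w` over genes. All names and numbers are hypothetical.

```python
# Toy sketch: constrained genes have small burden scores but large
# effects, so one pooled regression slope underestimates heritability.
# Fitting one slope per constraint group recovers it in this toy.

def ols_slope(x, y):
    """Slope of an ordinary least-squares fit of y on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Hypothetical groups: "w" are burden scores, "tau" the effect-size scale.
constrained = {"w": [1.0, 2.0, 3.0], "tau": 4.0}      # few alleles, large effects
unconstrained = {"w": [8.0, 9.0, 10.0], "tau": 1.0}   # many alleles, small effects
groups = [constrained, unconstrained]


def expected_stats(group):
    """Expected squared burden statistic per gene under the toy model."""
    return [group["tau"] * w for w in group["w"]]


# True heritability under the toy model: sum of tau * w over all genes.
true_h2 = sum(sum(expected_stats(g)) for g in groups)

# Naive estimate: a single pooled slope times the total burden score.
all_w = [w for g in groups for w in g["w"]]
all_y = [y for g in groups for y in expected_stats(g)]
naive_h2 = ols_slope(all_w, all_y) * sum(all_w)

# Stratified estimate: one slope per constraint group, summed.
stratified_h2 = sum(
    ols_slope(g["w"], expected_stats(g)) * sum(g["w"]) for g in groups
)

print("true:", true_h2)
print("naive (pooled slope):", naive_h2)
print("stratified (per-group slope):", stratified_h2)
```

With these toy numbers the pooled estimate comes out far below the true value of 51, while the per-group fit recovers it. Real data are noisy and BHR uses five o/e groups rather than two, so this only shows the direction of the bias.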

Hope this helps! Dan

hoangthienan95 commented 12 months ago

Thanks @danjweiner! I'll use the LoF baseline from now on, even when analyzing only missense variants.