bulik / ldsc

LD Score Regression (LDSC)
GNU General Public License v3.0

Questions Regarding S-LDSC Analysis with --ref-ld-chr-cts #447

Open YingkaiSun opened 1 month ago

YingkaiSun commented 1 month ago

Hi! I am currently using the --ref-ld-chr-cts option to perform S-LDSC analysis, and I have a few questions:

  1. The output of my analysis only includes the columns Name, Coefficient, Coefficient_std_error, and Coefficient_P_value. Could you please advise on how I can obtain the prop.h2 and enrichment results from this analysis?
  2. Additionally, I am curious about the role of the control annotations in cell-type-specific analyses. Are these control annotations used to select SNPs for inclusion in the regression, or are they included alongside the target annotations as covariates in a joint analysis? Alternatively, could the control annotations be used as effect modifiers to compare the slope between SNPs where control=1 versus control=0?

Thank you very much for your time and assistance. I appreciate the incredible tool you’ve developed, and any guidance on these questions would be greatly appreciated.

Best regards, Sun Yingkai

aksarkar commented 1 month ago

  1. You would need to modify the source code to do this.
  2. They are included alongside the target annotation as covariates in a joint analysis. The meaning of the regression coefficient is exactly how the chi^2 statistic changes when the annotation changes from 0 to 1, holding all other values fixed.
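
For the first point, the two missing quantities can in principle be derived from the fitted coefficients and the annotation files. The sketch below is not ldsc's own code. It assumes binary annotations and follows the definitions in Finucane et al. 2015: the per-SNP heritability is the sum of the coefficients of the annotations that contain the SNP, the proportion of h2 for an annotation is the per-SNP heritability summed over its SNPs divided by the total, and enrichment is that proportion divided by the proportion of SNPs. The function name `prop_h2_and_enrichment` and the toy inputs are hypothetical.

```python
import numpy as np

def prop_h2_and_enrichment(annot, tau):
    """annot: (M SNPs x C annotations) 0/1 matrix; tau: length-C coefficients."""
    annot = np.asarray(annot, dtype=float)
    tau = np.asarray(tau, dtype=float)
    per_snp_h2 = annot @ tau                    # Var(beta_j) = sum_c a_c(j) * tau_c
    total_h2 = per_snp_h2.sum()
    h2_by_annot = annot.T @ per_snp_h2          # per-SNP h2 summed over SNPs in each annotation
    prop_h2 = h2_by_annot / total_h2
    prop_snps = annot.sum(axis=0) / annot.shape[0]
    return prop_h2, prop_h2 / prop_snps         # (prop. h2, enrichment)

# Toy input (made up): column 0 contains every SNP, column 1 is a target annotation
annot = [[1, 1], [1, 1], [1, 0], [1, 0]]
tau = [1.0, 2.0]
prop_h2, enrichment = prop_h2_and_enrichment(annot, tau)
```

Standard errors are not covered here; obtaining them would need the jackknife estimates that ldsc computes internally.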

YingkaiSun commented 1 month ago

Thank you for your response! I would like to seek further clarification on the interpretation of regression coefficients when control annotations are included in the S-LDSC model.

Here is a snippet from the .ldcts file Multi_tissue_gene_expr.ldcts:

V1 V2
Adipose_Subcutaneous Multi_tissue_gene_expr_1000Gv3_ldscores/GTEx.1.,Multi_tissue_gene_expr_1000Gv3_ldscores/GTEx.control.
Adipose_Visceral_(Omentum) Multi_tissue_gene_expr_1000Gv3_ldscores/GTEx.2.,Multi_tissue_gene_expr_1000Gv3_ldscores/GTEx.control.
Adrenal_Gland Multi_tissue_gene_expr_1000Gv3_ldscores/GTEx.3.,Multi_tissue_gene_expr_1000Gv3_ldscores/GTEx.control.
Artery_Aorta Multi_tissue_gene_expr_1000Gv3_ldscores/GTEx.4.,Multi_tissue_gene_expr_1000Gv3_ldscores/GTEx.control.
Artery_Coronary Multi_tissue_gene_expr_1000Gv3_ldscores/GTEx.5.,Multi_tissue_gene_expr_1000Gv3_ldscores/GTEx.control.
... ...

I’m curious about how to interpret the regression coefficients for both the target and control annotations when the control annotation is used as a covariate in the regression model, especially when this control annotation is shared across multiple tissues.

For instance, in the Multi_tissue_gene_expr.ldcts file, multiple tissue annotations share the same control: Multi_tissue_gene_expr_1000Gv3_ldscores/GTEx.control. Does this mean that the regression coefficient for the target annotation (where annotation = 1 vs. 0) is independent of whether the control annotation is equal to 1 or 0? If so, what exactly does the control annotation represent in this context, and how should we interpret its coefficient when included in the model?

Thank you very much for your time and assistance.

Best regards, Sun Yingkai

aksarkar commented 1 month ago

Recall that ldsc fits a multiple linear regression of chi^2 statistics onto the LD scores partitioned by each annotation.
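
Written out, this is the model from Finucane et al. 2015. The notation is assumed here: N is the GWAS sample size, ℓ(j, c) is the LD score of SNP j with respect to annotation c, a_c(k) is the value of annotation c at SNP k, τ_c is the coefficient for annotation c, and a captures confounding.

```latex
\mathbb{E}\left[\chi^2_j\right] = N \sum_{c} \tau_c \, \ell(j, c) + N a + 1,
\qquad
\ell(j, c) = \sum_{k} a_c(k) \, r^2_{jk}
```

In a cell-type-specific run, the sum over c would cover the baseline annotations, the target annotation, and the control annotation, all fitted jointly.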

This means that the regression coefficient for the target annotation is not independent of the control annotations.

In the context of looking for cell-type specific annotations, the control annotations (baseline model) are meant to represent non-specific sources of heritability enrichment.

The reason to include them is to provide evidence that an enrichment for a cell-type specific annotation is indeed driven by something cell-type specific, and not something non-specific (for example, promoter-associated histone modifications).
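
A toy simulation can illustrate the point. This is a sketch on synthetic numbers, not ldsc's estimator: plain least squares via `numpy.linalg.lstsq` stands in for ldsc's weighted regression, the noise is Gaussian for simplicity, and every name and value is made up. The signal is placed entirely in the control LD score, which is correlated with the target LD score.

```python
import numpy as np

rng = np.random.default_rng(0)
n_snps = 20000
N = 1000  # hypothetical GWAS sample size

# Synthetic LD scores; the target LD score is correlated with the control LD score
l_control = rng.gamma(shape=2.0, scale=10.0, size=n_snps)
l_target = 0.3 * l_control + rng.gamma(shape=2.0, scale=2.0, size=n_snps)

# True coefficients: all signal comes from the control, none from the target
tau_target, tau_control = 0.0, 1e-4
chi2 = N * (tau_target * l_target + tau_control * l_control) + 1
chi2 = chi2 + rng.normal(scale=0.5, size=n_snps)  # simplified noise model

def fit(columns):
    """Least-squares slopes on the tau scale (intercept fitted, then dropped)."""
    X = np.column_stack(columns + [np.ones(n_snps)])
    coef, *_ = np.linalg.lstsq(X, chi2, rcond=None)
    return coef[:-1] / N

tau_alone = fit([l_target])[0]             # target fitted without the control
tau_joint = fit([l_target, l_control])[0]  # target fitted jointly with the control
print(f"alone: {tau_alone:.2e}, joint: {tau_joint:.2e}")
```

Fitted alone, the target picks up the control's signal and looks clearly positive. Fitted jointly, its coefficient falls back to approximately zero, which is the behaviour the control is meant to provide.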

This does seem to require using the "standard" baseline model referred to in Finucane et al. 2015, etc. in addition to any additional controls.

YingkaiSun commented 1 month ago

Thank you so much for your detailed response! I realize that my previous question may not have been clear, so here is a more detailed description.

In the context of cell type specific analyses, as demonstrated in the wiki’s demo code, the --ref-ld-chr-cts flag is used to specify a .ldcts file that includes both target and control annotations for each cell or tissue type. Below is an example of the demo code provided:

ldsc.py \
    --h2-cts UKBB_BMI.sumstats.gz \
    --ref-ld-chr 1000G_EUR_Phase3_baseline/baseline. \
    --out BMI_${cts_name} \
    --ref-ld-chr-cts $cts_name.ldcts \
    --w-ld-chr weights_hm3_no_hla/weights.

In this code, the --ref-ld-chr flag specifies the use of the baseline model (1000G_EUR_Phase3_baseline/baseline.), which, as you mentioned, captures broad non-specific sources of heritability enrichment. However, the --ref-ld-chr-cts flag simultaneously specifies a .ldcts file, which includes both target and control annotations. Here is an example of what such a .ldcts file might look like:

V1 V2
Adipose_Subcutaneous GTEx.1.,GTEx.control.
Adipose_Visceral_(Omentum) GTEx.2.,GTEx.control.
Adrenal_Gland GTEx.3.,GTEx.control.
Artery_Aorta GTEx.4.,GTEx.control.
Artery_Coronary GTEx.5.,GTEx.control.
... ...
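
To make the layout concrete, the following sketch splits such a file into a target prefix and a list of control prefixes per cell type. It is an illustration only, not ldsc's parser. It assumes whitespace-separated columns and comma-separated prefixes, and it assumes the `V1 V2` row is a display header added when the file was read into R rather than a line in the file itself. The helper name `parse_ldcts` is hypothetical.

```python
def parse_ldcts(text):
    """Return {name: (target_prefix, [control_prefixes])} from .ldcts text."""
    entries = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        name, prefixes = fields
        # First prefix is the target LD score set; any others are controls
        target, *controls = prefixes.split(",")
        entries[name] = (target, controls)
    return entries

example = """Adipose_Subcutaneous GTEx.1.,GTEx.control.
Adrenal_Gland GTEx.3.,GTEx.control.
"""
parsed = parse_ldcts(example)
```

Here `parsed["Adipose_Subcutaneous"]` holds the target prefix `GTEx.1.` together with the single control prefix `GTEx.control.`.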

When I read the corresponding annotation files in R, this is what I found. For the target annotation:

> fread('/syk12961/reference/ldsc/LDSCORE-SEG/Multi_tissue_gene_expr_1000Gv3_ldscores/GTEx.1.1.annot.gz')
   ANNOT
   <int>
1:     0
2:     0
3:     0
4:     0
5:     0
...
779350:     0
779351:     0
779352:     0
779353:     0
779354:     0

   ANNOT    n
   <int>  <int>
1:     0 604825
2:     1 174529

For the control annotation:

> fread('/syk12961/reference/ldsc/LDSCORE-SEG/Multi_tissue_gene_expr_1000Gv3_ldscores/GTEx.control.1.annot.gz')
   All_Genes
      <int>
1:       1
2:       1
3:       1
4:       1
5:       1
...
779350:       1
779351:       1
779352:       1
779353:       1
779354:       1

   All_Genes    n
   <int>    <int>
1:     0    82864
2:     1   696490
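
As a quick check on the two count tables, the share of SNPs covered by each annotation can be computed directly. The sketch uses only the counts shown above, which come from single files (presumably one chromosome each, going by the file names); the variable and function names are made up.

```python
# Counts copied from the two tables above
target_counts = {0: 604825, 1: 174529}    # ANNOT column of the target file
control_counts = {0: 82864, 1: 696490}    # All_Genes column of the control file

def fraction_annotated(counts):
    """Share of SNPs whose annotation value is 1."""
    return counts[1] / (counts[0] + counts[1])

frac_target = fraction_annotated(target_counts)
frac_control = fraction_annotated(control_counts)
print(f"target: {frac_target:.3f}, control: {frac_control:.3f}")
```

Roughly 22% of the SNPs in these files fall in the target annotation and roughly 89% in the control, so the control is a much broader set than the target.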

Given this setup, my questions are:

  1. Are these control annotations being used as covariates to adjust for non-specific effects in a similar manner to the baseline model, or do they serve a different role within the S-LDSC framework?
  2. How should we interpret the regression coefficients for these control annotations if they are included alongside the target annotations in the model?

I hope this explanation clarifies my questions. I would be very grateful for any further insights you could provide. Thank you once again for your time and response!

aksarkar commented 3 weeks ago

As stated in the wiki:

Each line has two sets of LD scores to include: one is the set of LD scores corresponding to the specifically expressed genes in the cell type, while the second one is a "control" gene set of all genes. The result that will be reported will be the regression coefficient for the first set of LD scores in the list.

So, the answers to the questions are still as I gave above.

The interpretation of the coefficient for the control annotation is the same as for any other annotation.

YingkaiSun commented 2 weeks ago

Thanks! After reviewing the wiki carefully, I think I have figured it out. The control annotation covers all gene-related SNPs, so it can be considered an extra adjustment on top of the baseline model that keeps the results comparable across cell types or tissues. It functions in the same way as the baseline model. Is that right?