cumc / xqtl-protocol

Molecular QTL analysis protocol developed by ADSP Functional Genomics Consortium
https://cumc.github.io/xqtl-protocol/
MIT License
41 stars 43 forks source link

How to best save per gene data #154

Open gaow opened 2 years ago

gaow commented 2 years ago

We need to save per gene data for workflows with SuSiE, mvSuSiE, MR-MASH and FUSION.

Currently we have two ways to save it

  1. in RDS format for R based methods -- very flexible, can save genotype X, multi-condition phenotype Y, residual phenotype Y-res, summary stats b and s, and LD matrix R.
  2. genotype in PLINK format for FUSION; phenotype still in bed format

I propose to merge them and adjust SuSiE etc pipeline accordingly

Per-gene data storage

  1. Genotype X in PLINK format
  2. multi-condition phenotype Y as a matrix with rows being the samples in X. There will be missing data in this Y matrix per condition, in gz format
  3. compute and save residual Y-res the same was as 2 above. We may already have some functions in susieR package to do that; if not I'll create one based on our current residual_Y module. Save it in gz format
  4. we don't do anything to summary stats and LD matrix at this point.

FUSION analysis

Take 1 and 3 from above. We need to modify the FAM file on the fly to put in the Y-res from 3.

SuSiE/mvSuSiE

  1. read_plink in R to get genotype X
  2. I will create (copy from existing) a utility function to susieR to fill missing data in X by mean imputation, to center and standardize it
  3. I will create (copy from existing) a utility function to susieR to take covariates and regress covariates out of X's
  4. load Y-res for analysis

Summary statistics and LD

For integrative analysis we should not attempt to deal with merging summary statistics (sumstats) and LD in any individual module. They are non-trivial. I proposed it #153. We should not save LD per analysis unit, if there is a good way to generate it on the fly.

For sumstats itself,

gaow commented 2 years ago

RE RSS version of workflows: before we have some progress on #153, perhaps the best way to run SuSiE RSS is to do it through polyfun::finemapper.py that wraps susie_rss after summary stats merger with reference genotype (in PLINK format) used for LD.

mvsusie_rss is trickier. I dont think we need to run it anyways at this point, for it's still unpublished method. We'll run mvsusie though using the input format I posted earlier above. Once we make progress on #153, we can get per gene summary stats easily with VCF related tools, and compute LD on the fly given genomic region; then feed that to mvsusie_rss

hsun3163 commented 2 years ago

For per gene phenotype, I think it is best to save the per gene res_Y for each study and then merged them as need arise. I think this is more flexible as to what studies got analysis together down the road.