How to best save per gene data

gaow commented 2 years ago

We need to save per gene data for workflows with SuSiE, mvSuSiE, MR-MASH and FUSION.

Currently we have two ways to save it

in RDS format for R based methods -- very flexible, can save genotype X, multi-condition phenotype Y, residual phenotype Y-res, summary stats b and s, and LD matrix R.
genotype in PLINK format for FUSION; phenotype still in bed format

I propose to merge them and adjust SuSiE etc pipeline accordingly

Per-gene data storage

Genotype X in PLINK format
multi-condition phenotype Y as a matrix with rows being the samples in X. There will be missing data in this Y matrix per condition, in gz format
compute and save residual Y-res the same was as 2 above. We may already have some functions in susieR package to do that; if not I'll create one based on our current residual_Y module. Save it in gz format
we don't do anything to summary stats and LD matrix at this point.

FUSION analysis

Take 1 and 3 from above. We need to modify the FAM file on the fly to put in the Y-res from 3.

SuSiE/mvSuSiE

read_plink in R to get genotype X
I will create (copy from existing) a utility function to susieR to fill missing data in X by mean imputation, to center and standardize it
I will create (copy from existing) a utility function to susieR to take covariates and regress covariates out of X's
load Y-res for analysis

Summary statistics and LD

For integrative analysis we should not attempt to deal with merging summary statistics (sumstats) and LD in any individual module. They are non-trivial. I proposed it #153. We should not save LD per analysis unit, if there is a good way to generate it on the fly.

For sumstats itself,

[ ] We save conventional association tests sumstats (tensorQTL, APEX) to VCF files, for multiple studies
- Alternatively, we have to "merge" them first using #153 when they come from different sources; then convert to VCF
- We perform METAL and/or MTAG here as necessary for various subsets of studies, then convert these meta-analysis results to VCF
[ ] For MASH and mixture prior we need to modify the "fastqtl to mash pipeline" to get MASH data from the VCF file. This is perhaps best done with gwas_vcf R package?
[ ] We convert results from fine-mapping to VCF files
[ ] We convert results from MR-MASH to VCF files

gaow commented 2 years ago

RE RSS version of workflows: before we have some progress on #153, perhaps the best way to run SuSiE RSS is to do it through polyfun::finemapper.py that wraps susie_rss after summary stats merger with reference genotype (in PLINK format) used for LD.

mvsusie_rss is trickier. I dont think we need to run it anyways at this point, for it's still unpublished method. We'll run mvsusie though using the input format I posted earlier above. Once we make progress on #153, we can get per gene summary stats easily with VCF related tools, and compute LD on the fly given genomic region; then feed that to mvsusie_rss

hsun3163 commented 2 years ago

For per gene phenotype, I think it is best to save the per gene res_Y for each study and then merged them as need arise. I think this is more flexible as to what studies got analysis together down the road.

cumc / xqtl-protocol