Open gaow opened 4 years ago
@hsun3163 thanks, I saw your TWAS collection. Sorry, I didn't realize it was not obvious from that page. There is this line on the FUSION website:
Expression weights were typically computed from BLUP, BSLMM, LASSO, Elastic Net ...
These are actually the methods I was referring to that you should read about. The other papers are nice, but they have slightly different focuses (colocalization, fine-mapping, mediation analysis), which we will also learn about down the road. The methods papers above can help you understand the basic model and math.
A note on Elastic Net: this is actually the PrediXcan paper (from Haky Im's group: Gamazon et al. 2015, Nature Genetics), so please use that paper as the methods reference for Elastic Net. They also have a version, S-PrediXcan, for using summary statistics; please include that paper too.
This ticket outlines what I have in mind for now to get us started. I might continue to edit this post as I think of more to add or reorganize, so please check back on it from time to time.
@hsun3163 for starters, please click on the `watch` button at the top right of the repo to receive notifications for new tickets opened.
In between the lines below, I'll add some TODO boxes for you to complete and check off.
TWAS background & warming up
A starting point is the FUSION package. In addition to being a software meta-package, it is also a nice resource for finding TWAS reference papers and for downloading a small test dataset to learn about the analysis.
Methods-wise, my two high-level suggestions are: 1) view the methods as variable selection (VS) tasks in regression, and compare them from a quantitative genetics point of view by asking whether the assumption made by each VS method makes sense in the context of genetics; 2) understand how each method works with so-called "GWAS summary statistics".
Please slack me if you have any questions about details in those papers. You don't have to understand all the papers at once. You can get a rough idea for now and re-read them as you work on the project.
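On point 2) above, here is a minimal sketch of how GWAS summary statistics can stand in for individual-level data. It assumes the FUSION-style test statistic, z_TWAS = w'z / sqrt(w'Rw), where w are the expression weights, z are the per-SNP GWAS z-scores and R is the SNP correlation (LD) matrix. The function name `twas_z` and all numbers are made up for illustration; please check the exact form against the methods papers.

```python
import math

def twas_z(weights, gwas_z, ld):
    """Summary-statistics TWAS z-score: (w' z) / sqrt(w' R w).

    weights: expression weights per SNP (e.g. from Elastic Net)
    gwas_z:  GWAS z-scores for the same SNPs, in the same order
    ld:      SNP-by-SNP correlation (LD) matrix as nested lists
    """
    p = len(weights)
    numerator = sum(w * z for w, z in zip(weights, gwas_z))
    variance = sum(weights[i] * ld[i][j] * weights[j]
                   for i in range(p) for j in range(p))
    return numerator / math.sqrt(variance)

# Hypothetical toy inputs: three SNPs, one of which has zero weight.
weights = [0.5, 0.0, -0.2]
gwas_z = [3.0, 1.0, -2.0]
ld = [
    [1.0, 0.3, 0.0],
    [0.3, 1.0, 0.1],
    [0.0, 0.1, 1.0],
]

z = twas_z(weights, gwas_z, ld)
print(z)
```

Note that SNPs with zero weight drop out of both the numerator and the denominator, which is one reason the choice of VS method matters for the final test.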
Our project
At Columbia Neurology we have multi-omics data from brains for thousands of individuals. This is a terrific resource because, as you'll learn from the papers in the background section, multi-omics molecular phenotypes can be tissue / cell type specific. Since disease pathology is also likely tissue specific (e.g. Alzheimer's disease (AD) and brain tissues), it would make the most sense to train a prediction model on our brain multi-omics data and use that to map neurological disease associations.
Get some TWAS done
Here is a rough analysis outline:
- `Dataset M`: let's start with gene expression, alternative splicing, chromatin accessibility and DNA methylation marks.
- `Dataset M` can be used to infer molecular phenotypes for any other input individuals. Say we have another dataset, `Dataset G`, with just genotypes but no molecular phenotype data; then we can take the effect sizes estimated from `M` and apply them to `G` to predict molecular phenotypes on `G`.
- For `G` we also have its phenotype, e.g. AD status. So with the predicted molecular phenotypes we can test for association between molecular phenotype levels, such as gene expression, and AD status. We can then find the genes that differ in molecular phenotypes between AD cases and controls.
All analyses have to be made into nicely documented SoS pipelines. I have some code on the `jump-start` branch in this repository as your starting point. We can talk about that branch in our meeting.
The analysis above is just conventional TWAS, but it can be technically challenging to work with different molecular phenotypes. To name a few challenges I can think of now:
When you read the reference papers, particularly the practical papers, you should keep these questions in mind and find answers to them.
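To make the rough analysis outline above concrete, here is a toy sketch of the three steps: estimate weights on `M`, predict on `G`, then test against AD status. Everything in it is an assumption for illustration: the data are simulated, the function names are hypothetical, Elastic Net is fitted with a plain coordinate descent loop rather than any package's implementation, and a Welch t-statistic stands in for a proper association test. It is not the code on the `jump-start` branch.

```python
import random
import statistics

rng = random.Random(0)
TRUE_BETA = [0.8, 0.0, 0.0, -0.5, 0.0]  # simulated SNP effects on expression

def simulate_genotypes(n, p, maf=0.3):
    # Each genotype is a 0/1/2 allele count.
    return [[sum(rng.random() < maf for _ in range(2)) for _ in range(p)]
            for _ in range(n)]

def simulate_expression(X):
    return [sum(b * x for b, x in zip(TRUE_BETA, row)) + rng.gauss(0, 0.5)
            for row in X]

def soft_threshold(x, t):
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def fit_elastic_net(X, y, lam=0.05, alpha=0.5, n_iter=50):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam*(alpha*|b|_1 + (1-alpha)/2*|b|_2^2)."""
    n, p = len(X), len(X[0])
    col_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    cols = [[row[j] - col_mean[j] for row in X] for j in range(p)]  # centered SNPs
    resid = [yi - y_mean for yi in y]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            ss = sum(v * v for v in cols[j]) / n
            if ss == 0.0:
                continue  # monomorphic SNP carries no information
            rho = sum(v * r for v, r in zip(cols[j], resid)) / n + ss * beta[j]
            new = soft_threshold(rho, lam * alpha) / (ss + lam * (1 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - v * delta for r, v in zip(resid, cols[j])]
                beta[j] = new
    intercept = y_mean - sum(b * m for b, m in zip(beta, col_mean))
    return intercept, beta

def welch_t(a, b):
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se

# Step 1: Dataset M has genotypes and expression; estimate the weights.
M_geno = simulate_genotypes(300, 5)
M_expr = simulate_expression(M_geno)
intercept, weights = fit_elastic_net(M_geno, M_expr)

# Step 2: Dataset G has genotypes only; predict expression from the weights.
G_geno = simulate_genotypes(500, 5)
G_pred = [intercept + sum(w * x for w, x in zip(weights, row)) for row in G_geno]

# Step 3: G also has AD status (simulated here from the unobserved true
# expression plus noise); compare predicted expression in cases vs controls.
G_status = [1 if e + rng.gauss(0, 1) > 0.18 else 0
            for e in simulate_expression(G_geno)]
cases = [v for v, s in zip(G_pred, G_status) if s == 1]
controls = [v for v, s in zip(G_pred, G_status) if s == 0]
t_stat = welch_t(cases, controls)

print("weights:", [round(w, 3) for w in weights])
print("t statistic:", round(t_stat, 2))
```

In a real analysis each of these steps would be run per gene (or per molecular feature) and wrapped as a step in an SoS pipeline, with covariates handled in the association test.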
Do something novel
We can talk about these more novel analyses after the above are done.