Project aims and action plans

This ticket outlines what I have in mind for now to get us started. I may continue to edit this post as I think of more to add or reorganize, so please check back from time to time.

@hsun3163 for starters, please click on the watch button at the top right of the repo to receive notifications for new tickets opened.

Between the lines below, I'll add some TODO boxes for you to complete and check off.

TWAS background & warming up

A good starting point is the FUSION package. In addition to being a software meta-package, it is also a nice resource for finding TWAS reference papers and for downloading a small test data set to learn about the analysis.

[x] Download all papers listed on FUSION website to Mendeley reference manager in a folder called TWAS, and share that folder with @gaow (my Mendeley email is wang.gao@columbia.edu)
[x] Additionally, please also collect and read these papers:
- Ndungu et al (2020) AJHG
- Gusev et al (2018) Nat Genet (ignore it if FUSION website has it already)
[x] Find some recent and/or high-profile application papers using these methods, as a practical guide for your TWAS analysis. You can use Web of Science (free access through the Columbia library; you need your CU VPN on) or Google Scholar to find papers that cite them. I find Web of Science pretty neat.
[x] Download and analyze the example data set provided by the TWAS FUSION package. The key is to try to write it as an SoS workflow. I'll share with you some example workflows we have been developing, but you should have already seen some toy examples in the orientation material. It is extremely important that all code developed for this project is easily reproducible and can be reused for other similar projects!

Methods-wise, I have two high-level suggestions: 1) View the methods as variable selection (VS) tasks in regression, and compare them from a quantitative genetics point of view: for each assumption a VS method makes, ask whether it makes sense in the context of genetics. 2) Understand how they work with so-called "GWAS summary statistics".
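As a warm-up for point 2, here is a minimal sketch of how per-variant GWAS z-scores can be combined into a single gene-level statistic using prediction weights and an LD matrix. The weighted-sum form used here is my understanding of the usual summary-statistic TWAS test, so please check it against the papers. The function name and the toy numbers are made up for illustration.

```python
import numpy as np


def twas_z_from_summary(weights, gwas_z, ld):
    """Combine per-variant GWAS z-scores into one gene-level z-score.

    weights : (p,) prediction weights for one gene (from the trained model)
    gwas_z  : (p,) GWAS z-scores for the same variants, same allele coding
    ld      : (p, p) correlation (LD) matrix of those variants,
              e.g. estimated from a reference panel
    """
    weights = np.asarray(weights, dtype=float)
    gwas_z = np.asarray(gwas_z, dtype=float)
    ld = np.asarray(ld, dtype=float)

    # Weighted sum of z-scores, scaled by its standard deviation under the null
    numerator = weights @ gwas_z
    variance = weights @ ld @ weights
    if variance <= 0:
        raise ValueError("weights give zero predicted variance")
    return numerator / np.sqrt(variance)


# Toy example with made-up numbers: three variants, the first two in LD
w = np.array([0.5, 0.0, -0.2])
z = np.array([3.0, 2.5, -1.0])
ld = np.array([
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
z_gene = twas_z_from_summary(w, z, ld)
```

Note that only the weights, the GWAS z-scores and an LD estimate are needed here; no individual-level genotypes from the GWAS cohort are involved.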

Please Slack me if you have any questions about details in those papers. You don't have to understand all the papers at once. You can get a rough idea for now and re-read them as you work on the project.

Our project

At Columbia Neurology we have multi-omics data from brains for thousands of individuals. This is a terrific resource because, as you'll learn from the papers in the background section, multi-omics molecular phenotypes can be tissue or cell type specific. Since disease pathology is also likely tissue specific (e.g. Alzheimer's disease (AD) and brain tissue), it makes the most sense to train prediction models on our brain multi-omics data and use them to map neurological disease associations.

Get some TWAS done

Here is a rough analysis outline:

[ ] Preprocess the multi-omics data for the molecular phenotypes of interest. Let's call it Dataset M. Let's start with gene expression, alternative splicing, chromatin accessibility and DNA methylation marks.
[x] For each molecular phenotype, train a TWAS prediction model and get weights for variants.
[ ] The weights learned from Dataset M can be used to infer molecular phenotypes for any other individuals. Say we have another dataset, Dataset G, with genotypes but no molecular phenotype data. We can then apply the effect sizes estimated from M to the genotypes in G to predict molecular phenotypes in G.
[ ] For each sample in G we also have its phenotype, e.g. AD status. So with the predicted molecular phenotypes we can test for association between molecular phenotype levels, such as gene expression, and AD status. We can then find the genes whose predicted molecular phenotypes differ between AD cases and controls.
[ ] Additionally, we can use the weights trained on our data with publicly available GWAS summary statistics to test for associations with neuropsychiatric traits. Of course this would be noisier than working with our own data, so we can perhaps table that effort for now.
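To make the train, predict and test steps of the outline concrete, here is a rough sketch on simulated data. Everything in it is an assumption for illustration: the sample sizes, the simulated effect sizes, and the use of plain ridge regression as a stand-in for whichever prediction model we end up choosing. It is not the project pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-ins (hypothetical sizes, not our real data)
n_m, n_g, p = 300, 1000, 20           # samples in M, samples in G, cis variants
true_w = np.zeros(p)
true_w[[2, 7]] = [0.6, -0.4]          # two variants truly affect expression


def simulate_genotypes(n):
    # Independent variants with allele frequency 0.3, coded 0/1/2
    return rng.binomial(2, 0.3, size=(n, p)).astype(float)


# Dataset M: genotypes plus a measured molecular phenotype (one gene)
geno_m = simulate_genotypes(n_m)
expr_m = geno_m @ true_w + rng.normal(0.0, 1.0, n_m)

# Dataset G: genotypes plus disease status, but no measured expression
geno_g = simulate_genotypes(n_g)
liability = 0.8 * (geno_g @ true_w) + rng.normal(0.0, 1.0, n_g)
status_g = (liability > np.median(liability)).astype(float)


def fit_ridge_weights(x, y, penalty=10.0):
    """Closed-form ridge regression on centered data (stand-in model)."""
    x_c = x - x.mean(axis=0)
    y_c = y - y.mean()
    gram = x_c.T @ x_c + penalty * np.eye(x.shape[1])
    return np.linalg.solve(gram, x_c.T @ y_c)


def association_t(x, y):
    """t statistic for the Pearson correlation between x and y."""
    r = np.corrcoef(x, y)[0, 1]
    n = len(x)
    return r * np.sqrt((n - 2) / (1.0 - r ** 2))


# Train weights for variants on Dataset M
weights = fit_ridge_weights(geno_m, expr_m)

# Apply the weights to Dataset G genotypes to predict expression
pred_expr_g = (geno_g - geno_g.mean(axis=0)) @ weights

# Test predicted expression against disease status in G
t_stat = association_t(pred_expr_g, status_g)
```

In a real analysis this would be repeated per gene (and per molecular phenotype), with covariates and a proper case/control test such as logistic regression in place of the simple correlation test used here.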

All analyses have to be made into nicely documented SoS pipelines. I have some code in a jump-start branch of this repository that you can use as your starting point. We can talk about that branch in our meeting.

The above analyses are just conventional TWAS, but it can be technically challenging to work with different molecular phenotypes. To name a few challenges I can think of now:

- Understanding what the molecular phenotypes are and their file formats
- What range of cis-regulatory genotypes to consider
- What adjustments we should make to the model to account for confounders, etc.
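For the cis range question, the mechanics are simple once a window is chosen. Here is a small sketch that selects variants within a fixed flank of a feature. The `Variant` record, the `cis_variants` helper and the 500 kb default are placeholders I made up; the right window per molecular phenotype is exactly what we need to find out from the papers.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int   # 1-based base-pair position
    vid: str   # variant identifier


def cis_variants(variants, chrom, start, end, flank=500_000):
    """Return variants on `chrom` within `flank` bp of the feature [start, end]."""
    lo = max(1, start - flank)
    hi = end + flank
    return [v for v in variants if v.chrom == chrom and lo <= v.pos <= hi]


# Made-up variants and feature coordinates, for illustration only
variants = [
    Variant("chr1", 400_000, "v1"),
    Variant("chr1", 1_200_000, "v2"),
    Variant("chr1", 1_900_000, "v3"),
    Variant("chr2", 1_200_000, "v4"),
]
selected = cis_variants(variants, "chr1", start=1_000_000, end=1_100_000)
```

Keeping the flank as a parameter makes it easy to try different windows for, say, gene expression versus methylation marks.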

When you read the reference papers, particularly the practical ones, you should keep these questions in mind and look for answers to them.

Do something novel

We can talk about these more novel analyses after the above are done:

1. Some molecular phenotypes might be related, or share some regulatory variants. Instead of estimating the weights one phenotype at a time, how about estimating them jointly?
2. How about using both the genotypes around an analysis unit (e.g. a gene) and the predicted molecular phenotypes to test for disease associations?
3. If we have enough data, how about trying to achieve 1 using approaches such as deep learning? (This needs literature research and prototyping.) I guess one obvious drawback is that it is not clear how to use summary statistics in this context.
