DLeirer / PHD_ML_GAP_2017

2 stars 0 forks source link

title: "README_diff_expression_Project" author: "DJL" date: "15/09/2016" output: word_document: default pdf_document: toc: yes html_document: toc: yes toc_float: yes

Project Overview

This Project is part of Daniel Leirer's PhD. I work primarily with the Genes and Psychosis data from the IoPPN. The aim is to create a classifer for Psychosis.

Folder Structure

proj/
├── P0_Characterise/
│├── figs/
│└── output/
├┬─ P1_Hypothesis_Free/
│├── figs/
│└── output/
├┬─ P2_Hypothesis_Driven/
│├── figs/
│└── output/
├┬─ P3_Enviromental/
│├── figs/
│└── output/
├┬─ P4_Hybrid/
│├── figs/
│└── output/
├┬─ P11_glmnet_bootstrap/
│├── figs/
│├── Enrichment/
│└── output/
├┬─ P51_GAP_2way/
│├── figs/
│└── output/
├── R/
├── data/ ├── enviroment_backup/
└── doc/


┼┘┐┌└├┤┴┬│─

Directory Function:
root directory = contains README and Project files for version control.
P0-PXX = subproject directories
figs = contains figures for subproject
output = contains output files from subproject.
R = Contains Scripts
data = contains all data (with exception of large data frames. Uploaded on Google Drive.)
enviroment_backup = folder for backing up enviroment to avoid dataloss. doc = contains paper, labbook and other documentation.

Subproject 0: Characterise

Aim:
To make sure data is suitable for future steps, by giving overview of demographics and data available. Also Cell Mix is performed to make sure everything is okay down the line. Further we regress out Covariates at this step using a linear model.

Script_Strategy:

  1. p0_CellMix_GAP_FEP_data_07_02_2017.Rmd
    • apply cellmix to identify poetential confounders from cell proportions.
  2. p0_characterise_GAP_FEP_data_07_02_2017.Rmd
    • Characterise cohort. Plot demographics, make tables, do stats.
  3. p0_Split_Data_GAP_FEP_data_07_02_2017.Rmd
    • Define split for all subproject (80-20). make sure to do this using Sex, Age, Ethnicty.

Output:

Subproject 1: Hypothesis Free

Aims:
Differential Expression adjusting for Ethnciity Age and Gender using Limma.

Input:
LumiBatch object.

Script_Strategy:

  1. p1_1_GAP_Feature_selection_07_02_2017.Rmd
    • Feature Selection
  2. p1_2_GAP_Machine_learning_07_02_2017.Rmd
    • Script that screens for best model.
    • Script also plots
  3. p1_3_GAP_Tuning_07_02_2017.Rmd
    • Script that tunes best 3 models.
    • Validate in test data.
  4. p1_4_GAP_Variables_test_07_02_2017.Rmd
    • Check classied samples by PANSS, sex, ethnicty, age, tobacco, ICD10, medication and PRS.
  5. p1_5_GAP_Enrichment_test_07_02_2017.Rmd
    • Find most important genes.
    • Check for enrichment in core genes.
  6. p1_6_GAP_Boosting_07_02_2017.Rmd
    • Use boosting.

Output:

Subproject 2: Hypothesis Driven

Aims:
Differential Expression adjusting for Ethnciity Age and Gender using Limma.

Input:
LumiBatch object.

Script_Strategy:

  1. p2_1_HD_GAP_Feature_selection_07_02_2017.Rmd
    • Feature Selection based on Purcell List.
  2. p2_2_HD_Machine_learning_07_02_2017.Rmd
    • Script that screens for best model.
    • Script also plots
    • Script that tunes best 3 models.
    • Validate in test data.
    • Check classied samples by PANSS, sex, ethnicty, age, tobacco, ICD10, medication and PRS.
    • Find most important genes.
    • Check for enrichment in core genes.
  3. p2_6_HD_Boosting_07_02_2017.Rmd
    • Use boosting.

File template

File: **

Description:

Source file name: **

Source:

Genes and Psychosis (GAP)

The Main Dataset is the Genes and Psychosis Data internal to the IoPPN.
The following files are associated to this Data.

File: Full Gene Expression Object
GAP_FEP_Full_Gene_Expression_Data_Linear.RData
Description: This file contains a lumibatch object with all probes deemed expressed. It is from the gene expression pre processing pipeline. Most probes here are defined as not expressed. It is First Episode Gap Samples, processed using background correction, log 2 transformed, robust spline normalisation. The following tech variables have been regressed out using a linear model:
ConcNanodrop, Dateout, concentrationoflabelledcRNA, DatecRNApurification

Source file name:
FINAL_GAP_DL_FEP.eset_bg_log2_rsn_SVA_Good.RData
Source:
Daniel Leirer
Daniel.Leirer@kcl.ac.uk

The following data contains the Polygenic Risk scores from gap snp data

Polygenic risk score data.
The following files are associated to this Data.

File: Full Gene Expression Object
GAP_FEP_Polygenic_Risk_scores.csv
Description: This file contains the identifies of samples and corresponding polygenic risk scores.

Source file name:
GAPsamples_Daniel_leirer_Polygenic_Risk_scores.csv
Source:
Evangelos Vassos
evangelos.vassos@kcl.ac.uk

The following files are sources for various parts of the data in the Lumibatch Object

File: Demographic Data GAP
Basic_Demographics_GAP.csv

Description:
Data approved by Marta Di Forti. Contains gene expression data, demographics etc. Age, Sex, Phenotype etc.

Source file name: GAP_full_final_expression_database_22_04_2015_Dan_Marta_consent.csv

Source:
Daniel Leirer created this document. Information from various sources within GAP.

File: GAP Master Database
GAP_large_demographic_database_16_Oct_2014.sav

Description:
Huge database containing a lot of information including PANSS data. This is a secordary database.

Source file name: Master_database_GAP_UPDATE_16_Oct_2014.sav

Source:
GAP team. Contact Robin Murray, Marta Di Forti, or people working in the Psychosis department.

File: Medication Data
Daniel_RNA_DQ.csv

Description:
Medication Data, Weight, Smoking, Some Demographics.

Source:
Diego Quattrone compiled this file. diego.quattrone@kcl.ac.uk

File: Pirooznia_enrichment_categories
Hypothesis_driven_source_Purcell_2014.csv

Description:
1796 genes list compiled by Purcell et al.

Source:
A polygenic burden of rare disruptive mutations in schizophrenia, 2014, Nature Purcell et al.
PMID: 24463508.
http://www.nature.com/nature/journal/v506/n7487/full/nature12975.html#tables

Acknoledgements

The Following people are involved in this project

Name: Daniel Leirer Role: PhD Student Email: daniel.leirer@kcl.ac.uk

Name: Dr. Stephen Newhouse Role: Main Supervisor Email: stephen.j.newhouse@gmail.com

Name: Professor Richard Dobson Role: Primary Supervisor Email: richard.j.dobson@kcl.ac.uk

Name: Sir Professor Robin Murray Role: Clinical Supervisor Email: N/A