DLeirer / PHD_ML_GAP_2017


title: "README_diff_expression_Project"
author: "DJL"
date: "15/09/2016"
output:
  word_document: default
  pdf_document:
    toc: yes
  html_document:
    toc: yes
    toc_float: yes

Project Overview

This project is part of Daniel Leirer's PhD. I work primarily with the Genes and Psychosis (GAP) data from the IoPPN. The aim is to create a classifier for psychosis.

Folder Structure

├┬─ P0_Characterise/
│├── figs/
│└── output/
├┬─ P1_Hypothesis_Free/
│├── figs/
│└── output/
├┬─ P2_Hypothesis_Driven/
│├── figs/
│└── output/
├┬─ P3_Enviromental/
│├── figs/
│└── output/
├┬─ P4_Hybrid/
│├── figs/
│└── output/
├┬─ P11_glmnet_bootstrap/
│├── figs/
│├── Enrichment/
│└── output/
├┬─ P51_GAP_2way/
│├── figs/
│└── output/
├── R/
├── data/
├── enviroment_backup/
└── doc/


Directory Function:
root directory = contains README and project files for version control.
P0-PXX = subproject directories.
figs = contains figures for the subproject.
output = contains output files from the subproject.
R = contains scripts.
data = contains all data, except large data frames, which are uploaded to Google Drive.
enviroment_backup = folder for backing up the environment to avoid data loss.
doc = contains paper, lab book and other documentation.

Subproject 0: Characterise

This subproject checks that the data are suitable for later steps by giving an overview of the demographics and the data available. CellMix is also run at this stage so that potential confounding from cell proportions is caught early. Covariates are then regressed out using a linear model.
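As an illustration of what "regressing out" a covariate means, here is a minimal sketch in Python. It is not the project's own code (the project scripts are R Markdown), and the function name `residualise` and the example values are hypothetical. It fits an ordinary least squares model with an intercept and returns the residuals, which are the covariate-adjusted values.

```python
def residualise(values, covariates):
    """Return OLS residuals of `values` after regressing on the covariates plus an intercept."""
    n = len(values)
    # Design matrix columns: an intercept followed by each covariate.
    columns = [[1.0] * n] + [[float(x) for x in col] for col in covariates]

    # Build an orthonormal basis for the column space (modified Gram-Schmidt).
    basis = []
    for col in columns:
        v = col[:]
        for q in basis:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-12:  # skip columns that are linear combinations of earlier ones
            basis.append([a / norm for a in v])

    # Subtract the projection onto that basis; what is left is the residual.
    resid = [float(y) for y in values]
    for q in basis:
        dot = sum(a * b for a, b in zip(resid, q))
        resid = [a - dot * b for a, b in zip(resid, q)]
    return resid


# Hypothetical values for one probe and one numeric covariate.
expression = [7.1, 7.9, 9.2, 9.8, 11.1]
age = [20, 25, 30, 35, 40]
adjusted = residualise(expression, [age])
```

The residuals sum to zero and are uncorrelated with each covariate, so they can be passed to later steps in place of the raw values. Categorical covariates would need to be dummy-coded first.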


  1. p0_CellMix_GAP_FEP_data_07_02_2017.Rmd
    • Apply CellMix to identify potential confounders from cell proportions.
  2. p0_characterise_GAP_FEP_data_07_02_2017.Rmd
    • Characterise cohort. Plot demographics, make tables, do stats.
  3. p0_Split_Data_GAP_FEP_data_07_02_2017.Rmd
    • Define the train/test split (80-20) used by all subprojects, making sure it is balanced on sex, age and ethnicity.
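
One way to make an 80-20 split that respects sex, age and ethnicity is to split within each combination of those variables. The sketch below is a Python illustration under stated assumptions, not the contents of the split script: the function names, the age cut-off of 30, the seed, and the sample records are all hypothetical.

```python
import random
from collections import defaultdict
from itertools import product


def stratified_split(samples, stratum_of, train_fraction=0.8, seed=1):
    """Split samples into train and test sets, separately within each stratum."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    strata = defaultdict(list)
    for sample in samples:
        strata[stratum_of(sample)].append(sample)

    train, test = [], []
    for label in sorted(strata):
        members = strata[label][:]
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


def stratum_of(sample):
    """Stratum label: sex, a coarse age band (hypothetical cut-off), ethnicity."""
    age_band = "under_30" if sample["age"] < 30 else "30_plus"
    return (sample["sex"], age_band, sample["ethnicity"])


# Hypothetical cohort: 5 samples in each of 8 strata.
samples = [
    {"id": f"S{i}", "sex": sex, "age": age, "ethnicity": eth}
    for i, (sex, age, eth, _) in enumerate(
        product(["F", "M"], [25, 40], ["A", "B"], range(5))
    )
]
train_set, test_set = stratified_split(samples, stratum_of)
```

Very small strata may end up entirely in the training set after rounding, so it is worth checking stratum sizes before relying on the split.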


Subproject 1: Hypothesis Free

Differential expression analysis, adjusting for ethnicity, age and gender, using limma.

Input: LumiBatch object.


  1. p1_1_GAP_Feature_selection_07_02_2017.Rmd
    • Feature Selection
  2. p1_2_GAP_Machine_learning_07_02_2017.Rmd
    • Script that screens for best model.
    • Script also plots the results.
  3. p1_3_GAP_Tuning_07_02_2017.Rmd
    • Script that tunes best 3 models.
    • Validate in test data.
  4. p1_4_GAP_Variables_test_07_02_2017.Rmd
    • Check classified samples by PANSS, sex, ethnicity, age, tobacco, ICD10, medication and PRS.
  5. p1_5_GAP_Enrichment_test_07_02_2017.Rmd
    • Find most important genes.
    • Check for enrichment in core genes.
  6. p1_6_GAP_Boosting_07_02_2017.Rmd
    • Use boosting.
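
The enrichment check in step 5 can be illustrated with a one-sided hypergeometric test on the overlap between the most important genes and a reference gene list. The README does not state which test the script uses, so the choice of test here is an assumption, as are the function name and the example numbers (other than 1796, the size of the Purcell list described further down).

```python
from math import comb


def hypergeom_enrichment_p(universe_size, set_size, selected_size, overlap):
    """P(X >= overlap) when `selected_size` genes are drawn without replacement
    from `universe_size` genes, of which `set_size` belong to the reference list."""
    upper = min(set_size, selected_size)
    total = comb(universe_size, selected_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, selected_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 20000 genes tested, 1796 in the reference list,
# 100 top-ranked genes selected, 20 of which are in the reference list.
p_value = hypergeom_enrichment_p(20000, 1796, 100, 20)
```

A small p-value suggests that the selected genes overlap the reference list more than expected by chance. The universe should be the set of genes that could have been selected, for example the probes passed to feature selection.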


Subproject 2: Hypothesis Driven

Differential expression analysis, adjusting for ethnicity, age and gender, using limma.

Input: LumiBatch object.


  1. p2_1_HD_GAP_Feature_selection_07_02_2017.Rmd
    • Feature Selection based on Purcell List.
  2. p2_2_HD_Machine_learning_07_02_2017.Rmd
    • Script that screens for best model.
    • Script also plots the results.
    • Script that tunes best 3 models.
    • Validate in test data.
    • Check classified samples by PANSS, sex, ethnicity, age, tobacco, ICD10, medication and PRS.
    • Find most important genes.
    • Check for enrichment in core genes.
  3. p2_6_HD_Boosting_07_02_2017.Rmd
    • Use boosting.

File template

File: **


Source file name: **


Genes and Psychosis (GAP)

The main dataset is the Genes and Psychosis data, which is internal to the IoPPN.
The following files are associated with this data.

File: Full Gene Expression Object
Description: This file contains a LumiBatch object with all probes deemed expressed. It comes from the gene expression pre-processing pipeline. Most probes here are defined as not expressed. It contains First Episode GAP samples, processed using background correction, log2 transformation and robust spline normalisation. The following technical variables have been regressed out using a linear model:
ConcNanodrop, Dateout, concentrationoflabelledcRNA, DatecRNApurification

Source file name:
Daniel Leirer

The following data contain the polygenic risk scores from the GAP SNP data.

Polygenic risk score data.
The following files are associated with this data.

File: Full Gene Expression Object
Description: This file contains the sample identifiers and the corresponding polygenic risk scores.

Source file name:
Evangelos Vassos

The following files are the sources for various parts of the data in the LumiBatch object.

File: Demographic Data GAP

Data approved by Marta Di Forti. Contains gene expression data, demographics etc. Age, Sex, Phenotype etc.

Source file name: GAP_full_final_expression_database_22_04_2015_Dan_Marta_consent.csv

Daniel Leirer created this document. Information from various sources within GAP.

File: GAP Master Database

Large database containing a lot of information, including PANSS data. This is a secondary database.

Source file name: Master_database_GAP_UPDATE_16_Oct_2014.sav

GAP team. Contact Robin Murray, Marta Di Forti, or people working in the Psychosis department.

File: Medication Data

Medication Data, Weight, Smoking, Some Demographics.

Diego Quattrone compiled this file. diego.quattrone@kcl.ac.uk

File: Pirooznia_enrichment_categories

List of 1796 genes compiled by Purcell et al.

Purcell et al. A polygenic burden of rare disruptive mutations in schizophrenia. Nature, 2014. PMID: 24463508.


The following people are involved in this project:

Name: Daniel Leirer Role: PhD Student Email: daniel.leirer@kcl.ac.uk

Name: Dr. Stephen Newhouse Role: Main Supervisor Email: stephen.j.newhouse@gmail.com

Name: Professor Richard Dobson Role: Primary Supervisor Email: richard.j.dobson@kcl.ac.uk

Name: Sir Professor Robin Murray Role: Clinical Supervisor Email: N/A