EngreitzLab / ENCODE_rE2G

Enhancer-Gene Prediction Pipeline based on logistic regression and ABC model

ENCODE-rE2G

:memo: Note: This repo is currently under development. To access the version used for the encode_re2g paper, go to this version. There are currently no clear instructions for stitching together the outputs from ABC, e2g features, and e2g, so use at your own discretion. We are working on a single clean pipeline for the future.

ENCODE-rE2G is a logistic regression pipeline built on top of ABC. Given a chromatin accessibility input file, it generates a list of enhancer-gene predictions. You can read the preprint here.

Set up

Clone the repo and set it up for submodule usage

git clone --recurse-submodules git@github.com:EngreitzLab/ENCODE_rE2G.git
git config --global submodule.recurse true

We use ABC as a submodule, so this command will initialize it and set up your git config to automatically keep the submodule up to date.

Apply a pretrained model

Which model you need depends on your input (e.g., DNase-seq or ATAC-seq? Do you have H3K27ac data?). We've pretrained all the models and determined, for each one, the score threshold that yields E-G links at 70% recall of CRISPR-validated E-G links.
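To make the thresholding idea concrete, here is a minimal sketch of how a score cutoff at a target recall could be derived from scored, CRISPR-labeled E-G pairs. It is not the pipeline's own code: the function names `threshold_at_recall` and `recall_at`, the use of `>=` when calling a link, and the toy data are all assumptions made for illustration.

```python
import math


def threshold_at_recall(scores, labels, target_recall=0.7):
    """Return the highest cutoff whose recall is at least target_recall.

    Hypothetical helper, not part of ENCODE-rE2G. A pair counts as a
    predicted link when score >= cutoff (a convention chosen for this sketch).
    """
    if not 0 < target_recall <= 1:
        raise ValueError("target_recall must be in (0, 1]")
    # Scores of the validated (positive) pairs, highest first.
    positive_scores = sorted(
        (s for s, is_pos in zip(scores, labels) if is_pos), reverse=True
    )
    if not positive_scores:
        raise ValueError("need at least one validated positive pair")
    # Number of positives that must be recovered; the small epsilon guards
    # against floating-point noise in target_recall * n.
    needed = max(1, math.ceil(target_recall * len(positive_scores) - 1e-9))
    return positive_scores[needed - 1]


def recall_at(scores, labels, cutoff):
    """Fraction of validated pairs with score >= cutoff."""
    positives = [s for s, is_pos in zip(scores, labels) if is_pos]
    return sum(s >= cutoff for s in positives) / len(positives)


# Toy data (made up): 10 scored pairs, 5 of them validated.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
cutoff = threshold_at_recall(toy_scores, toy_labels, target_recall=0.7)
```

With five validated pairs, 70% recall requires recovering four of them, so the cutoff lands on the fourth-highest positive score. Recall at a cutoff is usually somewhat above the target because only whole pairs can be recovered.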

Modify the ABC_BIOSAMPLES field in config/config.yaml to point to your ABC config. Read more about ABC config here.

Activate a conda environment that has mamba installed.

mamba env create -f workflow/envs/encode_re2g.yml
conda activate encode_re2g
snakemake -j1 --use-conda

Based on your biosample config, the pipeline selects the appropriate model. If no pretrained model exists for that configuration, an exception is raised.

Output will appear in the results/ directory.

Supported Models

We have pre-trained ENCODE-rE2G on certain model types. You can find them in the models directory. Each model must have the following:

  1. model pickle file (model.pkl corresponding to model_full.pkl from the model training workflow)
  2. feature table file (feature_table.tsv, the corresponding feature table file from model training)
  3. threshold file (threshold_0.XXX, where predictions with a score greater than 0.XXX are binarized as true links)
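The layout above can be checked and used with a few lines of Python. The following is a rough sketch, assuming the threshold file is named exactly `threshold_<value>` with no extension, as the list describes; the helper names `missing_model_files`, `read_threshold`, and `binarize` are hypothetical and not part of the pipeline.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED_FILES = ("model.pkl", "feature_table.tsv")


def missing_model_files(model_dir):
    """Return the required file names that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (model_dir / name).is_file()]


def read_threshold(model_dir):
    """Parse the score cutoff from the threshold_<value> file name.

    Assumes exactly one such file exists and that everything after
    "threshold_" is a number.
    """
    matches = sorted(Path(model_dir).glob("threshold_*"))
    if len(matches) != 1:
        raise ValueError(f"expected one threshold file, found {len(matches)}")
    return float(matches[0].name.split("threshold_", 1)[1])


def binarize(scores, threshold):
    """Mark a prediction as a true link when its score exceeds the threshold."""
    return [score > threshold for score in scores]
```

For example, `binarize(scores, read_threshold("models/<your_model>"))` would turn a list of scores into true/false link calls, where `models/<your_model>` stands in for whichever model directory you use.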

The way we choose the model depends on the biosamples input. The code for model selection can be found here.

To override default model selection and specify a different model (either one you've trained yourself or the extended model), add a column called model_dir to your biosample config. Multiple model directories can be specified as a comma-separated list. NOTE: The genome-wide feature tables needed to reproduce the ENCODE-rE2G_Extended model are included in the prediction files on Synapse.org for K562 and GM12878. To use them, download the feature tables and remove the ".Feature" suffix from the feature name columns.
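As an illustration of the model_dir override, the sketch below adds that column to a biosample table. It assumes the biosample config is a tab-separated file with a `biosample` column identifying each row; that layout, the helper name `add_model_dir`, and the example paths are assumptions, so adapt them to your actual ABC config.

```python
import csv
import io


def add_model_dir(tsv_text, model_dirs_by_biosample):
    """Return tsv_text with a model_dir column filled from the mapping.

    Hypothetical helper. Rows without an entry in the mapping keep their
    existing model_dir value, or get an empty cell; how the pipeline treats
    an empty cell is not something this sketch verifies.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fieldnames = list(reader.fieldnames)
    if "model_dir" not in fieldnames:
        fieldnames.append("model_dir")

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        dirs = model_dirs_by_biosample.get(row["biosample"])
        if dirs:
            # Multiple directories go in as a comma-separated list.
            row["model_dir"] = ",".join(dirs)
        else:
            row["model_dir"] = row.get("model_dir") or ""
        writer.writerow(row)
    return out.getvalue()


# Toy table with made-up file names.
example = "biosample\tATAC\nK562\tk562.bam\nGM12878\tgm.bam\n"
updated = add_model_dir(example, {"K562": ["models/a", "models/b"]})
```

Reading the file with `Path(...).read_text()` and writing the returned string back out would apply the same change on disk.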

Train model

Important: Only train models for biosamples matching the corresponding CRISPR data (in this case, K562)

Modify config/config_training.yaml with your model and dataset configs

Activate a conda environment that has mamba installed.

mamba env create -f workflow/envs/encode_re2g.yml 
conda activate encode_re2g 
snakemake -s workflow/Snakefile_training -j1 --use-conda

Output

results/{biosample_name}/{model_name}/model_name/model_full.pkl: full model trained on all chromosomes
results/{biosample_name}/{model_name}/model/training_predictions.tsv: rE2G predictions on CRISPR training data, using leave 1 chromosome out models
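The training predictions are described as coming from leave-one-chromosome-out models: each chromosome's pairs are scored by a model fitted on all other chromosomes. As a rough illustration of that cross-validation scheme (not the training workflow's actual code), the sketch below builds the train/test index splits; the function name and toy input are hypothetical.

```python
from collections import defaultdict


def leave_one_chromosome_out(chromosomes):
    """Yield (held_out, train_indices, test_indices) for each chromosome.

    chromosomes is a sequence with one chromosome label per training row.
    Folds are yielded in order of each chromosome's first appearance.
    """
    rows_by_chrom = defaultdict(list)
    for index, chrom in enumerate(chromosomes):
        rows_by_chrom[chrom].append(index)

    for held_out, test_indices in rows_by_chrom.items():
        # Train on every row that is not on the held-out chromosome.
        train_indices = [
            i for i, chrom in enumerate(chromosomes) if chrom != held_out
        ]
        yield held_out, train_indices, test_indices


# Toy input: four training rows on three chromosomes.
folds = list(leave_one_chromosome_out(["chr1", "chr1", "chr2", "chrX"]))
```

In a real run, a model would be fitted on `train_indices` and used to score `test_indices` for each fold, so every row receives a prediction from a model that never saw its chromosome.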