:memo: Note: This repo is currently undergoing development. To access the version using for the encode_re2g paper, go to this version. There are currently no clear instructions for stitching together the outputs from ABC, e2g features, and e2g, so use at your own discretion. We are working on creating 1 clean pipeline for the future
ENCODE-rE2G is a logistic regression pipeline built on top of ABC. Given a chromatin accessibility input file, it will generate a list of enhancer-gene predictions. You can read the preprint paper here
Clone the repo and set it up for submodule usage
git clone --recurse-submodules git@github.com:EngreitzLab/ENCODE_rE2G.git
git config --global submodule.recurse true
We use ABC
as a submodule, so this command will initialize it and set up your git config to automatically keep the submodule up to date.
You'll need to use a certain model based on your input. (e.g DNase-seq or ATAC-seq? Do you have H3K27ac data?) We've pretrained all the models and determined the right thresholding to get E-G links at 70% recall of a CRISPR-validated E-G links.
Modify the ABC_BIOSAMPLES
field in config/config.yaml
to point to your ABC config. Read more about ABC config here.
external_features_config
in your biosamples_config.
See "Train model" section for details on this file.Activate a conda environment that has mamba installed.
mamba env create -f workflow/envs/encode_re2g.yml
conda activate encode_re2g
snakemake -j1 --use-conda
Based on your biosample config, we will find the right model to use for you. If we haven't trained that model before, an exception will get raised.
Output will show up in the results/
directory
results/{biosample_name}/{model_name}/encode_e2g_predictions_threshold.{threshold}.tsv.gz
results/{biosample_name}/{model_name}/encode_e2g_predictions.tsv.gz
with the score in a column named "ENCODE-rE2G.Score".We have pre-trained ENCODE-rE2G on certain model types. You can find them in the models
directory.
Each model must have the following:
model.pkl
corresponding to model_full.pkl
from the model training workflow)feature_table.tsv
, the corresponding feature table file from model training)threshold_0.XXX
where predictions with a score greater than 0.XXX are binarized as true links.The way we choose the model depends on the biosamples input. The code for model selection can be found here.
To override default model selection and specify a different model (either one you've trained yourself or the extended model), add a column called model_dir
to your biosample config. Multiple model directories can be specified as a comma-separated list. NOTE: The genome-wide feature tables to reproduce the ENCODE-rE2G_Extended model included in the prediction files on Synapse.org for K562 and GM12878. To use these feature tables, download the feature tables and remove the ".Feature" suffix from feature name columns.
Important: Only train models for biosamples matching the corresponding CRISPR data (in this case, K562)
Modify config/config_training.yaml
with your model and dataset configs
model_config
has columns: model, dataset, ABC_directory, feature_table, polynomial (do you want to use polynomial features?), and override_params (are there model training parameters you would like to change from the default logistic regression settings specfied in config/config_training.yaml
?)
model_config
for how to specfiy override parameters. If there are no override_params, leave the column blank but still include the header.resources/feature_tables
) with columns: feature (name in final table), input_col (name in ABC output), second_input (multiplied by input_col if provided), aggregate_function (how to combine feature values when a CRISPR element overlaps more than one ABC element), fill_value (how to replace NAs), nice_name (used when plotting)dataset_config
is an ABC biosamples config to generate ABC predictions for datasets without an existing ABC directory. dataset_config
equals "dataset" in model_config
. If no ABC_directory is indicated in model_config
, it must have an entry in dataset_config
.reference_features
in config/config_training/yaml
), you must also define how to add these values with an external_features_config, which you include in dataset_config
in the optional column external_features_config:
external_features_config
has columns feature (corresponding to the missing input_col or second_input value), source_col (column name in the source file), aggregate_function (how to combine values when merging different element definitions), join_by, and source_fileActivate a conda environment that has mamba installed.
mamba env create -f workflow/envs/encode_re2g.yml
conda activate encode_re2g
snakemake -s workflow/Snakefile_training -j1 --use-conda
results/{biosample_name}/{model_name}/model_name/model_full.pkl
: full model trained on all chromosomes
results/{biosample_name}/{model_name}/model/training_predictions.tsv
: rE2G predictions on CRISPR training data, using leave 1 chromosome out models