TOBIAS or "Transcription factor Occupancy prediction By Investigation of ATAC-seq Signal" is a framework of tools for investigating transcription factor binding from ATAC-seq signal. The analysis involves many sequential steps (tasks) that must be executed in order to predict TF occupancy footprints from the deduplicated alignment BAM files derived from raw ATAC-seq data (fastq files). Here we use Snakemake to automate the sequential execution on any HPC. Most tools used by the pipeline are fully containerized as Docker images and can be invoked using Singularity on the HPC. The minimum requirements for running this pipeline are:
This pipeline was built using the CCBR_SnakemakePipelineCookiecutter.
Please visit the following pages for more details directly from the authors of TOBIAS:
Various versions of the pipeline have been checked out at /data/CCBR_Pipeliner/Pipelines/CCBR_tobias on biowulf. You can get help about running the pipeline using:
% bash /data/CCBR_Pipeliner/Pipelines/CCBR_tobias/v0.2/run_tobias.bash --help
Pipeline Dir: /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CCBR_tobias/v0.2
Git Commit/Tag: 6c8726023269ace0fd8fe886a1213859b363f9fd v0.2
/data/CCBR_Pipeliner/Pipelines/CCBR_tobias/v0.2/run_tobias.bash: run CCBR TOBIAS workflow for ATAC seq data
USAGE:
bash /data/CCBR_Pipeliner/Pipelines/CCBR_tobias/v0.2/run_tobias.bash -m/--runmode=<MODE> -w/--workdir=<path_to_workdir>
Required Arguments:
1. RUNMODE: [Type: String] Valid options:
*) init : initialize workdir
*) run : run with slurm
*) reset : DELETE workdir dir and re-init it
*) dryrun : dry run snakemake to generate DAG
*) unlock : unlock workdir if locked by snakemake
*) runlocal : run without submitting to sbatch
2. WORKDIR: [Type: String]: Absolute or relative path to the output folder with write permissions.
The pipeline requires only 2 arguments:
Generally, we anticipate CCBR_tobias to be run in 3 steps:
% bash /data/CCBR_Pipeliner/Pipelines/CCBR_tobias/dev/run_tobias.bash -m=init -w=/path/to/outfolder
This creates the output folder, so the folder should not exist before running init. Along with other scripts and files, init copies config.yaml and cluster.json to the output folder, where the user can then edit them. Some key input values that need to be edited before running the pipeline are as follows:
data: points to the CCBR_ATACseq dedup.bam replicate files per sample. The sample names should match those later used in contrasts.
contrasts: which contrasts to perform using TOBIAS. The 2 groups in each contrast should already be defined under data.
peaks: regions of interest to query for differential footprinting. This should be manually curated before running the CCBR_tobias pipeline.
genome: currently supports mm10 for mouse with Gencode M21 annotation and hg38 for human with Gencode v30 annotation.
motifs: motif database to use for analysis. The choices are:
database | organism | version |
---|---|---|
HOCOMOCO_v11 | Human | Core |
HOCOMOCO_v11 | Human | Full |
HOCOMOCO_v11 | Mouse | Core |
HOCOMOCO_v11 | Mouse | Full |
HOCOMOCO_v11 | Human+Mouse | Core |
HOCOMOCO_v11 | Human+Mouse | Full |
JASPAR2020 | - | core_nonredundant |
JASPAR2020 | - | core_redundant |
JASPAR2020 | vertebrate | core_nonredundant |
JASPAR2020 | vertebrate | core_redundant |
% bash /data/CCBR_Pipeliner/Pipelines/CCBR_tobias/dev/run_tobias.bash -m=dryrun -w=/path/to/outfolder
Running the above command checks the config.yaml file and makes sure that we have appropriate permissions to the input files and output locations. It then runs Snakemake in dry-run mode using the cluster.json to list a table of rules/tasks to be run. After successfully running dryrun, the user can run the same command with the -m=run option to submit jobs to the slurm job scheduler on biowulf. By default, the norm partition is used for running jobs, but that and other job parameters can be changed by editing the cluster.json file in the output folder.
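The init, dryrun and run sequence described above can be sketched as a small wrapper. This is an illustration only: the function names `tobias_cmd` and `plan_tobias` are hypothetical, the script path is the v0.2 path taken from the help output, and the commands are printed rather than executed so that they can be reviewed first.

```shell
#!/usr/bin/env bash
# Sketch only: prints the three-step command sequence instead of running it.
# RUN_TOBIAS defaults to the v0.2 path shown in the help output; adjust as needed.
RUN_TOBIAS="${RUN_TOBIAS:-/data/CCBR_Pipeliner/Pipelines/CCBR_tobias/v0.2/run_tobias.bash}"

# Build one run_tobias.bash command line for a given run mode and workdir.
# Rejects anything that is not one of the run modes listed in the help output.
tobias_cmd() {
    local mode="$1" workdir="$2"
    case "$mode" in
        init|run|reset|dryrun|unlock|runlocal) ;;
        *) echo "invalid run mode: $mode" >&2; return 1 ;;
    esac
    printf 'bash %s -m=%s -w=%s\n' "$RUN_TOBIAS" "$mode" "$workdir"
}

# Print the init -> dryrun -> run sequence for a new workdir.
# init creates the folder, so refuse to continue if it already exists.
plan_tobias() {
    local workdir="$1"
    if [ -e "$workdir" ]; then
        echo "workdir already exists: $workdir" >&2
        return 1
    fi
    tobias_cmd init "$workdir"
    echo "# edit $workdir/config.yaml and $workdir/cluster.json, then:"
    tobias_cmd dryrun "$workdir"
    tobias_cmd run "$workdir"
}
```

Calling `plan_tobias /path/to/outfolder` prints the three commands in order, with a reminder to edit the two config files between init and dryrun. Printing instead of executing keeps the sketch safe to try anywhere; on biowulf the printed lines can be run one at a time.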
The following folders are expected upon successful completion.
Individual replicate alignment BAMs are merged together and pre-sorted. This folder contains the merged BAMs.
The merged BAMs are converted to normalized bigwigs for visualization with IGV. The bigwigs can be found here.
The merged BAMs from the bams folder are corrected for Tn5 insertion bias. 4 separate bigwigs are expected as output on a per-condition basis:
Using the bias-corrected bigwig, a per-condition footprinting bigwig is created, limited to the "regions of interest" defined by the peaks parameter in config.yaml.
Supplied peaks are annotated using UROPA, and the annotations are stored here.
One TFBS folder is created for each contrast. These are created by running bindetect. Each TFBS folder contains numerous (100s) subfolders, one for each motif in the motif database selected using the motifs parameter in config.yaml. Each of these per-TF-motif subfolders also has a standard folder structure, including a subfolder named beds. This contains:
peaks
parameter in config.yaml
For more details, see https://github.com/loosolab/TOBIAS/wiki/BINDetect
Caution: This folder has a large disk footprint. Each contrast produces roughly 40-60 GB of files. Hence, only run those contrasts that are interesting. DO NOT RUN ALL JUST BECAUSE YOU CAN!
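Given this size, it can help to estimate the space needed before deciding how many contrasts to list. A rough sketch, assuming the 40-60 GB per-contrast range quoted above holds for your data; the function name `estimate_tfbs_gb` is hypothetical:

```shell
# Rough disk estimate for the TFBS folders, using the 40-60 GB per contrast
# range quoted above (an approximation, not a measured value for your data).
estimate_tfbs_gb() {
    local n_contrasts="$1"
    echo "$((n_contrasts * 40))-$((n_contrasts * 60)) GB"
}

estimate_tfbs_gb 3   # prints: 120-180 GB
```

Comparing the result against the free space reported by `df -h` for the output location gives a quick sanity check before submitting jobs.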
This folder also contains:
which are the key results for this contrast as a table and as plots.
The "bound" BED files for all the TF motifs considered are concatenated together and reported here as 2 sorted and indexed bed files. Because they are indexed, they can be easily loaded in an IGV session for visual inspection.
TF-TF binding networks are created with TOBIAS CreateNetwork for the first condition in each contrast.
An adjacency matrix and a list of edges are reported individually for each TF motif and summarized overall for each network.