Streit-lab/enhancer_annotation_and_motif_analysis

Introduction

Streit-lab/enhancer_annotation_and_motif_analysis is a bioinformatics analysis pipeline for identifying enhancers associated with genes of interest and screening them for motif binding sites.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a portable, reproducible manner.

Pipeline summary

Conditionally unzip genome (--fasta) and GTF/GFF (--gtf or --gff) files
Index genome in order to retrieve chromosome lengths
Filter genes of interest (--gene_ids) from GTF/GFF and filter gene biotype entries in GTF/GFF
Conditionally extend peaks (--peaks_bed) by a given number of bases (--extend_peaks)
Assign TSS to peaks, using one of two strategies:

a) If CTCF sites are provided (--ctcf), assign TSS to peaks when they fall within the CTCF sites flanking the peak of interest:
   1. For each peak, retrieve the nearest CTCF sites upstream (CTCF start site) and downstream (CTCF end site)
   2. Sort flanking CTCF coordinates
   3. Annotate peaks to TSS within flanking CTCF sites
b) Otherwise, assign TSS to peaks when they fall within a fixed window (--enhancer_window) of the peak of interest
Retrieve filtered peak fasta sequences
Calculate background base frequencies for motif screening
Identify motif binding sites in peaks (fimo)
Annotate peak-motif file with nearby genes
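
The final annotation step can be pictured as a join between motif hits and the peak-to-gene assignments. The sketch below is illustrative only and is not taken from the pipeline's code: the function name, the `peak_id`, `motif_id` and `genes` field names, and the comma-separated gene list are all assumptions.

```python
def annotate_motif_hits(motif_hits, genes_by_peak):
    """Attach the genes assigned to each peak to every motif hit in that peak.

    motif_hits: list of dicts, each with at least a "peak_id" key (hypothetical layout).
    genes_by_peak: dict mapping peak ID to a list of gene names (hypothetical layout).
    """
    annotated = []
    for hit in motif_hits:
        # Peaks without an assigned gene get an empty string rather than being dropped.
        genes = genes_by_peak.get(hit["peak_id"], [])
        annotated.append({**hit, "genes": ",".join(genes)})
    return annotated


motif_hits = [
    {"peak_id": "peak_1", "motif_id": "MA0143.4"},
    {"peak_id": "peak_2", "motif_id": "MA0143.4"},
]
genes_by_peak = {"peak_1": ["SOX2", "SIX1"]}
annotated = annotate_motif_hits(motif_hits, genes_by_peak)
```

Whether unannotated peaks are kept or dropped in the real output is not stated here; the sketch keeps them.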

Quick Start

Install Nextflow (>=22.10.3)
Install either Docker or Singularity (you can follow this tutorial).

Download the pipeline

nextflow pull Streit-lab/enhancer_annotation_and_motif_analysis

Test the pipeline on a minimal dataset with a single command:

nextflow run Streit-lab/enhancer_annotation_and_motif_analysis \
  -r main \
  -profile test,docker \
  --outdir output

Start running your own analysis!
- A typical command for running the pipeline on your own data:
```
nextflow run Streit-lab/enhancer_annotation_and_motif_analysis \
  -r main \
  --fasta <FASTA_PATH_OR_URL> \
  --gtf <GTF_PATH_OR_URL> \
  --peaks_bed <PEAK_BED_FILE> \
  -profile <docker/singularity/conda>
```
- The pipeline comes with config profiles called docker, singularity and conda which instruct the pipeline to use the named tool for software management. For example, -profile test,docker.
- If you are using singularity, use the nf-core download command to download the images before running the pipeline. Setting the NXF_SINGULARITY_CACHEDIR environment variable or the singularity.cacheDir Nextflow option lets you store and re-use the images from a central location for future pipeline runs.
- If you are using conda, it is highly recommended to use the NXF_CONDA_CACHEDIR or conda.cacheDir settings to store the environments in a central location for future pipeline runs.

Pipeline parameters

--fasta

Required. Path or URL to fasta file, can be gzipped.

--gtf or --gff

Required. Path or URL to GTF or GFF file, can be gzipped.

--peaks_bed

Required. Path to peak file in BED format. The first four columns must contain, in order: chrom, start, end, peakid. Example file.
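
To check a peak file before launching a run, something like the following minimal sketch may help. It is not part of the pipeline: `parse_peaks_bed` is a hypothetical helper, and the assumptions that the file is tab-delimited and that `end` must be greater than `start` follow common BED usage rather than anything stated above.

```python
def parse_peaks_bed(text):
    """Parse BED text into (chrom, start, end, peak_id) tuples, validating each row."""
    peaks = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        # Skip blank lines and the optional comment/header lines that BED files may carry.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            raise ValueError(
                f"line {line_no}: expected at least 4 columns, found {len(fields)}"
            )
        chrom, peak_id = fields[0], fields[3]
        start, end = int(fields[1]), int(fields[2])
        if start < 0 or end <= start:
            raise ValueError(f"line {line_no}: invalid interval {start}-{end}")
        peaks.append((chrom, start, end, peak_id))
    return peaks


example = "chr1\t100\t600\tpeak_1\nchr2\t5000\t5400\tpeak_2\n"
peaks = parse_peaks_bed(example)
```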

--gene_ids

Optional. File listing the gene IDs, as they appear in the GTF or GFF, to screen for enhancers and motifs. One gene ID per line. Example file. If --gene_ids is not specified, all gene IDs will be extracted from the GTF or GFF. Default = null.

--extend_peaks

Optional. Number of bases by which to extend peaks (up and downstream). Default = 0.
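
As a rough illustration of symmetric peak extension, consider the sketch below. Clipping to the chromosome bounds is an assumption, suggested by the genome-indexing step that retrieves chromosome lengths; `extend_peak` is a hypothetical helper, not the pipeline's implementation.

```python
def extend_peak(start, end, extend_by, chrom_length):
    """Extend a peak by extend_by bases on each side, clipped to [0, chrom_length]."""
    new_start = max(0, start - extend_by)          # do not run off the chromosome start
    new_end = min(chrom_length, end + extend_by)   # do not run past the chromosome end
    return new_start, new_end


# A peak near the start of a 1,000 bp chromosome, extended by 200 bp each way.
extended = extend_peak(100, 600, extend_by=200, chrom_length=1000)
```

With the default of 0, the peak coordinates come back unchanged.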

--enhancer_window

Optional. Distance from TSS in GTF or GFF within which enhancers are screened. Default = 50000.
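
The window-based assignment can be sketched as below. This is a simplified illustration under stated assumptions: distance is measured from the nearest peak edge (the pipeline may instead use the peak midpoint or another convention), the threshold is inclusive, and `genes_within_window` and its input layout are hypothetical.

```python
def genes_within_window(peak, tss_by_gene, window=50000):
    """Return genes whose TSS lies within `window` bases of the peak.

    peak: (chrom, start, end)
    tss_by_gene: dict mapping gene name to (chrom, tss_position)
    """
    chrom, start, end = peak
    hits = []
    for gene, (tss_chrom, tss_pos) in tss_by_gene.items():
        if tss_chrom != chrom:
            continue
        # Distance is zero when the TSS falls inside the peak itself.
        if tss_pos < start:
            distance = start - tss_pos
        elif tss_pos > end:
            distance = tss_pos - end
        else:
            distance = 0
        if distance <= window:
            hits.append(gene)
    return sorted(hits)


peak = ("chr1", 100000, 100500)
tss_by_gene = {
    "A": ("chr1", 60000),    # 40,000 bp upstream of the peak
    "B": ("chr1", 40000),    # 60,000 bp upstream, outside the default window
    "C": ("chr2", 100200),   # different chromosome
    "D": ("chr1", 150500),   # exactly 50,000 bp downstream
}
assigned = genes_within_window(peak, tss_by_gene)
```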

--ctcf

Optional. BED file containing co-ordinates for CTCF peaks to use for annotating enhancers to genes. If this argument is specified, the pipeline will annotate enhancers using CTCF windows rather than using --enhancer_window. Default = null.
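
The CTCF-based strategy can be illustrated with the following sketch, which takes the start of the nearest upstream CTCF site and the end of the nearest downstream CTCF site as the bounds for TSS assignment. It is not the pipeline's code. Falling back to the chromosome ends when no flanking site exists, ignoring CTCF sites that overlap the peak, and treating the bounds as inclusive are all assumptions, and both function names are hypothetical.

```python
def ctcf_window(peak_start, peak_end, ctcf_sites, chrom_length):
    """Return (left, right) bounds set by the CTCF sites flanking a peak.

    ctcf_sites: list of (start, end) tuples on the same chromosome as the peak.
    """
    upstream = [site for site in ctcf_sites if site[1] <= peak_start]
    downstream = [site for site in ctcf_sites if site[0] >= peak_end]
    # Nearest upstream site is the one ending closest to the peak; use its start.
    left = max(upstream, key=lambda site: site[1])[0] if upstream else 0
    # Nearest downstream site is the one starting closest to the peak; use its end.
    right = min(downstream, key=lambda site: site[0])[1] if downstream else chrom_length
    return left, right


def genes_in_ctcf_window(peak_start, peak_end, ctcf_sites, tss_by_gene, chrom_length):
    """Return genes whose TSS position falls inside the flanking CTCF window."""
    left, right = ctcf_window(peak_start, peak_end, ctcf_sites, chrom_length)
    return sorted(gene for gene, pos in tss_by_gene.items() if left <= pos <= right)


ctcf_sites = [(1000, 1200), (5000, 5200), (20000, 20300), (40000, 40100)]
tss_by_gene = {"g1": 6000, "g2": 25000, "g3": 20300}
window = ctcf_window(8000, 8500, ctcf_sites, chrom_length=50000)
genes = genes_in_ctcf_window(8000, 8500, ctcf_sites, tss_by_gene, chrom_length=50000)
```

Unlike the fixed --enhancer_window, the resulting window differs per peak and can be asymmetric around it.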

--motif_matrix

Optional. By default the pipeline will screen against all motifs in the JASPAR core vertebrate non-redundant database (--motif_matrix jaspar_core_vert_nonredundant_motifs). The redundant database can be selected instead using --motif_matrix jaspar_core_vert_redundant_motifs. Alternatively, a path to a matrix file in MEME format can be provided. Example file.

--markov_background

Optional. Markov background model used to define base frequencies for motif screening. This is calculated by default from the provided --fasta input.
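
To make "base frequencies" concrete, the sketch below computes a zero-order background (single-base frequencies) from a set of sequences. It is a simplification for illustration: the tool the pipeline uses, the Markov order, and any strand averaging are not stated here, and `base_frequencies` is a hypothetical helper.

```python
from collections import Counter


def base_frequencies(sequences):
    """Return the relative frequency of A, C, G and T across all sequences.

    Ambiguous bases such as N are ignored; soft-masked (lowercase) bases are counted.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no A/C/G/T bases found")
    return {base: counts[base] / total for base in "ACGT"}


freqs = base_frequencies(["ACGT", "aaNN"])
```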

--fimo_pval

Optional. p-value threshold used by FIMO for motif screening. Default = 0.0001.

--gene_name_col

Optional. Entry in GTF or GFF corresponding to gene names. Default = 'gene_name'.

--gene_id_col

Optional. Entry in GTF or GFF corresponding to gene IDs. Default = 'gene_id'.
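
The "entries" referred to by --gene_name_col and --gene_id_col are keys in the attribute column (column 9) of the annotation file. The sketch below shows how such keys could be read from a GTF line. It handles only GTF-style `key "value";` attributes, not GFF3 `key=value` pairs, and falling back to the gene ID when no gene name is present is an assumption rather than documented pipeline behaviour. The function names are hypothetical.

```python
import re

# Matches GTF attributes of the form: key "value"
ATTRIBUTE_PATTERN = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_gtf_attributes(attribute_field):
    """Turn a GTF attribute string into a dict of key -> value."""
    return dict(ATTRIBUTE_PATTERN.findall(attribute_field))


def gene_id_and_name(gtf_line, gene_id_col="gene_id", gene_name_col="gene_name"):
    """Extract the gene ID and gene name from one tab-delimited GTF line."""
    fields = gtf_line.rstrip("\n").split("\t")
    attributes = parse_gtf_attributes(fields[8])
    gene_id = attributes.get(gene_id_col)
    # Assumed fallback: use the gene ID when the name attribute is missing.
    gene_name = attributes.get(gene_name_col, gene_id)
    return gene_id, gene_name


line = 'chr1\tensembl\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_name "SOX2";'
ids = gene_id_and_name(line)
```

If an annotation stores names under a different key, for example `Name`, passing that key as `gene_name_col` mirrors what --gene_name_col is described as doing.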

--skip_motif_analysis

Optional. Boolean parameter. If true, the motif analysis that normally runs after annotating enhancers is skipped. Default = false.

--outdir

Optional. Directory to output results to. Default = 'results'.