Uniform processing pipeline and peak caller for STARR-seq data
Preferably, create a conda environment with Python 2.7
conda create -n starrpeaker python=2.7 pybedtools
conda activate starrpeaker
pip install git+https://github.com/gersteinlab/starrpeaker
starrpeaker -h
Few notes on how alignment (BAM) files were prepared
For each biological replicates in FASTQ format
samtools view -F 3852 -f 2 -q 40
4. Merged biological replicates using SAMtools
## Inputs
* Input alignment (BAM) file (STARR-seq input)
* Output alignment (BAM) file (STARR-seq output)
* Covariates (BigWig) file(s)
* Chrom Size file (i.e., https://github.com/gersteinlab/starrpeaker/blob/master/data/GRCh38.chrom.sizes.simple.sorted or https://hgdownload-test.gi.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes)
* Blacklist (BED) file (i.e., https://github.com/gersteinlab/starrpeaker/blob/master/data/ENCODE_blacklist_GRCh38_ENCFF419RSJ_merged.bed or https://www.encodeproject.org/files/ENCFF419RSJ/)
## Covariates
The peak calling algorithm models STARR-seq fragment coverage across the genome using multiple covariates to correct for potential sequencing bias. It is recommended to include potential confounding variables into the model. These include but not limited to GC-content, mappability tracks, and so on.
The following covariates have been pre-computed for use with STARRPeaker:
* [GRCh38 GC content](https://drive.google.com/file/d/1ZSznITGkqPLXzqAKElnjUrmCAtSZo2Gk/view?usp=sharing) [(backup)](http://archive2.gersteinlab.org/proj/starrpeaker/STARRPeaker_cov_GRCh38_ucsc-gc-5bp.bw)
* [GRCh38 mappability](https://drive.google.com/file/d/1ZSM7V_jgf61QGMpBfK8udOOMjq1QlkQU/view?usp=sharing) [(backup)](http://archive2.gersteinlab.org/proj/starrpeaker/STARRPeaker_cov_GRCh38_gem-mappability-100mer.bw)
* [GRCh38 RNA folding energy](https://drive.google.com/file/d/1ZSO6q_b3alp82uApyOMnewdxsFy7ph1y/view?usp=sharing) [(backup)](http://archive2.gersteinlab.org/proj/starrpeaker/STARRPeaker_cov_GRCh38_linearfold-folding-energy-100bp.bw)
* [hg19 GC content](https://drive.google.com/file/d/1Zkc9cxIePtWsyaJ_JQzHpg6D0c7D_9jX/view?usp=sharing) [(backup)](http://archive2.gersteinlab.org/proj/starrpeaker/STARRPeaker_cov_hg19_ucsc-gc-5bp.bw)
* [hg19 mappability](https://drive.google.com/file/d/1ZTNPszH1s7p4-sagDVxHLuFXZBzBPDbd/view?usp=sharing) [(backup)](http://archive2.gersteinlab.org/proj/starrpeaker/STARRPeaker_cov_hg19_gem-mappability-100mer.bw)
* [hg19 RNA folding energy](https://drive.google.com/file/d/1ZVwJeSKYQ12328AhrB7JieJV8QkvhQQt/view?usp=sharing) [(backup)](http://archive2.gersteinlab.org/proj/starrpeaker/STARRPeaker_cov_hg19_linearfold-folding-energy-100bp.bw)
Sources:
* GRCh38 genome: https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/
* GRCh38 GC content: https://hgdownload.soe.ucsc.edu/gbdb/hg38/bbi/gc5BaseBw/gc5Base.bw
* GRCh38 mappability: (computed using gem-library)
* GRCh38 RNA folding energy: (computed using linearfold; see MS for details)
* hg19 genome: https://www.encodeproject.org/files/male.hg19/
* hg19 GC content: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/gc5Base/hg19.gc5Base.txt.gz
* hg19 mappability: (computed using gem-library)
* hg19 RNA folding energy: (computed using linearfold; see MS for details)
## Usage
usage: starrpeaker.py [-h] --prefix PREFIX --chromsize CHROMSIZE --blacklist BLACKLIST -i INPUT -o OUTPUT [--length LENGTH] [--step STEP] [--cov COV [COV ...]] [--min MIN] [--max MAX] [--readstart] [--strand STRAND] [--threshold THRESHOLD] [--mode MODE] [--mincov MINCOV] [--eq EQ] [--se] [--minfc MINFC] [--capture CAPTURE] [--slop SLOP]
STARRPeaker
optional arguments: -h, --help show this help message and exit --prefix PREFIX Output File Prefix --chromsize CHROMSIZE Chrom Sizes --blacklist BLACKLIST Blacklist Region in BED format -i INPUT, --input INPUT Input BAM File -o OUTPUT, --output OUTPUT STARR-seq BAM File --length LENGTH Bin Length --step STEP Step Size --cov COV [COV ...] Covariate BigWig Files --min MIN Minimum Template Size --max MAX Maximum Template Size --readstart Use Read Start Position instead of Fragment Center --strand STRAND Use all/fwd/rev Stranded Fragments --threshold THRESHOLD Adjusted P-value Threshold --mode MODE Mode --mincov MINCOV Minimum Coverage --eq EQ Extreme Quantile to Remove --se Use Single-End instead of Paired-end Sequencing --minfc MINFC Minumum Fold Change --capture CAPTURE Capture Region in BED format --slop SLOP Extend Capture Region in each direction
## Example
starrpeaker --prefix
## Outputs Files
* *prefix*.bin.bed: Genomic bin BED file
* *prefix*.bam.bct: Alignment counts in BST format (1st col: input, 2nd col: output, 3rd col: normalized input)
* *prefix*.cov.tsv: Covariate matrix in TSV format
* *prefix*.input.bw: Input fragment coverage in bigWig format
* *prefix*.output.bw: Output fragment coverage in bigWig format
* *prefix*.peak.bed: Initial peak calls (before centering and merging)
* *prefix*.peak.final.bed: Final peak calls
* *prefix*.peak.pval.bw: P-value track in bigWig format (-log10)
* *prefix*.peak.qval.bw: Q-value track in bigWig format (-log10)
## Final Peak Call Format (v1.1 and above; BED6+5)
* Column 1: Chromosome
* Column 2: Start position
* Column 3: End position
* Column 4: Name (peak rank based on score, 1 being the highest rank)
* Column 5: Score (integer value of "100 * fold change", maxed at 1000 per BED format specification)
* Column 6: Strand
* Column 7: Log2 Fold change (normalized output/input ratio, in log2 space)
* Column 8: Input fragment coverage (total fragments across/within replicate(s))
* Column 9: Output fragment coverage (total fragments across/within replicate(s))
* Column 10: -log10 of P-value
* Column 11: -log10 of Q-value (Benjamini-Hochberg False Discovery Rate, FDR)
*ENCODE MPRA/STARR-seq BED6+5 common file format: https://www.encodeproject.org/documents/9f5d2b5a-bd29-4983-9c01-fab4ab8b5ea2/*
## Final Peak Call Format (up to v1.0; BED6+4)
* Column 1: Chromosome
* Column 2: Start position
* Column 3: End position
* Column 4: Name (peak rank based on score, 1 being the highest rank)
* Column 5: Score (integer value of "100 * fold change", maxed at 1000 per BED format specification)
* Column 6: Strand
* Column 7: Fold change (output/normalized-input)
* Column 8: Output fragment coverage
* Column 9: -log10 of P-value
* Column 10: -log10 of Q-value (Benjamini-Hochberg False Discovery Rate, FDR)
*BED format specification: https://genome.ucsc.edu/FAQ/FAQformat.html#format1*