YeoLab / skipper

Skip the peaks and expose RNA-binding in CLIP data

skipper

See published article in Cell Genomics: https://www.cell.com/cell-genomics/fulltext/S2666-979X(23)00085-X

Prerequisites

Skipper requires several executables and packages:

Tool Link
R https://www.r-project.org/
Python https://www.python.org/downloads/
Conda/Mamba https://conda.io/projects/conda/en/latest/user-guide/install/index.html
Snakemake https://snakemake.readthedocs.io/en/stable/getting_started/installation.html
UMICollapse https://github.com/Daniel-Liu-c0deb0t/UMICollapse
Skewer https://github.com/relipmoc/skewer
Fastp https://github.com/OpenGene/fastp
bedtools https://github.com/arq5x/bedtools2
STAR https://github.com/alexdobin/STAR
Java https://jdk.java.net/20/
samtools http://www.htslib.org/download/
FastQC https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
HOMER http://homer.ucsd.edu/homer/introduction/install.html

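A quick way to see which of these tools are already reachable is to look them up on PATH. The sketch below does that in Python. The executable names in TOOLS are assumptions about how each tool is usually invoked (HOMER, for instance, is probed through its findMotifsGenome.pl script), and UMICollapse is left out because the install script described below places it in the installation folder. Adjust the list to your own setup.

```python
import shutil

# Assumed executable names; adjust to match your installation.
TOOLS = ["R", "python3", "conda", "snakemake", "skewer", "fastp",
         "bedtools", "STAR", "java", "samtools", "fastqc",
         "findMotifsGenome.pl"]


def find_missing(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing(TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

Tools that are installed but not on PATH can still be used by pointing the config file at them, as described below.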
For example, below are some commands for installing Miniconda.

curl -L -O "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh"

bash Miniconda3-latest-Linux-x86_64.sh

Skipper requires several Python and R packages. To install the precise versions used in the manuscript, use the provided skipper_env.yaml, which installs the same versions of R and the corresponding packages from source.

Option 1: Manual installation (Linux-amd64)

Use conda to create a snakemake environment for installing required packages:

conda env create -f installation/skipper_env.yaml

Use the install_umicollapse.sh script to complete installation of UMICollapse v1.0.0 in the installation folder. Expect the whole process to take around 30 seconds.

cd installation && ./install_umicollapse.sh

Alternatively, at least as of this writing, Skipper is compatible with the newest version of R and its packages. The required R packages can be installed for an existing R installation as follows:

install.packages(c("tidyverse", "VGAM", "viridis", "ggrepel", "RColorBrewer", "Rtsne", "ggupset", "ggdendro", "cowplot"))

if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("GenomicRanges","fgsea","rtracklayer"))

Paths to locally installed versions can be supplied in the config file, described below.

Option 2: Singularity installation (Linux-amd64)

conda create -n snakemake snakemake==7.32.3 star==2.7.10b

Singularity setup: https://docs.sylabs.io/guides/3.11/admin-guide/installation.html

Preparing to run Skipper

Skipper uses a Snakemake workflow. The Skipper.py file contains the rules necessary to process CLIP data from fastqs. Skipper also supports running on BAMs - note that Skipper's analysis of repetitive elements will assume that non-uniquely mapping reads are contained within the BAM files.

Providing an absolute path to the GitHub repository REPO_PATH will help Snakemake find resources regardless of the directory where Skipper is run.

Internal to the Yeo lab, setting the REPO_PATH to /projects/ps-yeolab3/eboyle/encode/pipeline/github/yeo will save time on preprocessing annotation files: check the annotation folder for HepG2, K562, or HEK293T. More annotations are available at /projects/ps-yeolab4/software/skipper/1.0.0/bin/skipper/annotations/.

Numerous resources must be entered in the Skipper_config.py file:

Resource Description
MANIFEST Information on samples to run
GENOME Samtools- and STAR-indexed fasta of genome for the sample of interest
STAR_DIR Path to STAR reference for aligning sequencing reads

Other paths to help Skipper run must be entered:

Path Description
EXE_DIR For convenience to point to stable locally installed software: it is added to PATH when Skipper runs
TOOL_DIR Directory for the tools located in the GitHub

Information about the CLIP library to be analyzed is also required:

Setting Description
UMI_SIZE Bases to trim for deduplication (10 for current eCLIP)
INFORMATIVE_READ Which read (1 or 2) reflects the crosslink site (for Paired End runs)
OVERDISPERSION_MODE Overdispersion can be estimated from multiple input replicates ("input") or multiple CLIP replicates ("clip"): "input" is recommended
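To make the settings above concrete, here is a minimal sketch of what the entries in Skipper_config.py might look like, followed by a small sanity check. Every path is a placeholder, the manifest file name and the TOOL_DIR location are assumptions, and check_library_settings is a hypothetical helper that is not part of Skipper. Compare against the config file shipped in the repository for the authoritative layout.

```python
# Hypothetical Skipper_config.py values; every path below is a placeholder.
REPO_PATH = "/path/to/skipper"
MANIFEST = "/path/to/manifest.csv"   # assumed file name
GENOME = "/path/to/genome.fa"
STAR_DIR = "/path/to/star_index"
EXE_DIR = "/path/to/bin"
TOOL_DIR = REPO_PATH + "/tools"      # assumed location of the repo's tools
UMI_SIZE = 10                        # 10 for current eCLIP
INFORMATIVE_READ = 1                 # 1 or 2; choose per library
OVERDISPERSION_MODE = "input"        # "input" is recommended


def check_library_settings(umi_size, informative_read, overdispersion_mode):
    """Return a list of problems found in the CLIP library settings."""
    problems = []
    if not isinstance(umi_size, int) or umi_size < 0:
        problems.append("UMI_SIZE must be a non-negative integer")
    if informative_read not in (1, 2):
        problems.append("INFORMATIVE_READ must be 1 or 2")
    if overdispersion_mode not in ("input", "clip"):
        problems.append('OVERDISPERSION_MODE must be "input" or "clip"')
    return problems


if __name__ == "__main__":
    for problem in check_library_settings(
            UMI_SIZE, INFORMATIVE_READ, OVERDISPERSION_MODE):
        print(problem)
```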

Customizable input for Skipper

Skipper accepts customizable files for several steps, which are also entered in the Skipper_config.py file:

Input Description
GFF Gzipped gene annotation to partition the transcriptome and count reads.
PARTITION* Gzipped BED file of windows to test (can be generated from GFF file)
FEATURE_ANNOTATIONS* Gzipped TSV file with the following columns: chrom,start,end,name,score,strand,feature_id,feature_bin,feature_type_top,feature_types,gene_name,gene_id,transcript_ids,gene_type_top,transcript_type_top,gene_types,transcript_types (can be generated from GFF file)
BLACKLIST Removes windows from reproducible enriched window files. Start and end coordinates must match tiled windows exactly.
ACCESSION_RANKINGS A ranking of gene and transcript types present in the GFF to facilitate the transcriptome partitioning
REPEAT_TABLE Coordinates of repetitive elements, available from UCSC Genome Browser
REPEAT_BED* Gzipped sorted, nonoverlapping, tab-delimited annotations of repetitive elements: chr,start,end,label,score,strand,name,class,family,proportion_gc
GENE_SETS GMT files of gene sets for gene set enrichment calculation
GENE_SET_REFERENCE TSV of gene set name, number of windows belonging to term, and fraction of windows that lie in gene set genes
GENE_SET_DISTANCE RDS of a matrix containing jaccard index scores for all pairs of gene sets in GMT file

*Skipper can generate these files from other input, or you can make your own versions with the appropriate columns.
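If you build your own REPEAT_BED, it may help to confirm that it is sorted and nonoverlapping before running Skipper. The following is a rough sketch of such a check, not code from Skipper itself. It assumes the file has no header line, that overlap is judged regardless of strand, and that each chromosome's intervals form one contiguous run sorted by start. The function names are hypothetical.

```python
import gzip


def read_bed3(path):
    """Yield (chrom, start, end) from a gzipped, tab-delimited BED file.

    Assumes there is no header line.
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            yield fields[0], int(fields[1]), int(fields[2])


def check_sorted_nonoverlapping(rows):
    """Yield a message for each interval that is out of order or overlaps.

    rows: iterable of (chrom, start, end) tuples in file order,
    using half-open BED coordinates.
    """
    seen = set()
    prev_chrom, prev_start, prev_end = None, None, None
    for chrom, start, end in rows:
        if chrom != prev_chrom:
            if chrom in seen:
                yield f"{chrom}:{start}-{end} chromosome block is not contiguous"
            seen.add(chrom)
            prev_end = end
        else:
            if start < prev_start:
                yield f"{chrom}:{start}-{end} is out of order"
            elif start < prev_end:
                yield f"{chrom}:{start}-{end} overlaps the previous interval"
            # Track the furthest end seen so contained intervals are caught.
            prev_end = max(prev_end, end)
        prev_chrom, prev_start = chrom, start
```

Calling list(check_sorted_nonoverlapping(read_bed3(path))) on your file returns an empty list when no problems are found.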

Want to make your own partition from RNA-seq of a sample? Run the tools/subset_gff.py script on RNA-seq quantifications from Salmon. We used a 1 TPM cutoff. Enter the resulting file for the GFF. This makes the window annotations more accurate but we haven’t carefully examined how important it is for the cell sample to match.
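The expression cutoff itself is simple to illustrate. The sketch below is not tools/subset_gff.py, whose interface is not described here. It only shows how transcripts at or above 1 TPM could be picked out of a Salmon quant.sf file, which is tab-delimited with Name and TPM columns. Treating the cutoff as inclusive is an assumption, and expressed_transcripts is a hypothetical name.

```python
import csv
import io


def expressed_transcripts(quant_handle, min_tpm=1.0):
    """Return the set of transcript names whose TPM is at least min_tpm.

    quant_handle: open text handle on a Salmon quant.sf file.
    """
    reader = csv.DictReader(quant_handle, delimiter="\t")
    return {row["Name"] for row in reader if float(row["TPM"]) >= min_tpm}


if __name__ == "__main__":
    # Tiny made-up quantification table, for illustration only.
    example = io.StringIO(
        "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
        "tx_a\t1000\t850.0\t2.5\t40\n"
        "tx_b\t500\t350.0\t0.3\t2\n"
    )
    print(sorted(expressed_transcripts(example)))
```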

Making a manifest

Column Description
Experiment CLIP samples will be compared against Input samples within an experiment. The same sample can be used in multiple experiments
Sample Each CLIP and Input sample will be processed separately until testing for differential binding
Cells A place to record information on the cell sample used: this is not currently used in analysis
Input_replicate Replicate # for the same Sample. The same Input replicate (fastq and number) can be used for multiple CLIP replicates
Input_adapter Fasta of adapter sequences for Input replicate
Input_fastq Path to Input replicate fastq (multiple files can be entered per cell to be concatenated)
Input_bam (Optional) Enter path to Input BAM file
CLIP_replicate Replicate # for the same Sample. Distinct CLIP replicates are required
CLIP_adapter Fasta of adapter sequences for CLIP replicate
CLIP_fastq Path to CLIP replicate fastq (multiple files can be entered per cell to be concatenated)
CLIP_bam (Optional) Enter path to CLIP BAM file

Skipper requires multiple CLIP replicates of the same sample to call reproducible windows. Enter multiple replicates with the same experiment and sample columns on separate lines, incrementing the replicate number for each replicate. The same input replicate can be used in multiple experiments and repeated for the same sample if you estimate overdispersion from CLIP replicates. If the same replicate is used for multiple comparisons, the sample and replicate columns must be consistent.
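The replicate requirement can be checked before launching a run. The sketch below is a hypothetical helper, not part of Skipper. It assumes the manifest is comma-delimited with a header row using the column names above (check the example manifest for the actual format) and treats the two BAM columns as optional. It reports any Experiment/Sample pair with fewer than two distinct CLIP replicates.

```python
import csv
import io
from collections import defaultdict

# Columns assumed to be mandatory; the BAM columns are treated as optional.
REQUIRED_COLUMNS = ["Experiment", "Sample", "Cells", "Input_replicate",
                    "Input_adapter", "Input_fastq", "CLIP_replicate",
                    "CLIP_adapter", "CLIP_fastq"]


def check_manifest(handle, delimiter=","):
    """Return a list of problems found in a manifest read from handle."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    fieldnames = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in fieldnames]
    if missing:
        return ["missing columns: " + ", ".join(missing)]

    # Collect the distinct CLIP replicate numbers per experiment and sample.
    clip_reps = defaultdict(set)
    for row in reader:
        clip_reps[(row["Experiment"], row["Sample"])].add(row["CLIP_replicate"])

    problems = []
    for (experiment, sample), reps in sorted(clip_reps.items()):
        if len(reps) < 2:
            problems.append(
                f"{experiment}/{sample}: only {len(reps)} CLIP replicate(s)")
    return problems


if __name__ == "__main__":
    with open("manifest.csv", newline="") as handle:  # placeholder path
        for problem in check_manifest(handle):
            print(problem)
```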

See the example manifest in the example folder for the exact formatting and to test running Skipper by downloading the example dataset: https://zenodo.org/records/10636793.

Running Skipper

Skipper can be run like any other Snakemake workflow.

Create a new directory to store output, copy the Snakemake and config files, and make all edits necessary to the config file. In the all rule of the Skipper.py file, comment out output that you do not wish to inspect.

Remember to load the Snakemake environment before running:

conda activate snakemake

Use the dry run function to confirm that Snakemake can parse all the information:

snakemake -ns Skipper.py -j 1

Once Snakemake has confirmed DAG creation, run the workflow locally or, if applicable, submit the jobs to a cluster, using the options that suit your high-performance computing infrastructure:

Option 1: Manually installed packages

snakemake -kps Skipper.py -w 15 -j 30

snakemake -kps Skipper.py -w 15 -j 30 --cluster "sbatch -t {params.run_time} -e {params.error_file} -o {params.out_file} -p condo -q condo -A csd792 --tasks-per-node {threads} --job-name {params.job_name} --mem {params.memory}"

Option 2: Singularity

snakemake -kps Skipper.py -w 15 -j 30 --use-singularity --singularity-args "--bind /tscc"

snakemake -kps Skipper.py -w 15 -j 30 --use-singularity --singularity-args "--bind /tscc" --cluster "sbatch -t {params.run_time} -e {params.error_file} -o {params.out_file} -p condo -q condo -A csd792 --tasks-per-node {threads} --job-name {params.job_name} --mem {params.memory}"
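The commands above differ only in a few flags: -n requests a dry run, -k keeps going after a job fails, -p prints shell commands, -s names the Snakefile, -w sets the latency wait in seconds, and -j sets the number of parallel jobs. If you launch Skipper from a script, a small helper like the following sketch can assemble the local variants. skipper_command is a hypothetical name, and the cluster options are deliberately left out because they are site-specific.

```python
import shlex


def skipper_command(snakefile="Skipper.py", jobs=30, wait=15,
                    dry_run=False, bind=None):
    """Build the snakemake argument list used in this README.

    bind: optional path to bind into the container; setting it
    switches on the Singularity options.
    """
    cmd = ["snakemake", "-ns" if dry_run else "-kps", snakefile]
    if not dry_run:
        cmd += ["-w", str(wait)]
    cmd += ["-j", str(jobs)]
    if bind is not None:
        cmd += ["--use-singularity", "--singularity-args", f"--bind {bind}"]
    return cmd


if __name__ == "__main__":
    print(shlex.join(skipper_command(dry_run=True, jobs=1)))
    print(shlex.join(skipper_command()))
    print(shlex.join(skipper_command(bind="/tscc")))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting issues with the --singularity-args value.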

Did Skipper terminate? Sometimes jobs fail - inspect any error output and rerun the same command if there is no apparent explanation such as uninstalled dependencies or a misformatted input file. Snakemake will try to pick up where it left off.

Skipper output

Skipper produces numerous output files. The output/figures directory contains figures summarizing the data.

Output Description
all_reads Visualization of RNA region preferences based on total reads instead of called windows
threshold_scan Visualization of selection of minimum read coverage for statistical testing
input_distributions Visualization of beta-binomial fits to aggregate data
enriched_windows QC of called enriched windows
enrichment_concordance Mosaic plot of agreement of called enriched windows between replicates
enrichment_reproducibility Number of total and enriched windows as a function of the number of replicates included
reproducible_enriched_windows Visualization of RNA region preferences for windows called by at least two replicates
gene_sets Visualization of top enriched GO terms relative to ENCODE reproducible enriched windows
clip_scatter_re Visualization of enriched repetitive elements
tsne t-SNE visualization of binding preferences relative to ENCODE RBPs

Key outputs: Annotated reproducible enriched windows can be accessed at output/reproducible_enriched_windows/ and HOMER motif output is at output/homer/.

Example CLIP fastqs and processed data are available at GEO and SRA: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE213867