Ribo-DT

Ribo-DT is a snakemake pipeline to infer single-codon and codon-pair dwell times as well as gene flux from ribosome profiling data using generalized linear model (GLM) with negative binomial noise (Gobet et al., PNAS, 2020).

Ribo-DT consists of a Snakefile, a conda environment file (Ribo_DT.yaml), a configuration file (config.yaml) and a set of R and perl scripts to infer ribosome dwell times and gene flux from raw ribosome profiling data.

Ribo-DT pipeline overview

Download the genome, cds and gtf files from ENSEMBL according to the species defined in the configuration file (config.yaml).
Build genome index using STAR.
Download SRA run files (SRR token) specified in the sample spreadsheet from GEO database.
Convert SRA to FASTQ files.
Merge FASTQ files run from the sample according to the the sample spreadsheet definition.
STAR alignment to genome with inline adapter clipping (defined in the configuration file).
Bam files indexing.
Positions of the 5’-end or 3’-end of the reads at the gene start codon are computed genome-wide for each read size in each frame. Respective density plots are reported for further verification by the user.
Read counts and cds positions are retrieved. The 5’- or 3’-end position of the read is shifted according to the A-site offset specified in the file computed at step 8.
Parse the downloaded cds file to be used as a reference in the GLM fit.
Load the parsed and read count files to generate the matrix for the fit.
Gene and position filtering. Fit the generalized linear model with the glm4 function.
Compute coefficients p-value and rescale the coefficients according to our convention (see method section in the paper).
Plot single and codon-pair dwell time heatmaps as well as fragment size distribution.
Output tables with p-values, standard errors, t-values and p-values.

Note that when RNA-seq and Ribo-seq are provided for the same sample, RNA-Seq is fitted first and used as an GLM offset in the Ribo-seq fit to reduce library preparation bias.

Running the Ribo-DT pipeline

Prerequisites
The current workflow is designed to run on high computing cluster with slurm workload manager but running_command.sh can be modified to match your cluster configuration. All the required packages and softwares are installed via the conda environment file (Ribo_DT.yaml).

Clone the github repository and change directory
Clone this git repository to the location where you want to run your analysis

git clone https://github.com/cgob/codonDT_snakemake.git
cd codonDT_snakemake

Download and install conda (select an appropriate version for your operating system)

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Restart your terminal

Activate conda environment
This step install snakemake, all the required softwares, librairies, and dependencies in a conda environment to run the workflow.

conda env create -f Ribo_DT.yaml
conda activate Ribo_DT

Sample file
Edit the sample file (samples.tsv or your own file).

A minimal header with the fields "SAMPLES", "Type" and "SRR" is required.
Fill the sample names, type (RIBO for ribosome profiling dataset or RNA for rna-seq dataset) and GEO run accession number. To do so, extract the GEO accession number of the study of interest. Click on run selector, select your samples of interest and download the metadata text file. This file can be edited to meet the sample file requirements.
In case fastq files are directly provided, files names must be specified under the "SRR" field without the ".fastq" extension and placed in the "/workdir/data/raw/" output folder.

Configuration file
Edit the configuration file (config.yaml). Set:

workdir with the path of the output and working files.
homedir with the path of the Snakefile directory.
refdir with the path of the reference genome directory.
species with the proper species for your dataset (Mouse, Human, Yeast).
adapter with the 3'-adapter used during library preparation of your dataset.
L1 with the read size lower bound ( L1 > reads > L2 are kept for the fit).
L2 with the read size upper bound.
library with 'pos_neg' or 'neg_pos' depending on the strandness configuration of your library preparation.
samples with the tab-delimited file defined above describing your samples.
filter_1 with filter threshold for the minimum number of reads per gene.
filter_2 with p-value threshold for dwell times in the heatmap representation.
A_site_end with 5p or 3p defining which read ends to use to compute A site offsets from the pile-up densities at the start codons.

Running the pipeline

bash running_command.sh

Running the pipeline on a personal computer

snakemake --cores N

with N the number of CPU cores to be used at the same time.

Author

Cédric Gobet

cgob / codonDT_snakemake

readme

Ribo-DT

Ribo-DT pipeline overview

Running the Ribo-DT pipeline

Author