Snakemake pipeline for detection & quantification of novel last exons/polyadenylation events from bulk RNA-sequencing data
Individual workflow steps can be toggled on or off depending on your use case. Please consult the manuscript (coming soon to bioRxiv) for a detailed description of the workflow.
The pipeline requires as input:
(the FASTA index can be generated with `samtools faidx`). Please see notes on data compatibility for further information.
The pipeline makes use of Snakemake & conda environments to install pipeline dependencies. If you do not have a conda installation on your system, head to the conda website for installation instructions. mamba can also be used as a drop-in replacement (and is generally recommended, as it is much faster than conda!).
Assuming you have conda/mamba available on your system, the recommended way to run the pipeline is to use the 'execution' conda environment, which will install the required Snakemake version and other dependencies (including mamba for faster conda environment installation):
```shell
conda env create -f envs/snakemake_papa.yaml
# or, equivalently, with mamba:
mamba env create -f envs/snakemake_papa.yaml
```
Once installation is complete, you can activate the environment with the following command:
```shell
conda activate papa_snakemake
```
Note: the default config file is set up ready to run with the test data packaged with the repository. If you'd like to see example outputs, you can skip straight to running the pipeline.
All pipeline parameters and inputs are declared in config/config.yaml, which needs to be customised for each run. The first section defines the input file locations described above and the output directory destination; every parameter is described by a comment directly above it in the config file.
The pipeline is modular and has several different run modes. Please see workflow control section at the end of the README for further details.
Sample information and input file relationships are defined in a CSV sample sheet with the following columns (an example can be found at config/test_data_sample_tbl.csv):

- sample_name - unique identifier for the sample. Identifiers must not contain hyphen characters (i.e. '-')
- condition - key to group samples of the same condition of interest (e.g. 'control' or 'knockdown'). Note that the condition value of the first row is considered the 'base' condition/denominator for differential usage analysis
- path - path to the BAM file containing aligned reads for the given sample. BAM files should be coordinate sorted and have a corresponding BAI index file at the same location (suffixed with '.bai')
- fastq1 - path to the FASTQ file storing the 1st mates of read pairs aligned for the same sample
- fastq2 - path to the FASTQ file storing the 2nd mates of read pairs aligned for the same sample. If the dataset is single-end, leave this field blank

Once you are satisfied with the configuration, you should perform a dry run to check that Snakemake can find all the input files and declare a run according to your parameterisation. Execute the following command (with the papa_snakemake conda environment active as described above):
```shell
snakemake -n -p --use-conda
```
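Independently of the dry run, you can sanity-check the sample sheet rules yourself. The snippet below is a hypothetical helper (not part of the pipeline) that embeds a minimal paired-end sheet with made-up sample names and paths, then checks the hyphen and uniqueness constraints described above:

```python
import csv
import io

# Hypothetical minimal sample sheet (sample names and paths are illustrative only)
sheet = """sample_name,condition,path,fastq1,fastq2
ctrl_rep1,control,bam/ctrl_rep1.bam,fq/ctrl_rep1_1.fastq.gz,fq/ctrl_rep1_2.fastq.gz
kd_rep1,knockdown,bam/kd_rep1.bam,fq/kd_rep1_1.fastq.gz,fq/kd_rep1_2.fastq.gz
"""

rows = list(csv.DictReader(io.StringIO(sheet)))

# sample_name must not contain hyphens, and should be unique
names = [row["sample_name"] for row in rows]
assert all("-" not in name for name in names), "hyphens are not allowed in sample_name"
assert len(set(names)) == len(names), "sample_name values must be unique"

# The condition of the first row is treated as the base/denominator condition
print(rows[0]["condition"])  # control
```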
If you are happy with the dry run, you can execute the pipeline locally with the following command, replacing `<integer>` with the number of cores to use:

```shell
snakemake -p --cores <integer> --use-conda
```
Note: conda environments will need to be installed the first time you run the pipeline, which can take a while.
Output is organised into the following subdirectories:
```shell
$ tree -d -L 1 test_data_output
test_data_output
├── benchmarks
├── differential_apa
├── logs
├── salmon
├── stringtie
└── tx_filtering
```
The benchmarks subdirectory stores the output of Snakemake's benchmark directive (wall clock time, memory usage) for each job. Key output files include:

- test_data_output/differential_apa/dexseq_apa.results.processed.tsv
- test_data_output/differential_apa/summarised_pas_quantification.<counts/gene_tpm/ppau/tpm>.tsv
- test_data_output/tx_filtering/all_conditions.merged_last_exons.3p_end_filtered.gtf
- test_data_output/tx_filtering/novel_ref_combined.last_exons.gtf
- test_data_output/tx_filtering/novel_ref_combined.quant.last_exons.gtf
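As a loose illustration of what a usage table conveys (assuming ppau denotes the proportional usage of each last exon within its gene; consult output_docs.md for the authoritative definitions), a per-gene usage fraction can be computed from quantification values as below. The numbers and data layout are hypothetical, not pipeline code:

```python
from collections import defaultdict

# Hypothetical last-exon TPMs keyed by (gene, last_exon)
tpm = {
    ("GENE1", "LE1"): 30.0,
    ("GENE1", "LE2"): 70.0,
    ("GENE2", "LE1"): 50.0,
}

# Sum TPM per gene
gene_total = defaultdict(float)
for (gene, _le), value in tpm.items():
    gene_total[gene] += value

# Usage fraction = last-exon TPM / total gene TPM
usage = {key: value / gene_total[key[0]] for key, value in tpm.items()}
print(usage[("GENE1", "LE2")])  # 0.7
```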
For a full description of output files, field descriptions etc., see output_docs.md.
The 'General Pipeline Parameters' section declares how to control which workflow steps are performed. The pipeline is split into modules - 'identification', 'quantification' and 'differential usage':
- run_identification - controls whether to perform novel last exon discovery with StringTie + filtering
- run_differential - controls whether to perform differential usage analysis between conditions with DEXSeq

When run_identification is set to False, a reference 'last exon-ome' must still be generated/provided as input to Salmon. There are a few possibilities:
- To use the reference annotation alone, set use_provided_novel_les and use_precomputed_salmon_index to False
- To supply a pre-computed set of novel last exons, set use_provided_novel_les to True and provide the GTF via the input_novel_gtf parameter
- To reuse a pre-computed Salmon index, set use_precomputed_salmon_index to True and fill in the precomputed_salmon_index, tx2le, tx2gene, le2gene and info parameters

As long as the reference poly(A) site file is in BED format, in theory any database (e.g. PolyADB, GENCODE manual annotation, a custom annotation) can be used.
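Putting these options together, a run that skips identification and reuses a pre-computed Salmon index might be sketched in config/config.yaml along the following lines. The parameter names are those described above, but the paths are placeholders and the exact layout of the shipped config may differ:

```yaml
run_identification: False
run_differential: True

use_provided_novel_les: False
input_novel_gtf: ""

use_precomputed_salmon_index: True
precomputed_salmon_index: resources/salmon_index/
tx2le: resources/tx2le.tsv
tx2gene: resources/tx2gene.tsv
le2gene: resources/le2gene.tsv
info: resources/info.tsv
```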
Because Salmon is a core dependency and is distributed under the GNU General Public License v3.0 (GPLv3), PAPA is also licensed under GPLv3 (see the Open Source Guide).