frattalab / PAPA

PAPA (Pipeline-Alternative Polyadenylation) - Snakemake pipeline for analysis of APA from short-read RNA-seq data
GNU General Public License v3.0

PAPA - Pipeline Alternative polyadenylation (APA)

Snakemake pipeline for detection & quantification of novel last exons/polyadenylation events from bulk RNA-sequencing data

The workflow (in brief) is as follows, but can be toggled depending on your use case:

Please consult the manuscript (coming soon to bioRxiv) for a detailed description of the workflow.

Input files & dependencies

The pipeline requires as input:

Please see the notes on data compatibility for further information.

The pipeline uses Snakemake and conda environments to install its dependencies. If you do not have a conda installation on your system, head to the conda website for installation instructions. mamba can also be used as a drop-in replacement (and is generally recommended, as it is much faster than conda).

Assuming you have conda/mamba available on your system, the recommended way to run the pipeline is to use the 'execution' conda environment, which will install the required Snakemake version and other dependencies (including mamba for faster conda environment installation):

<conda/mamba> env create -f envs/snakemake_papa.yaml

Once installation is complete, you can activate the environment with the following command:

conda activate papa_snakemake
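If you script the setup, the choice between conda and mamba can be made automatically. The snippet below is a minimal sketch, not part of the pipeline: the function name `env_create_command` is hypothetical, and it only builds the command shown above (preferring mamba when it is on your PATH) rather than running it.

```python
import shutil

ENV_FILE = "envs/snakemake_papa.yaml"  # path given in the installation command above


def env_create_command(env_file=ENV_FILE, which=shutil.which):
    """Return the env-creation command as an argument list, preferring mamba.

    `which` is injectable so the lookup can be exercised without either tool installed.
    """
    for tool in ("mamba", "conda"):
        if which(tool):
            return [tool, "env", "create", "-f", env_file]
    raise RuntimeError("Neither mamba nor conda was found on PATH")


if __name__ == "__main__":
    try:
        # Print the command instead of executing it, so nothing is installed by accident.
        print(" ".join(env_create_command()))
    except RuntimeError as err:
        print(err)
```

The returned list can be passed to `subprocess.run(..., check=True)` if you want the script to perform the installation itself.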

Configuration

Note: the default config file is set up to run with the test data packaged with the repository. If you'd like to see example outputs, you can skip straight to running the pipeline.

Config YAML file

All pipeline parameters and inputs are declared in config/config.yaml, which needs to be customised for each run. The first section defines the input file locations described above and the output directory destination. Every parameter is described in a comment directly above it in the config file.

The pipeline is modular and has several different run modes. Please see the 'Workflow control' section near the end of the README for further details.

Sample sheet CSV

Sample information and input file relationships are defined in a CSV sample sheet with the following columns (an example can be found at config/test_data_sample_tbl.csv):
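Before launching a run it can help to look over a sample sheet programmatically. The following is a minimal sketch using only the Python standard library; the function name `inspect_sample_sheet` is hypothetical, and it deliberately makes no assumption about which columns the pipeline expects. It reports the header, the number of rows, and any empty cells so you can compare them against the packaged example yourself.

```python
import csv


def inspect_sample_sheet(lines):
    """Summarise a CSV sample sheet given an iterable of text lines (e.g. an open file).

    Returns (column names, number of data rows, [(row number, column), ...] for empty cells).
    Empty cells are only reported, not treated as errors, because some columns may be optional.
    """
    reader = csv.DictReader(lines)
    columns = list(reader.fieldnames or [])
    empty_cells = []
    n_rows = 0
    for n_rows, row in enumerate(reader, start=1):
        for column in columns:
            if not (row.get(column) or "").strip():
                empty_cells.append((n_rows, column))
    return columns, n_rows, empty_cells


# Example usage against the packaged example sheet (run from the repository root):
#
#     with open("config/test_data_sample_tbl.csv", newline="") as handle:
#         columns, n_rows, empty_cells = inspect_sample_sheet(handle)
#     print(columns, n_rows, empty_cells)
```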

Running the pipeline

Once you are satisfied with the configuration, you should perform a dry run to check that Snakemake can find all the input files and plans a run that matches your configuration. Execute the following command (with the papa_snakemake conda environment active, as described above):

snakemake -n -p --use-conda

If you are happy with the dry run, you can execute the pipeline locally with the following command, replacing <integer> with the number of cores you wish to use for parallel execution:

snakemake -p --cores <integer> --use-conda

Note: Conda environments will need to be installed the first time you run the pipeline, which can take a while.
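If you launch the pipeline from a wrapper script, the two invocations above can be generated from one place. This is a minimal sketch with a hypothetical helper name (`snakemake_command`); it only uses the flags shown above (`-n`, `-p`, `--cores`, `--use-conda`) and builds the argument list without running Snakemake.

```python
import shlex


def snakemake_command(cores=None, dry_run=False):
    """Build the dry-run or local-execution command as an argument list."""
    if dry_run:
        return ["snakemake", "-n", "-p", "--use-conda"]
    if not isinstance(cores, int) or cores < 1:
        raise ValueError("cores must be a positive integer for a real run")
    return ["snakemake", "-p", "--cores", str(cores), "--use-conda"]


if __name__ == "__main__":
    # Print both commands; pass the lists to subprocess.run(..., check=True) to execute them.
    print(shlex.join(snakemake_command(dry_run=True)))
    print(shlex.join(snakemake_command(cores=4)))
```

Running the dry-run command first and proceeding to the real run only if it exits successfully mirrors the two-step procedure described above.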

Output

Output is organised into the following subdirectories:

$ tree -d -L 1 test_data_output
test_data_output
├── benchmarks
├── differential_apa
├── logs
├── salmon
├── stringtie
└── tx_filtering

For a full description of output files, field descriptions etc. see output_docs.md
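After a run you may want a quick check that the output directory looks as expected. The sketch below is not part of the pipeline; the function name `missing_output_dirs` is hypothetical and the subdirectory names are taken from the tree listing above. Which subdirectories are produced may depend on the modules you enabled, so treat a missing directory as a prompt to check your configuration rather than as an error.

```python
from pathlib import Path

# Subdirectory names as shown in the test_data_output listing above.
EXPECTED_SUBDIRS = (
    "benchmarks",
    "differential_apa",
    "logs",
    "salmon",
    "stringtie",
    "tx_filtering",
)


def missing_output_dirs(output_dir, expected=EXPECTED_SUBDIRS):
    """Return the expected subdirectory names that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    # Lists whichever subdirectories are absent; an empty list means all were found.
    print(missing_output_dirs("test_data_output"))
```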

Workflow control

The 'General Pipeline Parameters' section declares how to control which workflow steps are performed. The pipeline is split into modules - 'identification', 'quantification' and 'differential usage':

Running modes when identification is not performed

When run_identification is set to False, a reference 'last exon-ome' must still be generated/provided as input to Salmon. There are a few possibilities:

1. Just use reference GTF to construct Salmon index

2. Use a pre-specified set of novel last exons to combine with reference last exons and construct Salmon index

3. Use a pre-computed Salmon index (+ last exon metadata) from a previous PAPA run
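The three possibilities above can be read as a simple decision. The sketch below is illustrative only: apart from `run_identification`, the argument names (`novel_last_exons`, `precomputed_index`) are hypothetical and are not config keys from config/config.yaml, and the behaviour when identification is enabled is an assumption. Consult the comments in the config file for the actual parameters.

```python
def salmon_reference_source(run_identification, novel_last_exons=None, precomputed_index=None):
    """Describe where the Salmon reference would come from for a given setup.

    novel_last_exons and precomputed_index are hypothetical argument names used
    for illustration; they are not taken from config/config.yaml.
    """
    if run_identification:
        # Assumption: when identification runs, its output feeds the Salmon index.
        return "last exons from the identification module"
    if novel_last_exons and precomputed_index:
        # No priority between options 2 and 3 is assumed, so reject the ambiguous case.
        raise ValueError("Provide either novel last exons or a pre-computed index, not both")
    if precomputed_index:
        return "pre-computed Salmon index from a previous PAPA run"
    if novel_last_exons:
        return "reference GTF plus pre-specified novel last exons"
    return "reference GTF only"


if __name__ == "__main__":
    print(salmon_reference_source(False))
```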

Compatible data

License

Because Salmon is a core dependency and is distributed under the GNU General Public License v3.0 (GPLv3), PAPA is also licensed under GPLv3 (Open Source Guide).