Novartis / EQP-cluster

Unix-based RNA-seq quantification pipeline
Apache License 2.0
10 stars 3 forks source link

Exon quantification pipeline (EQP-cluster)

EQP-cluster is a Unix-based RNA-seq quantification pipeline which takes a set of sample Fastq files as input, aligns them against reference files, and generates files with the gene, exon, or junction counts for each sample.

It provides scripts to facilitate the distributed execution of alignment and quantification operations for each of the samples and is designed to support the Univa* Grid Engine (UGE) batch-queuing/scheduling system (job submission via qsub).

If only the quantification step is of interest and your Fastq files are pre-aligned, then you can also use
https://github.com/novartis/EQP-QM
which is a Unix based RNA-seq quantification module; it uses SAM/BAM genome alignment files as input and creates gene, exon, and junctions counts.


* Univa® is a registered trademark of Univa Corporation

Version

This document corresponds to EQP-cluster version 2.1.0

Installation of EQP-cluster

After downloading and unpacking the files from GitHub, the download directory should contain the following entries:
LICENSE.txt bin/ exon-pipeline-scripts/ README.md exon-pipeline-files.tgz samples.tgz

To complete the installation of EQP, go to the bin directory and call the script create-bowtie2-indices.sh:
cd bin
create-bowtie2-indices.sh

Note that create-bowtie2-indices.sh requires that the programs samtools and bowtie2-build are installed. See below for the version requirements.

The script create-bowtie2-indices.sh extracts the files from the tar file exon-pipeline-files.tgz into the directory exon-pipeline-files and launches three calls to bowtie2-build in the background to construct the Bowtie2 indices of the genome Fasta file, the transcript Fasta file, and a custom junction Fasta file. The computation of the Bowtie2 indices will take several hours.

Limitations:
Note that currently EQP-cluster makes use of human Ensembl76 reference files (on which the files contained in the directory exon-pipeline-files are based) and is, thus, configured to quantify only human data; furthermore, due to the way the custom junction Fasta file is created EQP-cluster works best with reads of length 101bp or less.

Other reference file package will be made available in the future.

Once the Bowtie2 indices are computed, EQP-cluster is ready to be used. The directory containing the subdirectories bin, exon-pipeline-scripts, and exon-pipeline-files will be called <project directory> in the following.

Dependencies

Running EQP-cluster

For the processing of samples EQP relies one a fairly rigid directory structure in which reference files are located at predefined locations in the exon-pipeline-files subdirectory, scripts and programs are located at predefined locations in the exon-pipeline-scripts subdirectory, and Fastq files are located at predefined locations in the samples subdirectory. In following we describe how to setup a project specific directory structure in the samples subdirectory.

Creating a samples directory

For each of your samples you need to create a subdirectory in <project directory>/samples. For each sample directory EQP assumes that it contains one or two Fastq files (or links to Fastq files) called either .fastq.gz or _1.fastq.gz and _2.fastq.gz depending on whether you want to process single-read or paired-end reads. See the document manual/EQP-manual.pdf for more information on how create samples directories in a more automated way using the convenience script create-fastq-links.sh.

After the creation of the sample directories, the content of the

should be: `LICENSE.txt exon-pipeline-files/ samples/` `README.md exon-pipeline-scripts/` `bin/ manual/` ### Generating the counts After the creation of the sample directories EQP is ready for execution using the script `process-samples.sh` in the subdirectory `bin`. A typical (and complete) processing of the Fastq files in the samples subdirectory is accomplished by the following calls: 1. `process-samples.sh split` Split the input Fastq files into chunks of 25M reads and filter out reads that originate completely from the poly-A tails. Please note that this step needs to be executed even if the input Fastq files contain less than 25M reads as EQP changes the reads identifiers in this step to a format of which is required in the `compute-counts` step below. All sub-processes must finish before running the next command. 2. `process-samples.sh align-all` Align all chunk Fastq files against the transcript, genome, and junction Fasta reference file. 3. `process-samples.sh compute-counts` Generate the gene, exon, and junction counts for each chunk. This step can be executed directly after the alignment step without the need to wait for the completion of the jobs. Again all sub-processes must finish before the next call. 4. `process-samples.sh combine-counts` Combine the counts of all chunks of one sample into an aggregate count and combine the counts for the samples in a file with has a matrix structure with the gene, exon, or junction identifiers as the rows and the counts for each sample as the columns. Please read the document `manual/EQP-manual.pdf` for a much detailed description of the different operations and various options, for instance, to check the status of the submitted jobs. ### Location of the results The results can then be found in the files: `/samples/count-files/-bowtie2-pe-.cnt` where * `` is the base name of , that is, just the name of the directory without the complete path * `` is one of gene, exon, or junction. EQP-cluster and UGE ------------------- EQP is designed to work with the Univa® Grid Engine (UGE, formerly Sun Grid Engine, SGE). Most of the processing steps are submitted to the UGE scheduler via the `qsub` command (with the notable exception of `process-samples.sh combine-counts` which is executed on the machine on which the command is issued). Once submitted, a job is then executed asynchronously on a node of the cluster managed by UGE. Below we show which resources are used by the `qsub` call of EQP. It is also possible to run EQP without UGE. In this case the option: ` -noqsub ` needs to be supplied to all commands. With this option EQP launches processes in the background until the number of cores given by the `-noqsub` option is exhausted. Usually, the different commands require more than one core: * `split` requires one core * `align-all` six cores * `compute-counts` four cores Note that the number of cores used by EQP will be at most the number given after `-noqsub`. If this number is smaller than the minimum number of required cores for one process, then, nevertheless, at least one process will be launched using the number of cores given by the `-noqsub` option. The format of the qsub command used by EQP is as follows: `qsub -V -l m_mem_free= -l h_rt= -N -hold_jid ` ` -o -pe smp ` The parameters `-V, -N, -hold_jid, -o`, and `-pe smp` are generally accepted by UGE `qsub` whereas the parameters `-l m_mem_free= -l h_rt=` are dependent on the configuration of the UGE instance. Here, `` is the memory requirement per core and `` is either given as the number of seconds (14400 for four hours, default value) or as hours (12:00:00 for compute-counts as it sometimes may take more than four hours). If your UGE instance does not accept the parameters `m_mem_free` and `h_rt` or requires other parameters (e.g. a queue name), the recommendation is to edit the script `process-samples.sh` and to adapt the calls to `qsub` in the function `submitJob` which is the only place where `qsub` jobs are actually submitted. Testing EQP-cluster ------------------- In order to test EQP just unpack the file `samples.tgz`: `tar xfz samples.tgz` Then go to the `bin` directory, and execute run `run-test.sh`: `cd bin` `run-test.sh` This executes the sequence of `split`, `align-all`, `compute-counts`, and `combine-counts` for the two samples `sample1` and `sample2` (which are contained in the `samples` directory) using the option `-noqsub`. If you want to test whether the submission of jobs via UGE works, just execute the four commands `process-samples.sh split` `process-samples.sh align-all` `process-samples.sh compute-counts` `process-samples.sh combine-counts` as described above and compare the newly computed counts in the directory `samples/count-files` with the previously computed counts in the directory `samples/comparison-counts` by calling (in the `samples` subdirectory): `run-test.sh compare-only` License ======= Copyright 2015 Novartis Institutes for Biomedical Research Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at `http://www.apache.org/licenses/LICENSE-2.0` Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.