Infrastructure code to support DNA pipeline
MIT License

Rubra: a bioinformatics pipeline.

https://github.com/bjpop/rubra

License:

Rubra is licensed under the MIT license. See LICENSE.txt.

Description:

Rubra is a pipeline system for bioinformatics workflows. It is built on top of the Ruffus (http://www.ruffus.org.uk/) Python library, and adds support for running pipeline stages on a distributed compute cluster.

Authors:

Bernie Pope, Clare Sloggett, Gayle Philip, Matthew Wakefield

Installation:

To install, clone this repository and run setup.py:

git clone https://github.com/bjpop/rubra
cd rubra
python setup.py install

If you are on a system where you do not have administrative privileges, we suggest using virtualenv (http://www.virtualenv.org/). On HPC systems you may find virtualenv is already installed.

Usage:

usage: rubra [-h] PIPELINE_FILE --config CONFIG_FILE [CONFIG_FILE ...]
             [--verbose {0,1,2}]
             [--style {print,run,touchfiles,flowchart}]
             [--force TASKNAME] [--end TASKNAME]
             [--rebuild {fromstart,fromend}]

A bioinformatics pipeline system.

optional arguments:
  -h, --help            show this help message and exit
  PIPELINE_FILE         Your Ruffus pipeline stages (a Python module)
  --config CONFIG_FILE [CONFIG_FILE ...]
                        One or more configuration files (Python modules)
  --verbose {0,1,2}     Output verbosity level: 0 = quiet; 1 = normal;
                        2 = chatty (default is 1)
  --style {print,run,touchfiles,flowchart}
                        Pipeline behaviour: print; run; touchfiles;
                        flowchart (default is print)
  --force TASKNAME      tasks which are forced to be out of date
                        regardless of timestamps
  --end TASKNAME        end points (tasks) for the pipeline
  --rebuild {fromstart,fromend}
                        rebuild outputs by working back from end tasks
                        or forwards from start tasks (default is
                        fromstart)
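
For readers who want to see how the options fit together, the command line above can be approximated with argparse. This is a reconstruction from the usage text only, not Rubra's own source: the function name build_parser and the destination name pipeline are hypothetical, and treating --force and --end as single-value options is an assumption based on how they appear in the usage line.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the usage text (a reconstruction, not Rubra's code)."""
    parser = argparse.ArgumentParser(
        prog="rubra", description="A bioinformatics pipeline system.")
    # Positional argument: the Ruffus pipeline module.
    parser.add_argument("pipeline", metavar="PIPELINE_FILE",
                        help="Your Ruffus pipeline stages (a Python module)")
    # One or more configuration modules; required.
    parser.add_argument("--config", metavar="CONFIG_FILE", nargs="+",
                        required=True,
                        help="One or more configuration files (Python modules)")
    parser.add_argument("--verbose", type=int, choices=[0, 1, 2], default=1,
                        help="0 = quiet; 1 = normal; 2 = chatty")
    parser.add_argument("--style",
                        choices=["print", "run", "touchfiles", "flowchart"],
                        default="print", help="Pipeline behaviour")
    # Assumption: one task name per option, as the usage line suggests.
    parser.add_argument("--force", metavar="TASKNAME",
                        help="task forced to be out of date")
    parser.add_argument("--end", metavar="TASKNAME",
                        help="end point (task) for the pipeline")
    parser.add_argument("--rebuild", choices=["fromstart", "fromend"],
                        default="fromstart", help="rebuild direction")
    return parser


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(
    ["example_pipeline.py", "--config", "example_config.py", "--style", "run"])
print(args.pipeline, args.config, args.style, args.verbose, args.rebuild)
```

Options that are not given fall back to the defaults listed in the help text: verbosity 1, style print, and rebuild fromstart.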

Example:

Below is a little example pipeline which you can find in the Rubra source tree. It counts the number of lines in two files (test/data1.txt and test/data2.txt), and then sums the results together.

rubra example_pipeline.py --config example_config.py --style run

There are 2 lines in the first file and 1 line in the second file. So the result is 3, which is written to the output file test/total.txt.
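
To make the arithmetic concrete, here is a plain Python sketch of what the two stages compute, without Ruffus or Rubra involved. The file contents are hypothetical stand-ins (the real test/data1.txt and test/data2.txt are not reproduced here), and the helper names count_lines and total are made up for illustration.

```python
import os
import tempfile


def count_lines(path):
    """Count the lines in a text file, like `wc -l` on newline-terminated input."""
    with open(path) as handle:
        return sum(1 for _ in handle)


def total(counts):
    """Sum the per-file line counts."""
    return sum(counts)


with tempfile.TemporaryDirectory() as workdir:
    data1 = os.path.join(workdir, "data1.txt")
    data2 = os.path.join(workdir, "data2.txt")
    # Hypothetical contents: two lines in the first file, one in the second.
    with open(data1, "w") as handle:
        handle.write("first line\nsecond line\n")
    with open(data2, "w") as handle:
        handle.write("only line\n")

    # Stage 1: count the lines in each input file.
    counts = [count_lines(data1), count_lines(data2)]
    # Stage 2: add the counts and write the result to the output file.
    result = total(counts)
    with open(os.path.join(workdir, "total.txt"), "w") as handle:
        handle.write("%d\n" % result)

print(counts, result)  # [2, 1] 3
```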

The PIPELINE_FILE argument is a Python script which contains the actual code for each pipeline stage (using Ruffus notation). The --config argument names one or more Python scripts which contain configuration options for the whole pipeline, plus options for each stage (including the shell command to run in the stage).

The --style argument says what to do with the pipeline: "run" means "perform the out-of-date steps in the pipeline". The default style is "print", which just displays what the pipeline would do if it were run. You can get a diagram of the pipeline using the "flowchart" style. You can touch all files in order using the "touchfiles" style, which is mostly useful for forcing Ruffus to acknowledge that a set of steps is up to date.

Configuration:

Configuration options are written into one or more Python scripts, which are passed to Rubra via the --config command line argument.

Some options are required, and some are, well, optional.

Options for the whole pipeline:

pipeline = {
    "logDir": "log",
    "logFile": "pipeline.log",
    "procs": 2,
    "end": ["total"],
}

Options for each stage of the pipeline:

stageDefaults = {
    "distributed": False,
    "walltime": "00:10:00",
    "memInGB": 1,
    "queue": "batch",
    "modules": ["python-gcc"]
}

stages = {
    "countLines": {
        "command": "wc -l %file > %out",
    },
    "total": {
        "command": "./test/total.py %files > %out",
    },
}
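
One way to read the three dictionaries above is that each stage starts from stageDefaults and then applies its own entry from stages. The sketch below illustrates that reading. It is an assumption about how the options combine, not Rubra's actual implementation. The helpers stage_options and fill_command are hypothetical, the substitution rule for the %file, %files and %out placeholders is guessed from their names, and the output file names are made up.

```python
stageDefaults = {
    "distributed": False,
    "walltime": "00:10:00",
    "memInGB": 1,
    "queue": "batch",
    "modules": ["python-gcc"],
}

stages = {
    "countLines": {"command": "wc -l %file > %out"},
    "total": {"command": "./test/total.py %files > %out"},
}


def stage_options(name):
    """Return the defaults overlaid with the options of one stage (assumed behaviour)."""
    options = dict(stageDefaults)   # copy so the defaults stay untouched
    options.update(stages[name])    # stage-specific values win
    return options


def fill_command(template, **values):
    """Substitute %name placeholders in a command template (hypothetical helper)."""
    # Replace longer names first so that %files is not clobbered by %file.
    for key in sorted(values, key=len, reverse=True):
        template = template.replace("%" + key, values[key])
    return template


options = stage_options("countLines")
# The output path below is a made-up example, not a name Rubra uses.
command = fill_command(options["command"],
                       file="test/data1.txt", out="test/data1.count")
print(options["queue"], options["walltime"])
print(command)
```

Under this reading, countLines runs on the "batch" queue with a ten-minute walltime because it does not override either value.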