analysis-workflows

Overview

The McDonnell Genome Institute (MGI) and contributing staff, faculty, labs and departments of Washington University School of Medicine (WUSM) share Common Workflow Language (CWL) workflow definitions focused on reusable, reproducible analysis pipelines for genomics data.

Structure

The main structure of this repo is described in the following table:

Path	Description
definitions	parent directory containing all CWL tool and workflow definitions
definitions/pipelines	all workflows which rely on subworkflows and tools to produce final outputs
definitions/subworkflows	workflows that combine multiple tools to produce intermediate (used as inputs to other subworkflows) pipeline outputs
definitions/tools	CWL that wrap command line interfaces or scripts connecting multiple tools
definitions/types	custom CWL data types for inputs to tools and workflows
example_data	example input data, input YAML files, and expected output files for testing

Documentation

All documentation of CWL pipelines, subworkflows, and tools as well as additional information regarding test data, continous integration, and configuration can be found on the GitHub wiki: https://github.com/genome/analysis-workflows/wiki

Quick Start

Workflows

Download our repository with git clone https://github.com/genome/analysis-workflows.git

The official CWL user guide covers the basics of reading and writing CWL files, constructing input files, and running workflows.

Workflow Execution Service

These workflow definitions are built for interoperability with any Workflow Execution Service (WES) schema compatible implementation that supports CWL.

Each CWL file is validated using cwltool. Additional workflow definition testing is performed with Cromwell. However, currently there are no automated workflow tests using Cromwell.

Docker

In order to provide a portable environment, each tool in our workflow has a designated Docker container. Download Docker here.

All MGI supported Docker images used in the tool workflow definitions are available on mgibio DockerHub.

Many tools rely on third-party Docker images publicly available from sources such as Docker Hub and BioContainers.

Data

Full reference data is documented and available for download on the wiki Coming soon

Example data, packaged together with fully populated yamls corresponding to top level workflows in this repo's definitions/pipelines directory, can be found on our public GCP bucket. To download this package, use our helper docker container: docker run -v <desired_absolute_path>:/staging mgibio/data_downloader:0.1.0 gsutil -m cp -r gs://analysis-workflows-example-data /staging

Note: We are currently migrating and updating our example data. Files within the example_data directory of this repository are no longer fully supported, and some are out of date. Moving forward, all data will be hosted in GCP. The instructions above currently download the full, uncompressed example data set (~800 mb). More granular, compressed downloads are upcoming. Advanced users may explore the bucket structure and download individual files using wget https://storage.googleapis.com/analysis-workflows-example-data/[path_to_file] (omitting path_to_file will download a manifest describing the directory structure).

Contributions

A big thanks to all of the developers, bioinformaticians and scientists who built this resource. For a complete list of software contributions, i.e. commits, to this repository, please see the GitHub Contributors section.

Collaborators

The following WUSM collaborators have provided significant contributions in terms of workflow design, scientific direction, and validation of analysis-workflows output.

genome / analysis-workflows

readme