Mizzou-CBMI / COSMOS2

Python Scientific Pipeline Management System
GNU General Public License v3.0

.. image:: https://travis-ci.org/Mizzou-CBMI/COSMOS2.svg?branch=master
    :target: https://travis-ci.org/Mizzou-CBMI/COSMOS2

Documentation

`http://mizzou-cbmi.github.io/COSMOS2/ <http://mizzou-cbmi.github.io/COSMOS2/>`_

Install

From pip:

.. code-block:: bash

    pip install cosmos-wfm

    # Optional, recommended for visualizing Workflows:
    sudo apt-get install graphviz graphviz-dev  # or brew install graphviz for mac
    pip install pygraphviz  # requires graphviz

From conda:

.. code-block:: bash

    conda install cosmos-wfm -c ravelbio

Introduction

Cosmos is a Python library for creating scientific pipelines that run on a distributed computing cluster. It is primarily designed and used for machine learning and bioinformatics pipelines, but it is general enough for any type of distributed computing workflow and is also used in fields such as image processing.

Cosmos provides a simple Python API for specifying any job DAG in plain Python code, which makes it flexible and intuitive.
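
To make the idea of a job DAG written in plain Python concrete, here is a minimal sketch that uses only the standard library. It is not the Cosmos API: the job names and the ``run_order`` helper are hypothetical, and the project documentation linked above describes the real interface.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each key is a job, each value is the set of jobs
# it depends on. The names are illustrative and not part of Cosmos.
dag = {
    "align_sample_a": set(),
    "align_sample_b": set(),
    "call_variants": {"align_sample_a", "align_sample_b"},
    "report": {"call_variants"},
}


def run_order(graph):
    """Return one valid execution order in which parents precede children."""
    return list(TopologicalSorter(graph).static_order())


order = run_order(dag)
print(order)
```

Because the graph is ordinary Python data, it can be built with loops, conditionals, and functions, which is the flexibility the paragraph above refers to.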

Cosmos allows you to resume modified or failed workflows, uses SQL to store job information, and provides a web dashboard for monitoring and debugging. It differs from libraries such as `Luigi <https://github.com/spotify/luigi>`_ or `Airflow <http://airbnb.io/projects/airflow/>`_, which also try to solve ETL problems such as scheduling recurring tasks and listening for events.

Cosmos is tightly focused on reproducible scientific pipelines, which lets it keep a very simple state. There is a single process per Workflow, which is a Python script, and a single process per Task, which is a Python function represented by an executable script. When a Task fails, reproducing its exact environment is as simple as re-running the command script. Since the command script is a Python script, you can also launch it under a debugger (``python -m ipdb log/stage/uid/command_attempt``).

The same pipeline can also easily be run on a variety of compute infrastructure: locally, in the cloud, or on a grid computing cluster.

Cosmos is intended for both one-off analyses and production software. Users have analyzed >100 whole genomes (~50TB and tens of thousands of jobs) in a single Workflow without issue, and some of the largest clinical sequencing laboratories use it for their production and R&D workflows. We routinely use it to run workflows consisting of tens of thousands of machine learning jobs.

AWS Batch


We have used AWS Batch heavily for the past year, and it is by far the most developed and supported DRM (distributed resource manager). It is hard to keep supporting DRMs that we do not use day to day, so that work is mostly left to the community using Cosmos. Support for each DRM is contained in a single class, which people often tweak for their particular distributed computing environment. See the classes in ``cosmos/job/drm``; the interface has only a handful of methods that must be implemented.
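
As a rough illustration of what "a single class with a handful of methods" can look like, the sketch below defines a made-up interface and a local-subprocess implementation of it. The class names and the method names (``submit``, ``is_done``, ``exit_status``, ``kill``) are assumptions chosen for this example; the actual interface is defined by the classes in ``cosmos/job/drm`` and may differ.

```python
import subprocess
import sys
from abc import ABC, abstractmethod


class HypotheticalDRM(ABC):
    """Illustrative interface only; these method names are assumptions."""

    @abstractmethod
    def submit(self, command):
        """Start a job and return an identifier for it."""

    @abstractmethod
    def is_done(self, job_id):
        """Return True once the job has finished."""

    @abstractmethod
    def exit_status(self, job_id):
        """Block until the job finishes and return its exit code."""

    @abstractmethod
    def kill(self, job_id):
        """Terminate a running job."""


class LocalProcessDRM(HypotheticalDRM):
    """Runs each job as a local subprocess, keyed by its process id."""

    def __init__(self):
        self._procs = {}

    def submit(self, command):
        proc = subprocess.Popen(command)
        self._procs[proc.pid] = proc
        return proc.pid

    def is_done(self, job_id):
        return self._procs[job_id].poll() is not None

    def exit_status(self, job_id):
        return self._procs[job_id].wait()

    def kill(self, job_id):
        self._procs[job_id].kill()


# Usage: run a trivial job with the current Python interpreter.
drm = LocalProcessDRM()
job_id = drm.submit([sys.executable, "-c", "print('hello')"])
status = drm.exit_status(job_id)
```

Keeping everything scheduler-specific behind one small class is what makes it practical to adapt a pipeline to a new computing environment without touching the workflow code.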

See ``examples/ex_awsbatch.py`` for details on how to use the AWS Batch DRM. Jobs submit and terminate much faster than with any other DRM. It is a great way to use cheap AWS spot instances for both machine learning and bioinformatics workflows. Cosmos will automatically resubmit jobs that fail due to a spot-instance termination.
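
The resubmission behaviour can be pictured with the following sketch. It is a simplified model under stated assumptions, not Cosmos's implementation: the ``Attempt`` record, the ``spot_terminated`` flag, and ``run_with_resubmission`` are hypothetical names, and the job is simulated rather than submitted to AWS.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """Outcome of one hypothetical job attempt."""

    exit_code: int
    spot_terminated: bool = False


def run_with_resubmission(submit, max_attempts=3):
    """Call submit() until it succeeds, retrying only spot terminations.

    Returns the list of attempts made so the caller can inspect the history.
    """
    attempts = []
    for _ in range(max_attempts):
        attempt = submit()
        attempts.append(attempt)
        if attempt.exit_code == 0 or not attempt.spot_terminated:
            break  # success, or a genuine failure that a retry will not fix
    return attempts


# Simulated job: reclaimed by the cloud provider once, then succeeds.
outcomes = iter([Attempt(exit_code=1, spot_terminated=True), Attempt(exit_code=0)])
history = run_with_resubmission(lambda: next(outcomes))
print(len(history), history[-1].exit_code)  # 2 0
```

The point of the distinction is that an instance being reclaimed says nothing about the job itself, so retrying is safe, whereas a job that fails on its own is reported rather than retried blindly.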

History


Cosmos was published as an Application Note in the journal `Bioinformatics <http://bioinformatics.oxfordjournals.org/>`_, but it has evolved a lot since its inception. If you use Cosmos for research, please cite its `manuscript <http://bioinformatics.oxfordjournals.org/content/early/2014/06/29/bioinformatics.btu385>`_.

Since the original publication, it has been rewritten and open-sourced by the original author, in a collaboration between `The Lab for Personalized Medicine <http://lpm.hms.harvard.edu/>`_ at Harvard Medical School, the `Wall Lab <http://wall-lab.stanford.edu/>`_ at Stanford University, and `Invitae <http://invitae.com>`_. Invitae is a leading clinical genetic sequencing diagnostics laboratory where Cosmos is deployed in production and has processed hundreds of thousands of samples. It is also used by various research groups around the world; if you use it for cool stuff, please let us know!

Features


Web Dashboard


.. figure:: docs/source/_static/imgs/web_interface.png
    :align: center

Multi-platform Support
+++++++++++++++++++++++

Bug Reports


Please use the `GitHub Issue Tracker <https://github.com/Mizzou-CBMI/Cosmos2/issues>`_.

Testing


Run the test suite with either ``python setup.py test`` or:

.. code-block:: bash

    py.test

Building Docs


In a Python 2.7 environment:

.. code-block:: bash

    pip install ghp-import sphinx sphinx_rtd_theme
    cd docs
    make html
    cd build/html
    ghp-import -n ./ -p

Building Conda Package


.. code-block:: bash

    python devops.py release

    rm -rf cosmos-wfm
    conda skeleton pypi cosmos-wfm --version 2.13.4
    conda build cosmos-wfm
    anaconda upload /home/egafni/miniconda3/conda-bld/linux-64/cosmos-wfm-2.13.4-py38_0.tar.bz2 -u ravelbio

Cosmos Users


Please let us know if you're using Cosmos by sending a PR with your company or lab name and any relevant information.

Publications using Cosmos


1) Elshazly H, Souilmi Y, Tonellato PJ, Wall DP, Abouelhoda M (2017) MC-GenomeKey: a multicloud system for the detection and annotation of genomic variants. BMC Bioinformatics, 18(1), 49.

2) Souilmi Y, Lancaster AK, Jung JY, Rizzo E, Hawkins JB, Powles R, Amzazi S, Ghazal H, Tonellato PJ, Wall DP (2015) Scalable and cost-effective NGS genotyping in the cloud. BMC Medical Genomics, 8(1), 64.

3) Souilmi Y, Jung J-Y, Lancaster AK, Gafni E, Amzazi S, Ghazal H, Wall DP, Tonellato P (2015) COSMOS: cloud enabled NGS analysis. BMC Bioinformatics, 16(Suppl 2), A2. doi: 10.1186/1471-2105-16-S2-A2

4) Gafni E, Luquette LJ, Lancaster AK, Hawkins JB, Jung J-Y, Souilmi Y, Wall DP, Tonellato PJ (2014) COSMOS: Python library for massively parallel workflows. Bioinformatics, 30(20), 2956-2958. doi: 10.1093/bioinformatics/btu385

5) Hawkins JB, Souilmi Y, Powles R, Jung JY, Wall DP, Tonellato PJ (2013) COSMOS: NGS Analysis in the Cloud. AMIA TBI. BMC Medical Genomics

Changelog


2.13.0
+++++++

SQL Column added! If you see this error:

``sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such column: task.status_reason``

This happens because the new version of Cosmos is not backwards compatible with databases created by older versions. To use Cosmos 2.13.0 with an old database, migrate it by adding the new column. For example:

.. code-block:: bash

    sqlite3 cosmos.sqlite
    sqlite> alter table task add status_reason CHAR(255);
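
The same migration can be scripted from Python with the standard ``sqlite3`` module, which is convenient when several databases need updating. This is a sketch under assumptions: the ``add_status_reason_column`` helper is hypothetical, and the demonstration runs against an in-memory stand-in table rather than a real Cosmos database. Back up your database before altering it.

```python
import sqlite3


def add_status_reason_column(conn):
    """Add task.status_reason if it is missing; return True if it was added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute("PRAGMA table_info(task)")]
    if "status_reason" in columns:
        return False
    conn.execute("ALTER TABLE task ADD COLUMN status_reason CHAR(255)")
    conn.commit()
    return True


# Demonstration on an in-memory database with a stand-in task table.
# For a real migration, connect to your own file instead,
# e.g. sqlite3.connect("cosmos.sqlite").
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task (id INTEGER PRIMARY KEY, status VARCHAR(50))")
first = add_status_reason_column(conn)
second = add_status_reason_column(conn)
print(first, second)  # True False
```

Checking for the column first makes the script safe to run more than once, since SQLite raises an error if the column already exists.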

2.12.0
++++++

2.11.0
++++++++

2.5.1
++++++

API Change!

2.5.0
++++++

2.0.1
++++++

Some pretty big changes here, made during a hackathon at Invitae where a lot of feedback and contributions were received. Primarily, the API was simplified and made more intuitive. A new Cosmos primitive called a Dependency was created, which we have found extremely useful for generalizing subworkflow recipes. This API is now considered to be much more stable.