For convenience, logging Peter's original email announcing this integration here:
Subject: bcbio-nextgen on common workflow language Date: Tue, 13 Jan 2015 13:28:40 -0500
Hi John,
On the Common Workflow Language call today you mentioned that you were interested in how bcbio-nextgen might take advantage of the Common Workflow Language. You may not be aware, but I have already been talking to Brad Chapman about this; the project is just getting started. Here's the general approach to get to a proof of concept:
If this is something you are interested in, we should talk.
Thanks, Peter
Hello,
After re-reading Peter’s email above and reviewing some pertinent bits of the code, I have a question or two.
-jk
John; Thanks so much for looking at this. Here's a docker build we can use to work off of, and I'll update it on major releases:
https://s3.amazonaws.com/bcbio/docker/bcbio-stable.gz
For your questions, the general idea with bcbio is that you wouldn't wrap individual programs but rather bcbio functions. So you'd be calling things like prep_align_inputs with the data object that contains information about a sample, essentially replicating the high-level logic that happens in the main functions (https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/pipeline/main.py#L145). All of the command line calls then happen within Docker, driven by bcbio, so those don't have to be exposed to CWL. CWL just knows about bcbio and how all the top-level functionality runs without needing to know the details of uniq and bwa and other tools. You're right that CWL would drive the running (through Arvados or Galaxy) and we would no longer need the main functionality in bcbio.
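To make that calling pattern concrete, here is a minimal sketch; the wrapper function and the keys in the data dictionary are illustrative, not bcbio's actual schema:

```python
# Sketch only: the workflow engine calls a high-level bcbio step with a
# per-sample `data` dictionary instead of wrapping bwa/uniq/etc. directly.
# The keys shown here are illustrative; the real data object carries many more.

def run_bcbio_step(step_fn, data):
    """Invoke one high-level bcbio function (e.g. prep_align_inputs)
    with the data object describing a single sample."""
    return step_fn(data)

data = {
    "description": "Test1",                         # sample name
    "files": ["sample_1.fq.gz", "sample_2.fq.gz"],  # input fastqs
    "config": {"algorithm": {"aligner": "bwa"}},    # from bcbio_sample.yaml
}
# Inside the Docker container this would become, e.g.:
#   run_bcbio_step(prep_align_inputs, data)
```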
Thanks again for all this.
Thanks for the feedback Brad. Now I understand the concern you raised about the data object.
bcbio's data object is created from two configuration files: bcbio_sample.yaml and bcbio_system.yaml. What needs to be exposed to the underlying platform? Below I have categorized the information in the configuration files to better understand the interface between bcbio and the underlying platform.
bcbio_sample.yaml is composed of
bcbio_system.yaml is composed of
Does this sound roughly right or do you see it differently?
-jk
Hello @chapmanb,
I am looking at the related bcbio-nextgen-vm project. I want to gain a deeper understanding of the mount points. Currently, the number and nature of the various mount points for each docker container are an implementation detail of bcbio. With a version of bcbio running on a CWL-compliant platform, we would need to explicitly specify these mount points, as the platform will be establishing them (or some moral equivalent). Could you speak to the required mount points? This would be a big help with respect to establishing an interface between bcbio and a CWL-compliant platform.
Thanks, John
John;
Thanks again for looking at this and sorry for the delay in responding. Having specific input file mount points for integration with tools like Arvados is definitely something we'd like to have. I've done some initial work on this, testing with push/pull from S3. The approach is to annotate the world data object with keys required for processing. They get added to the fresources (file resources) key in the data dictionary:
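As an illustration only (the nested key names and S3 paths here are hypothetical, not the actual schema):

```python
# Illustrative sketch of a data dictionary annotated with the fresources
# (file resources) key described above. The nested key names and the S3
# paths are hypothetical; only the fresources key comes from the discussion.
data = {
    "description": "Test1",
    "fresources": {
        "ref_file": "s3://example-bucket/GRCh37/seq/GRCh37.fa",
        "input_fastq": "s3://example-bucket/inputs/sample_1.fq.gz",
    },
}
```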
Then this is a proof of concept implementation that uses these to transfer down files from S3:
So the general idea is to do this from the data input dynamically, since files will be processed/added as part of some of the initial processing steps. The implementation is not fully complete but at least the ideas are there.
Practically, could we get away with docker + a shared filesystem as a first pass with CWL and just pass around the data dictionary as the dependency? I know we lose a lot of the file provenance, but it at least gets us a smoother path towards a working implementation that we could then add the file provenance onto.
Thanks again for all the help looking at this.
Hello Brad,
“Practically, could we get away with docker + a shared filesystem as a first pass with CWL and just pass around the data dictionary as the dependency?”
I don’t know, but I will look into it. With bcbio’s current implementation, the number of docker images and associated filesystems is fixed.
PaaS systems use containers too. Often they do not allow the application to access the filesystem (apps use a database instead). This gives the platform some administrative flexibility (i.e., it can shut the app down, move it, add more resources, etc.). I have no idea whether this applies to the CWL-compliant platforms we are talking about, but I will investigate.
I haven’t completed my due diligence on wrappers for the Standard Pipeline, so no doubt this is technically a bit off with respect to the number and nature of the desired wrappers. Please bear with me; I want to shed light on the interaction between bcbio and a CWL platform. We plan to write wrappers for the pertinent functions in the pipeline. As a rough first pass, bcbio’s Standard Pipeline is composed of:
What if it made sense to write a wrapper for each? The platform will call ‘docker run’ for each of these functions. There is no guarantee about which host or filesystem this happens on. Let’s consider a worst-case scenario in which each docker call happens on a unique host/filesystem combination.
Draft-1 ensures that there is a current working directory (cwd) and a tmp directory to which the tool can write. I need to confirm, but it is my understanding that the cwd will be different each time in this scenario.
Draft-1 of the CWL spec is very specific about input and output.
Clearly, it is file based. Looking at the architecture for Arvados, I imagine this is important: it allows the platform to store the output in its data store (Keep) and retrieve it via a hash code.
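For intuition, a toy content-addressed store captures the retrieval-by-hash idea (Keep itself is far more sophisticated; this is only a sketch):

```python
# Toy content-addressed store: save a blob under its own hash, retrieve it
# by that hash. This only illustrates the retrieval-by-hash idea; it is not
# how Keep is implemented.
import hashlib

store = {}

def keep_put(blob: bytes) -> str:
    digest = hashlib.md5(blob).hexdigest()
    store[digest] = blob
    return digest

def keep_get(digest: str) -> bytes:
    return store[digest]

locator = keep_put(b"variant calls ...")
assert keep_get(locator) == b"variant calls ..."
```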
@tetron, would you comment, confirm or clarify as needed?
Hi John, Brad,
“What if it made sense to write a wrapper for each? The platform will call ‘docker run’ for each of these functions. There is no guarantee about which host or filesystem this happens on. Let’s consider a worst-case scenario in which each docker call happens on a unique host/filesystem combination.”
This fits with the baseline assumption for CWL. The goal is to minimize the requirements on the host platform so that different PaaS providers can handle it in different ways. In particular, distributed filesystems such as Keep, Amazon S3 buckets, etc. are poorly suited to having multiple writers to the same directory tree, so shared filesystems are out.
If you can identify all the fields that correspond to file or directory paths within the bcbio "data" input object using the CWL "inputs" schema, then the CWL host can rewrite the paths to the right place on the host. This is similar to the _unpack_s3() method, which rewrites s3 file references to local references, except that it would happen outside of bcbio.
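A minimal sketch of that rewriting idea, assuming a hypothetical resolve() callback supplied by the host (the real _unpack_s3() lives inside bcbio and differs in detail):

```python
# Sketch of host-side path rewriting (all names hypothetical): walk the
# bcbio data object and rewrite anything that looks like a file reference
# to a host-local path, analogous to what _unpack_s3() does inside bcbio.
def rewrite_paths(obj, resolve):
    """Recursively rewrite s3:// references using the host's resolve()."""
    if isinstance(obj, dict):
        return {k: rewrite_paths(v, resolve) for k, v in obj.items()}
    if isinstance(obj, list):
        return [rewrite_paths(v, resolve) for v in obj]
    if isinstance(obj, str) and obj.startswith("s3://"):
        return resolve(obj)  # e.g. fetch into the job directory
    return obj

data = {"files": ["s3://bucket/sample_1.fq.gz"]}
local = rewrite_paths(data, lambda ref: "/tmp/job/" + ref.rsplit("/", 1)[1])
# local == {"files": ["/tmp/job/sample_1.fq.gz"]}
```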
Logging correspondence:
Hello Brad(@chapmanb) & Peter(@tetron),
Would it be prudent to meet up via hangout again to hash out some issues posed by integrating bcbio with CWL?
The initial integration point for CWL/bcbio is the 'bcbio_vm.py runfn' command. It takes several positional arguments. For this discussion, the runargs argument is pivotal. It represents the data world object.
Understanding the life cycle of the data world object is critical. Some of the information in it will come from bcbio; some the platform must define. We'll need a versatile way to create the data world object, one which allows us to compose it from both bcbio and the platform.
Currently, the data world object is stored as a file. Brad posted an example here: https://gist.github.com/chapmanb/45069087a549acbe6073. This really highlights why Brad wants the platform to simply pass in the directory. There are a dozen or more files dealing with input, configuration, output and reference data.
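As a sketch of how a platform might drive this entry point (the argument order and JSON serialization are assumptions on my part; only the runfn subcommand and the runargs/data world object come from the discussion above):

```python
# Hedged sketch: serialize the data world object and hand it to
# 'bcbio_vm.py runfn'. The argument order and JSON serialization are
# assumptions for illustration, not the documented interface.
import json
import subprocess

def run_bcbio_fn(fn_name, world, runargs_file="world.json"):
    """Write the data world object to a file and pass it as runargs."""
    with open(runargs_file, "w") as f:
        json.dump(world, f, indent=2)
    subprocess.check_call(["bcbio_vm.py", "runfn", fn_name, runargs_file])

run_bcbio_fn("prep_align_inputs", {"files": ["sample_1.fq.gz"]})
```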
From talking with you two, some design guidelines that come to mind are:
I think we are going to need to add some kind of handshake between the platform and bcbio. For example,
Issue: I think the number and nature of these handshakes would be a healthy place for a design discussion.
For me, the purpose of this meeting is simply to ensure I'm going in the right direction.
Thanks for your time and consideration.
Regards, -jk
Hello Brad(@chapmanb),
I have been reviewing the source to understand how the data world object is created. It appears that there is no single function which creates the data world object; instead it is gradually built up. Do I understand it correctly?
-jk
Hello,
What do you think about adding a new option to create the data world object?
$ bcbio_nextgen.py ioc_data_constructor sample.yaml system.yaml
It will be given the two standard yaml files (sample and system) and return a file containing a valid data world object. ‘ioc’ stands for inversion of control. I wanted a prefix to key users into the fact that this is a specialized command. Users of CWL, testers and possibly others will read the documentation and know these are for them. All others can safely ignore them.
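A minimal sketch of what the subcommand might do, assuming a placeholder composition of the two files (the real data world object would follow bcbio's internal schema, not the nesting shown here):

```python
# Sketch of the proposed ioc_data_constructor: merge the two standard
# config files into one data world object and write it to a file. The
# {"sample": ..., "system": ...} nesting is a placeholder, not bcbio's
# actual schema.
import json
import sys

import yaml  # third-party: PyYAML

def ioc_data_constructor(sample_yaml, system_yaml, out_file="world.json"):
    with open(sample_yaml) as f:
        sample = yaml.safe_load(f)
    with open(system_yaml) as f:
        system = yaml.safe_load(f)
    world = {"sample": sample, "system": system}
    with open(out_file, "w") as f:
        json.dump(world, f, indent=2)
    return out_file

if __name__ == "__main__":
    ioc_data_constructor(sys.argv[1], sys.argv[2])
```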
-jk
Rumour has it that bcbio-nextgen has complete CWL support. Can this be verified here?
Michael; Thanks for bumping this thread. We have been working on bcbio CWL support. While not complete, we can run parallel alignment and variant calling pipelines with bcbio both locally with cwltool and on Arvados. This is still a work in progress as we test scaling and build out full support, but documentation is here:
https://bcbio-nextgen.readthedocs.org/en/latest/contents/cwl.html
and here's a test run on Arvados:
https://cloud.curoverse.com/pipeline_instances/qr1hi-d1hrv-0fkncxo7asjw3jh
I'll close this issue for now as I think we've got the initial steps in place for supporting CWL and now need to focus on expanding and testing this support. John, thanks for all the help and discussion here and happy to catch you up if you have free cycles to do bcbio work in the future.
This issue is simply intended to be a long running ticket where we can log correspondence pertinent to integrating the Common Workflow Language with bcbio.
Brad, after you create a stable docker image, would you post the link here?