For convenience, logging Peter's original email announcing this integration here:
Subject: bcbio-nextgen on common workflow language Date: Tue, 13 Jan 2015 13:28:40 -0500
Hi John,
On the Common Workflow Language call today you mentioned that you were interested in how bcbio-nextgen might take advantage of the Common Workflow Language. You may not be aware, but I have already been talking to Brad Chapman about this; the project is just getting started. Here's the general approach to get to a proof of concept:
If this is something you are interested in, we should talk.
Thanks, Peter
Hello,
After re-reading Peter’s email above and reviewing some pertinent bits of the code, I have a question or two.
-jk
John; Thanks so much for looking at this. Here's a docker build we can use to work off of, and I'll update it on major releases:
https://s3.amazonaws.com/bcbio/docker/bcbio-stable.gz
For your questions, the general idea with bcbio is that you wouldn't wrap individual programs but rather bcbio functions. So you'd be calling things like prep_align_inputs with the data object that contains information about a sample, essentially replicating the high-level logic that happens in the main functions (https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/pipeline/main.py#L145). All of the command line calls then happen within Docker, driven by bcbio, so those don't have to be exposed to CWL. CWL just knows about bcbio and how all the top-level functionality runs without needing to know the details of uniq and bwa and other tools. You're right that CWL would drive the running (through Arvados or Galaxy) and we would no longer need the main functionality in bcbio.
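To make that calling pattern concrete, here is a minimal sketch; the wrapper function and the keys in the data dictionary are illustrative, not bcbio's actual schema:

```python
# Sketch only: the workflow engine calls a high-level bcbio step with a
# per-sample `data` dictionary instead of wrapping bwa/uniq/etc. directly.
# The keys shown here are illustrative; the real data object carries many more.

def run_bcbio_step(step_fn, data):
    """Invoke one high-level bcbio function (e.g. prep_align_inputs)
    with the data object describing a single sample."""
    return step_fn(data)

data = {
    "description": "Test1",                         # sample name
    "files": ["sample_1.fq.gz", "sample_2.fq.gz"],  # input fastqs
    "config": {"algorithm": {"aligner": "bwa"}},    # from bcbio_sample.yaml
}
# Inside the Docker container this would become, e.g.:
#   run_bcbio_step(prep_align_inputs, data)
```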
Thanks again for all this.
Thanks for the feedback Brad. Now I understand the concern you raised about the data object.
bcbio's data object is created from two configuration files: bcbio_sample.yaml and bcbio_system.yaml. What needs to be exposed to the underlying platform? Below I have categorized the information in the configuration files to better understand the interface between bcbio and the underlying platform.
bcbio_sample.yaml is composed of
bcbio_system.yaml is composed of
Does this sound roughly right or do you see it differently?
-jk
Hello @chapmanb,
I am looking at the related bcbio-nextgen-vm project. I want to gain a deeper understanding of the mount points. Currently, the number and nature of the various mount points for each docker container are an implementation detail of bcbio. With a version of bcbio running on a CWL-compliant platform, we would need to explicitly specify these mount points, as the platform will be establishing them (or some moral equivalent). Could you speak to the required mount points? This would be a big help with respect to establishing an interface between bcbio and a CWL-compliant platform.
Thanks, John
John;
Thanks again for looking at this and sorry for the delay in responding. Having specific input file mount points for integration with tools like Arvados is definitely something we'd like to have. I've done some initial work on this, testing with push/pull from S3. The approach is to annotate the world data object with keys required for processing. They get added to the fresources (file resources) key in the data dictionary:
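As an illustration only (the nested key names and S3 paths here are hypothetical, not the actual schema):

```python
# Illustrative sketch of a data dictionary annotated with the fresources
# (file resources) key described above. The nested key names and the S3
# paths are hypothetical; only the fresources key comes from the discussion.
data = {
    "description": "Test1",
    "fresources": {
        "ref_file": "s3://example-bucket/GRCh37/seq/GRCh37.fa",
        "input_fastq": "s3://example-bucket/inputs/sample_1.fq.gz",
    },
}
```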
Then this is a proof of concept implementation that uses these to transfer down files from S3:
So the general idea is to do this from the data input dynamically, since files will be processed/added as part of some of the initial processing steps. The implementation is not fully complete but at least the ideas are there.
Practically, could we get away with docker + a shared filesystem as a first pass with CWL and just pass around the data dictionary as the dependency? I know we lose a lot of the file provenance, but it at least gets us a smoother path towards a working implementation that we could then add the file provenance onto.
Thanks again for all the help looking at this.
Hello Brad,
“Practically, could we get away with docker + a shared filesystem as a first pass with CWL and just pass around the data dictionary as the dependency?”
I don’t know, but I will look into it. With bcbio’s current implementation, the number of docker images and associated filesystems is fixed.
PaaS systems use containers too. Often they do not allow the application to access the filesystem (apps use a database instead). This gives the platform some administrative flexibility (i.e., it can shut the app down, move it, add more resources, etc.). I have no idea whether this applies to the CWL-compliant platforms we are talking about, but I will investigate.
I haven’t completed my due diligence on wrappers for the Standard Pipeline, so no doubt this is technically a bit off with respect to the number and nature of the desired wrappers. Please bear with me; I want to shed light on the interaction between bcbio and a CWL platform. We plan to write wrappers for the pertinent functions in the pipeline. As a rough first pass, bcbio’s Standard Pipeline is composed of:
What if it made sense to write a wrapper for each? The platform will call ‘docker run’ for each of these functions. There is no guarantee about which host or filesystem this happens on. Let’s consider a worst-case scenario in which each docker call happens on a unique host/filesystem combination.
Draft-1 ensures that there is a current working directory (cwd) and a tmp directory to which the tool can write. I need to confirm, but it is my understanding that the cwd will be different each time in this scenario.
Draft-1 of the CWL spec is very specific about input and output.
Clearly, it is file based. Looking at the architecture for Arvados, I imagine this is important: it allows the platform to store the output in its data store (Keep) and retrieve it via a hash code.
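For intuition, a toy content-addressed store captures the retrieval-by-hash idea (Keep itself is far more sophisticated; this is only a sketch):

```python
# Toy content-addressed store: save a blob under its own hash, retrieve it
# by that hash. This only illustrates the retrieval-by-hash idea; it is not
# how Keep is implemented.
import hashlib

store = {}

def keep_put(blob: bytes) -> str:
    digest = hashlib.md5(blob).hexdigest()
    store[digest] = blob
    return digest

def keep_get(digest: str) -> bytes:
    return store[digest]

locator = keep_put(b"variant calls ...")
assert keep_get(locator) == b"variant calls ..."
```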
@tetron, would you comment, confirm or clarify as needed?
Hi John, Brad,
“What if it made sense to write a wrapper for each? The platform will call ‘docker run’ for each of these functions. There is no guarantee about which host or filesystem this happens on. Let’s consider a worst-case scenario in which each docker call happens on a unique host/filesystem combination.”
This fits with the baseline assumption for CWL. The goal is to minimize the requirements on the host platform so that different PaaS providers can handle it in different ways. In particular, distributed filesystems such as Keep, Amazon S3 buckets, etc. are poorly suited to having multiple writers to the same directory tree, so shared filesystems are out.
If you can identify all the fields that correspond to file or directory paths within the bcbio "data" input object using the CWL "inputs" schema, then the CWL host can rewrite the paths to the right place on the host. This is similar to the _unpack_s3() method, which rewrites s3 file references to local references, except that it would happen outside of bcbio.
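A minimal sketch of that rewriting idea, assuming a hypothetical resolve() callback supplied by the host (the real _unpack_s3() lives inside bcbio and differs in detail):

```python
# Sketch of host-side path rewriting (all names hypothetical): walk the
# bcbio data object and rewrite anything that looks like a file reference
# to a host-local path, analogous to what _unpack_s3() does inside bcbio.
def rewrite_paths(obj, resolve):
    """Recursively rewrite s3:// references using the host's resolve()."""
    if isinstance(obj, dict):
        return {k: rewrite_paths(v, resolve) for k, v in obj.items()}
    if isinstance(obj, list):
        return [rewrite_paths(v, resolve) for v in obj]
    if isinstance(obj, str) and obj.startswith("s3://"):
        return resolve(obj)  # e.g. fetch into the job directory
    return obj

data = {"files": ["s3://bucket/sample_1.fq.gz"]}
local = rewrite_paths(data, lambda ref: "/tmp/job/" + ref.rsplit("/", 1)[1])
# local == {"files": ["/tmp/job/sample_1.fq.gz"]}
```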
Logging correspondence:
Hello Brad(@chapmanb) & Peter(@tetron),
Would it be prudent to meet up via hangout again to hash out some issues posed by integrating bcbio with CWL?
The initial integration point for CWL/bcbio is the 'bcbio_vm.py runfn' command. It takes several positional arguments. For this discussion, the runargs argument is pivotal. It represents the data world object.
Understanding the life cycle of the data world object is critical. Some of the information in it will come from bcbio; some the platform must define. We'll need a versatile way to create the data world object, one which allows us to compose it from both bcbio and the platform.
Currently, the data world object is stored as a file. Brad posted an example here: https://gist.github.com/chapmanb/45069087a549acbe6073. This really highlights why Brad wants the platform to simply pass in the directory. There are a dozen or more files dealing with input, configuration, output and reference data.
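As a sketch of how a platform might drive this entry point (the argument order and JSON serialization are assumptions on my part; only the runfn subcommand and the runargs/data world object come from the discussion above):

```python
# Hedged sketch: serialize the data world object and hand it to
# 'bcbio_vm.py runfn'. The argument order and JSON serialization are
# assumptions for illustration, not the documented interface.
import json
import subprocess

def run_bcbio_fn(fn_name, world, runargs_file="world.json"):
    """Write the data world object to a file and pass it as runargs."""
    with open(runargs_file, "w") as f:
        json.dump(world, f, indent=2)
    subprocess.check_call(["bcbio_vm.py", "runfn", fn_name, runargs_file])

run_bcbio_fn("prep_align_inputs", {"files": ["sample_1.fq.gz"]})
```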
From talking with you two, some design guidelines that come to mind are:
I think we are going to need to add some kind of handshake between the platform and bcbio. For example,
Issue: I think the number and nature of these handshakes would be a healthy place for a design discussion.
For me, the purpose of this meeting is simply to ensure I'm going in the right direction.
Thanks for your time and consideration.
Regards, -jk
Hello Brad(@chapmanb),
I have been reviewing the source to understand how the data world object is created. It appears that there is no single function which creates the data world object; instead it is gradually built up. Do I understand it correctly?
-jk
Hello,
What do you think about adding a new option to create the data world object?
$ bcbio_nextgen.py ioc_data_constructor sample.yaml system.yaml
It will be given the two standard yaml files (sample and system) and return a file containing a valid data world object. ‘ioc’ stands for inversion of control. I wanted a prefix to key users into the fact that this is a specialized command. Users of CWL, testers and possibly others will read the documentation and know these are for them. All others can safely ignore them.
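A minimal sketch of what the subcommand might do, assuming a placeholder composition of the two files (the real data world object would follow bcbio's internal schema, not the nesting shown here):

```python
# Sketch of the proposed ioc_data_constructor: merge the two standard
# config files into one data world object and write it to a file. The
# {"sample": ..., "system": ...} nesting is a placeholder, not bcbio's
# actual schema.
import json
import sys

import yaml  # third-party: PyYAML

def ioc_data_constructor(sample_yaml, system_yaml, out_file="world.json"):
    with open(sample_yaml) as f:
        sample = yaml.safe_load(f)
    with open(system_yaml) as f:
        system = yaml.safe_load(f)
    world = {"sample": sample, "system": system}
    with open(out_file, "w") as f:
        json.dump(world, f, indent=2)
    return out_file

if __name__ == "__main__":
    ioc_data_constructor(sys.argv[1], sys.argv[2])
```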
-jk
Rumour has it that bcbio-nextgen has complete CWL support. Can this be verified here?
Michael; Thanks for bumping this thread. We have been working on bcbio CWL support. While not complete, we can run parallel alignment and variant calling pipelines with bcbio both locally with cwltool and on Arvados. This is still a work in progress as we test scaling and build out full support, but documentation is here:
https://bcbio-nextgen.readthedocs.org/en/latest/contents/cwl.html
and here's a test run on Arvados:
https://cloud.curoverse.com/pipeline_instances/qr1hi-d1hrv-0fkncxo7asjw3jh
I'll close this issue for now as I think we've got the initial steps in place for supporting CWL and now need to focus on expanding and testing this support. John, thanks for all the help and discussion here and happy to catch you up if you have free cycles to do bcbio work in the future.
This issue is simply intended to be a long running ticket where we can log correspondence pertinent to integrating the Common Workflow Language with bcbio.
Brad, after you create a stable docker image, would you post the link here?