databio / bulker

Manager for multi-container computing environments
https://bulker.io
BSD 2-Clause "Simplified" License

Bulker + CWL synergy and enhancement #63

Open nsheff opened 4 years ago

nsheff commented 4 years ago

I've been working on some updates to interface Bulker with CWL. Here are some notes and brainstorming about it.

Motivation

Bulker is very nice for making an interactive environment where a user runs tools as if they are native, but they actually run in containers. This makes a nice, portable, and sharable environment that is reproducible across computing hardware, and also across container engines (docker and singularity). It's really useful for interactivity, and I'm unaware of any other tool that does something like this.

I envisioned that these interactive environments could also be useful to containerize a workflow. For example, a bulker environment can make a native workflow immediately containerized, which simplifies the process of containerization for a workflow author. The workflow just needs to be written using native tools, but then run inside an active -- and ideally strict -- bulker environment, and you get containerization and reproducibility for free.

But existing tools like CWL also have built-in container management that people are using to containerize workflows. In CWL, individual tool files can specify the images they use to run. The CWL engine then uses these images to containerize the workflow, which fulfills roughly the same role that the bulker environments could fulfill. In the CWL approach, a tool definition is tightly coupled to its container. In the bulker approach, they are separated.

Some advantages of the CWL tight coupling approach are:

Some advantages of the decoupling are:

Synergy

One way to promote a connection and possibly get benefits of both methods is to make it easy to convert back-and-forth. To do that we'd need to enable two directions:

From a CWL to a bulker environment

If you could take a CWL and get an interactive environment, that would be useful. So, I've now implemented this in the cwl2man (CWL -> Bulker manifest) command in bulker. Given a list of CWL tool descriptions, bulker can create a manifest that can then be used interactively.

As a test, I cloned bio-cwl-tools and built a bulker manifest from it, so we can create an interactive environment representing that repository. It works like this:

bulker cwl2man -f bio-cwl-manifest.yaml -c `ls */*.cwl`
bulker load cwl/test -f bio-cwl-manifest.yaml
bulker activate cwl/test
samtools

It was pretty simple on the surface. Will likely run into some details that need to be solved, but for now, it worked for some basic stuff.
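To make the CWL -> manifest direction concrete, here is a rough sketch of the kind of conversion involved. It is not the actual cwl2man code. It assumes each CWL tool has already been parsed into a Python dict (for example by a YAML loader), that the image lives in a DockerRequirement with dockerPull, and that a manifest is a list of commands with command, docker_image, and docker_command fields. The function names are hypothetical, and the manifest field names should be checked against the bulker docs.

```python
def docker_image(tool):
    """Return the dockerPull image of a parsed CWL tool, or None if it has none."""
    for section in ("requirements", "hints"):
        entries = tool.get(section) or []
        # CWL allows a list of {class: ...} entries or a map keyed by class name.
        if isinstance(entries, dict):
            entries = [dict(value or {}, **{"class": key})
                       for key, value in entries.items()]
        for entry in entries:
            if entry.get("class") == "DockerRequirement" and "dockerPull" in entry:
                return entry["dockerPull"]
    return None


def command_name(tool):
    """Return the executable name from baseCommand (a string or a list)."""
    base = tool.get("baseCommand")
    if isinstance(base, list):
        return base[0] if base else None
    return base


def tools_to_manifest(tools, name="test"):
    """Build a manifest-like dict (assumed layout) from parsed CWL tools."""
    commands = []
    seen = set()
    for tool in tools:
        command, image = command_name(tool), docker_image(tool)
        # Skip tools without a command or image; the first image seen wins
        # when several tools share an executable (samtools sort, samtools index).
        if not command or not image or command in seen:
            continue
        seen.add(command)
        commands.append({
            "command": command,
            "docker_image": image,
            "docker_command": command,
        })
    return {"manifest": {"name": name, "commands": commands}}
```

One of the details that would need solving shows up here: several CWL tool files can wrap the same executable with different images, and a manifest has to pick one. This sketch just keeps the first.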

From a bulker environment to a CWL workflow

Given a bulker environment, I could take a CWL workflow and update the containers to match the bulker environment. This would make it pretty simple to write a non-container-aware CWL workflow, and then just immediately containerize it. Haven't implemented this, but you'd do something like:

bulker containerize cwl/test -w workflow.cwl
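Since this command doesn't exist yet, the following is only a sketch of what the core step might do for a single tool description. It assumes the tool and the manifest are already parsed into Python dicts, that the manifest uses command and docker_image fields, and that the tool is not container-aware (no DockerRequirement under requirements, which would take precedence over a hint). The containerize function name is hypothetical.

```python
import copy


def containerize(tool, manifest):
    """Return a copy of a parsed CWL tool with a DockerRequirement hint
    pointing at the image the manifest assigns to the tool's command."""
    images = {entry["command"]: entry["docker_image"]
              for entry in manifest["manifest"]["commands"]}

    base = tool.get("baseCommand")
    if isinstance(base, list):
        command = base[0] if base else None
    else:
        command = base

    result = copy.deepcopy(tool)  # never modify the caller's document
    if command not in images:
        return result  # the manifest does not know this command; leave as-is

    hints = result.get("hints") or []
    # Normalize the map form ({ClassName: {...}}) to the list form.
    if isinstance(hints, dict):
        hints = [dict(value or {}, **{"class": key})
                 for key, value in hints.items()]
    # Drop any existing Docker hint, then add the one from the manifest.
    hints = [h for h in hints if h.get("class") != "DockerRequirement"]
    hints.append({"class": "DockerRequirement", "dockerPull": images[command]})
    result["hints"] = hints
    return result
```

A real implementation would also have to walk a workflow's steps to reach each tool, and decide whether to write hints or requirements.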

A set of common tool descriptions

A useful thing for a CWL developer is to have a set of ready-made tool descriptions, and this is the goal behind the bio-cwl-tools repository: "to collect and collaboratively maintain CWL CommandLineTool descriptions of any biology/life-sciences related applications."

This is in fact not too different from a bulker manifest, really. The differences are that the manifest is only about images, not about interfaces, whereas the CWL descriptions cover both; and that the manifests are version controlled as a collection and hosted via bulker hub. But perhaps these two ideas can synergize into one: a centrally located collection of bioinformatics tools that is version controlled as a collection, and available both as a bulker manifest and as CWL tool descriptions. This way, someone could use such a set interactively with bulker, or as a tool description resource for building CWL workflows, which would then be tied to specific bulker environment versions.