I've been working on some updates to interface Bulker with CWL. Here are some notes and brainstorming about it.
Motivation
Bulker is very nice for making an interactive environment where a user runs tools as if they are native, but they actually run in containers. This makes a nice, portable, and sharable environment that is reproducible across computing hardware, and also across container engines (docker and singularity). It's really useful for interactivity, and I'm unaware of any other tool that does something like this.
I envisioned that these interactive environments could also be useful to containerize a workflow. For example, a bulker environment can make a native workflow immediately containerized, which simplifies the process of containerization for a workflow author. The workflow just needs to be written using native tools, but then run inside an active -- and ideally strict -- bulker environment, and you get containerization and reproducibility for free.
But existing tools like CWL also have built-in container management that people are using to containerize workflows. In CWL, individual tool files can specify the images they use to run. The CWL engine then uses these images to containerize the workflow, which fulfills roughly the same role that bulker environments could fulfill. In the CWL approach, a tool definition is tightly coupled to its container. In the bulker approach, they are separated.
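To make the decoupled model concrete, here is a minimal sketch of how a command-to-image mapping could be turned into a container invocation. It is an illustration only, not bulker's actual implementation: the manifest layout (a `commands` list with `command` and `docker_image` fields) and the function name `container_argv` are assumptions made for this example.

```python
def container_argv(manifest, command, args, engine="docker"):
    """Build the argv that would run `command` inside its mapped image.

    `manifest` is a plain dict in an assumed layout:
    {"manifest": {"commands": [{"command": ..., "docker_image": ...}]}}
    """
    # One image per command name: this is the decoupled mapping.
    images = {c["command"]: c["docker_image"]
              for c in manifest["manifest"]["commands"]}
    if command not in images:
        raise KeyError(f"command not in environment: {command}")
    image = images[command]

    if engine == "docker":
        return ["docker", "run", "--rm", image, command, *args]
    if engine == "singularity":
        # Singularity can execute directly from a docker:// image URI.
        return ["singularity", "exec", f"docker://{image}", command, *args]
    raise ValueError(f"unsupported engine: {engine}")
```

Because the lookup key is the command name, the same workflow text can run under either engine, and swapping the manifest swaps every image at once.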
Some advantages of the CWL tight coupling approach are:
There may be useful information in the connection from the task to the container, which bulker intentionally severs. You could perhaps reconstruct this information from a bulker environment, but with CWL it is hard-coded and immediately there.
You could use two separate versions of a tool in one workflow. With bulker, you could do this, but you'd have to name the command something different, since it maps commands to images, rather than each specific invocation of a tool.
There's already an infrastructure built up around this model, so there's inertia behind the idea of coupling a task tightly to its image.
If you hand the workflow off to someone else, you have some guarantee that it will work, because they can't change the tools. With bulker, you could just give them both the workflow and the environment, but because they could change the environment, they could introduce their own environment that would break the workflow.
others?
Some advantages of the decoupling are:
the environment is independent of the workflow; therefore, it can be re-used across multiple workflows
environments can be used interactively, for debugging, development, or just for everyday computing -- for example, this can supplant the need for environment module systems and for syncing environments between remote and local compute. So, the environments are useful well beyond workflows
it's easier to update/change environments; for instance, if I have a workflow and I want to upgrade all the tools, I can just use an updated bulker environment. Probably I'm version-controlling the bulker manifest anyway, so this sort of happens automatically, reducing long-term maintenance of the workflow
workflow authors don't need to care about or even be aware of containers or how they work; it's completely outsourced so they can focus on the workflow itself.
others?
Synergy
One way to promote a connection and possibly get benefits of both methods is to make it easy to convert back-and-forth. To do that we'd need to enable two directions:
From a CWL to a bulker environment
If you could take a CWL and get an interactive environment, that would be useful. So, I've now implemented this in the cwl2man (CWL -> Bulker manifest) command in bulker. Given a list of CWL tool descriptions, bulker can create a manifest that can then be used interactively.
To try it out, I cloned bio-cwl-tools and built a bulker manifest from it, so we can create an interactive environment representing that repository.
It was pretty simple on the surface. I will likely run into some details that need to be solved, but for now, it works for some basic stuff.
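The core of the conversion can be sketched roughly as follows. This is not the actual cwl2man code, just a minimal sketch under a few assumptions: the CWL tool descriptions have already been parsed into Python dicts, the command name is taken from the first element of `baseCommand`, and the manifest field names (`command`, `docker_image`) are assumed for illustration. The helper names are hypothetical.

```python
def docker_image(tool):
    """Return the dockerPull image of a parsed CWL tool, or None."""
    for section in ("requirements", "hints"):
        entries = tool.get(section) or []
        # CWL allows a list of dicts or a map keyed by class name.
        if isinstance(entries, dict):
            entries = [{**(value or {}), "class": key}
                       for key, value in entries.items()]
        for entry in entries:
            if entry.get("class") == "DockerRequirement" and entry.get("dockerPull"):
                return entry["dockerPull"]
    return None


def base_command(tool):
    """Return the executable name from baseCommand (string or list)."""
    base = tool.get("baseCommand")
    if isinstance(base, list):
        return base[0] if base else None
    return base


def build_manifest(name, tools):
    """Collect one command-to-image entry per distinct command name."""
    commands, seen = [], set()
    for tool in tools:
        command, image = base_command(tool), docker_image(tool)
        # Skip tools without an image, and keep the first image per command.
        if command and image and command not in seen:
            seen.add(command)
            commands.append({"command": command, "docker_image": image})
    return {"manifest": {"name": name, "commands": commands}}
```

Keeping only the first image per command is one of the details that would need a real decision: two tool descriptions that pin different versions of the same command cannot both map to one command name.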
From a bulker environment to a CWL workflow
Given a bulker environment, I could take a CWL workflow and update the containers to match the bulker environment. This would make it pretty simple to write a non-container-aware CWL workflow, and then just immediately containerize it. I haven't implemented this yet, but you'd do something like:
bulker containerize cwl/test -w workflow.cwl
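Since this command does not exist yet, here is only a rough sketch of what the underlying step might look like for a single tool description. It assumes the tool is already parsed into a Python dict, that the manifest uses the assumed `command` / `docker_image` field names, and that the image is recorded as a `DockerRequirement` hint; `containerize_tool` is a hypothetical name.

```python
import copy


def containerize_tool(tool, manifest):
    """Return a copy of a parsed CWL tool whose DockerRequirement hint
    matches the image the manifest assigns to the tool's command."""
    images = {c["command"]: c["docker_image"]
              for c in manifest["manifest"]["commands"]}
    updated = copy.deepcopy(tool)  # never modify the caller's dict

    base = updated.get("baseCommand")
    if isinstance(base, list):
        base = base[0] if base else None
    image = images.get(base)
    if image is None:
        return updated  # command is not in the environment; leave as-is

    hints = updated.get("hints") or []
    # Normalize the map form of hints to the list form.
    if isinstance(hints, dict):
        hints = [{**(value or {}), "class": key} for key, value in hints.items()]
    # Drop any existing image hint, then add the one from the environment.
    hints = [h for h in hints if h.get("class") != "DockerRequirement"]
    hints.append({"class": "DockerRequirement", "dockerPull": image})
    updated["hints"] = hints
    return updated
```

This sketch only touches `hints`; a `DockerRequirement` listed under `requirements` would still take precedence in a CWL engine, so a real implementation would need to handle that case too.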
A set of common tool descriptions
A useful thing for a CWL developer is to have a set of ready-made tool descriptions, and this is the goal behind the bio-cwl-tools repository: "to collect and collaboratively maintain CWL CommandLineTool descriptions of any biology/life-sciences related applications."
This is in fact not too different from a bulker manifest, really -- with the difference that the manifest is only about images, not about interfaces, whereas the CWL descriptions cover both; and the manifests are version controlled as a collection, and hosted via bulker hub. But perhaps these two ideas can synergize into one: a centrally located collection of bioinformatics tools that is version controlled as a collection, and available as both a bulker manifest and as CWL tool descriptions. This way, someone could use such a set interactively with bulker, or as a tool description resource for building CWL workflows, which would then be tied to specific bulker environment versions.