bionode / bionode-watermill

💧Bionode-Watermill: A (Not Yet Streaming) Workflow Engine
https://bionode.gitbooks.io/bionode-watermill/content/
MIT License

Common Workflow Language (CWL) support? #45

Open olgabot opened 7 years ago

olgabot commented 7 years ago

Hello! This is a very interesting project. In case you haven't seen it, there's a project called Common Workflow Language (CWL) that attempts to create a single document specifying a pipeline workflow that can be parsed by a multitude of programs so your pipeline can be run portably on the cloud, a laptop, a server, etc.

Wanted to let you know about other people in the reproducibility space :)

thejmazz commented 7 years ago

hi! thanks for the feedback!

we are well aware of CWL. while this project was being developed (I had 12 weeks, 4 of which went to deciding what to do..), we were looking into CWL, but the spec had not been finalized and time was tight, so it would have been too much for the MVP to use CWL.

now that the 1.0 is out, definitely going to spend time investigating how to integrate.

Some ideas:

I think CWL and watermill can work together, rather than be alternatives

They each have different objectives.

Watermill lets you orchestrate an entire pipeline composed of tasks from a high level, while CWL seems to be about strictly defining tasks that belong to some pipeline, strictly in the sense that filenames are "hardcoded".

Perhaps the logging output of a watermill pipeline could be a bunch of CWL files with absolute filenames baked in, for example. One institution could run a pipeline on their cluster, then have its execution dumped into CWL files. Then others could run this "baked" pipeline (taking the term from computer graphics: "baking" a texture with shadows, for example), assuming the files are in place.
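
To make the "baking" idea concrete, here is a rough Python sketch. It assumes a hypothetical execution log of `(task name, argv)` pairs; the task names, paths, and the `bake_task` helper are invented for illustration and are not part of watermill. Each logged command becomes a CWL v1.0 `CommandLineTool` description with every argument fixed. The output has not been validated against a CWL runner.

```python
import json


def bake_task(name, argv):
    """Return a CWL v1.0 CommandLineTool dict with every argument fixed.

    Hypothetical helper: nothing is parameterized, so the description
    only replays the exact command that was executed.
    """
    return {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        "id": name,
        "baseCommand": argv[0],
        "arguments": list(argv[1:]),
        "inputs": [],   # no inputs: all paths are baked into the arguments
        "outputs": [],
    }


# Hypothetical execution log from one pipeline run (names and paths are made up).
executed = [
    ("index_reference", ["bwa", "index", "/data/run42/reference.fa"]),
    ("sort_alignments", ["samtools", "sort", "-o",
                         "/data/run42/sorted.bam", "/data/run42/aligned.bam"]),
]

baked = [bake_task(name, argv) for name, argv in executed]

# CWL documents may be written as JSON, so json.dumps is enough for a dump.
print(json.dumps(baked[0], indent=2))
```

A dump like this only reproduces a run where the same files sit at the same absolute paths, which is the trade-off the "baked" analogy implies.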

Integration will be really important because maybe CWL can handle cluster usage, AWS usage, etc on its own and so we don't have to implement that.

Correct me on CWL assumptions if I'm wrong (@ everyone reading this) - I want to have wrong assumptions on CWL which come from a lack of reading its docs / understanding it, and say things that annoy people who know it well, and then be corrected, and we can all have a happy discussion!

Something to get CWL lovers steamed:

thejmazz commented 7 years ago

cc @tetron @mr-c @pditommaso

bmpvieira commented 7 years ago

Thanks @olgabot :)

I've been aware of CWL since BOSC2015 and discussed with @tetron at Biohackathon 2015 how it could be used with bionode.

One thing I'm looking forward to is using CWL-wrapped bioinformatic tools in a watermill pipeline, because wrapping is a pain and I just want to deal with JSON objects going in and out of an existing tool (e.g., samtools). We should give it a try once more wrapped tools become available.

Cheers

mr-c commented 7 years ago

Hello again @olgabot and @bmpvieira; nice to meet you @thejmazz.

CWL is a standard, not a platform, and it has two specifications: one for describing command line tools, another for describing workflows made from those command line tools.

We don't see ourselves as being in competition with anyone -- our goal is to enable more tools and platforms to communicate and interoperate.

For maximum composability, users are encouraged to keep tool and workflow descriptions in separate files and refer to the individual tool descriptions using an identifier, often a relative path but it could be something more portable. However, a CWL document and all of its referenced CWL parts can be 'packed' into a single file.
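
As an illustration of the two kinds of description and of packing, here is a simplified Python sketch. The file name `sort.cwl`, the port names, and the `pack` helper are assumptions made for this example. The `$graph` layout is only loosely modeled on packed CWL documents: real packers rewrite more identifiers than this does, and nothing here has been checked against a CWL runner.

```python
import json

# Tool description; it would normally live in its own file (assumed "sort.cwl").
sort_tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": ["samtools", "sort"],
    "inputs": {"alignments": {"type": "File", "inputBinding": {"position": 1}}},
    "outputs": {"sorted": {"type": "stdout"}},
    "stdout": "sorted.bam",
}

# Workflow description; the step refers to the tool by relative path.
workflow = {
    "cwlVersion": "v1.0",
    "class": "Workflow",
    "inputs": {"alignments": "File"},
    "outputs": {"sorted": {"type": "File", "outputSource": "sort/sorted"}},
    "steps": {
        "sort": {
            "run": "sort.cwl",
            "in": {"alignments": "alignments"},
            "out": ["sorted"],
        }
    },
}


def pack(workflow, tools):
    """Inline the referenced tools into one document under a "$graph" list.

    Hypothetical, simplified helper: it only rewrites each step's "run"
    reference and leaves the input documents unmodified.
    """
    main = {k: v for k, v in workflow.items() if k != "cwlVersion"}
    main["id"] = "#main"
    main["steps"] = {
        name: {**step, "run": "#" + step["run"]}
        for name, step in workflow["steps"].items()
    }
    graph = [main]
    for path, tool in tools.items():
        entry = {k: v for k, v in tool.items() if k != "cwlVersion"}
        entry["id"] = "#" + path
        graph.append(entry)
    return {"cwlVersion": workflow["cwlVersion"], "$graph": graph}


packed = pack(workflow, {"sort.cwl": sort_tool})
print(json.dumps(packed, indent=2))
```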

Nothing in CWL requires the use of filenames. It is a personal priority to encourage a move away from make-style overloading of filenames with multidimensional metadata through better tooling and systems.

The CWL standards obviously don't run on the cloud, but many of the current implementations do so; we designed it to be mindful of platforms targeting unified filesystems or "shared nothing" filesystems.

The CWL project has four co-founders: two from academia, two from commercial open source producing companies. Yes, one of those is from the Galaxy project, but our goal for the project was for CWL to be a catalyst for all platforms to co-evolve with and towards. We certainly learned from the Galaxy project's decade-plus experience along with influences from many other perspectives.

FYI: before the v1.0 spec was released we supported marking inputs and outputs as being streaming capable, where supported by the underlying tools. So big :heart: to this project's focus on likewise avoiding unnecessary writing to disk :-)
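
As a small illustration, the Python sketch below shows a tool description whose file ports carry the `streamable` field from the CWL `File` parameter definition. The tool, the port names, and the `streamable_ports` helper are assumptions made for this example, and whether a given runner actually streams the data depends on the implementation.

```python
# Hypothetical tool description: gzip reads its input sequentially and
# writes sequentially to stdout, so both ports are marked streamable.
compress_tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": ["gzip", "-c"],
    "inputs": {
        "reads": {
            "type": "File",
            "streamable": True,
            "inputBinding": {"position": 1},
        },
    },
    "outputs": {
        "compressed": {
            "type": "File",
            "streamable": True,
            "outputBinding": {"glob": "reads.gz"},
        },
    },
    "stdout": "reads.gz",
}


def streamable_ports(tool):
    """List the input and output port names marked as streamable."""
    return sorted(
        name
        for section in ("inputs", "outputs")
        for name, spec in tool[section].items()
        if isinstance(spec, dict) and spec.get("streamable")
    )


print(streamable_ports(compress_tool))
```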

I'd be happy to schedule a video chat open to the public so that interested people from bionode and CWL can chat in real time. Just let me know!

thejmazz commented 7 years ago

Nice to meet you as well @mr-c :) Great project summary! Hope my uninformed comments were not taken in a bad way!

"Nothing in CWL requires the use of filenames"

❤️ ❤️ ❤️

Would love to have a chat as well. Should read up a bit more on the CWL spec myself first though.