ga4gh / workflow-execution-service-schemas

The WES API is a standard way to run and manage portable workflows.
Apache License 2.0

Proposal: defining new common schema to describe concrete job execution plan #12

Closed: junjun-zhang closed this issue 5 years ago

junjun-zhang commented 6 years ago

Not sure this is the right place to start this type of discussion, but here it is.

I feel I can't be alone in wondering whether there is anything we can do about the situation that there are so many different workflow definition options: CWL, WDL, Toil, Galaxy, Airflow, and Nextflow, just to name a few used in the bioinformatics world. I don't expect any of them to go away; it is important to have diverse choices for workflow authors. However, there is increasing demand for a given workflow execution engine to be able to run workflows defined in different workflow languages. Instead of each of the six execution engines mentioned above writing five different parsers, 30 parsers in total, is there a better way to do it?

My thought is that it would greatly reduce the effort if we could come up with a new common schema describing a concrete job execution plan, one that can be compiled from a workflow defined in any of the existing workflow languages. With this approach, each workflow language would only need one converter to translate its own execution plan into the common schema. Then, if an execution engine would like to support workflows written in other languages, it just needs to implement the capability to execute the common schema.
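To make the proposal concrete, here is a minimal sketch of what such a common execution-plan schema could contain. This is a hypothetical shape, not an existing GA4GH schema; every field name is made up. The idea is that an engine able to walk this structure topologically could run a workflow compiled from any front end.

```python
# Hypothetical sketch only -- not an existing GA4GH schema. A
# "concrete execution plan" is the flat set of jobs left after a
# language-aware front end has resolved a CWL/WDL/... definition.
from dataclasses import dataclass, field

@dataclass
class Job:
    id: str                       # unique within the plan
    image: str                    # container image to run in
    command: list[str]            # fully resolved argv, no templating left
    inputs: dict[str, str]        # input name -> URL of a staged file
    outputs: dict[str, str]       # output name -> path the job produces
    depends_on: list[str] = field(default_factory=list)  # upstream job ids

@dataclass
class ExecutionPlan:
    jobs: list[Job]

    def ready(self, done: set[str]) -> list[Job]:
        """Jobs whose upstream dependencies have all completed."""
        return [j for j in self.jobs
                if j.id not in done and set(j.depends_on) <= done]
```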

Is this feasible? We don't necessarily need to cover all workflow languages; being able to support the two most popular, CWL and WDL, should be good enough and is probably a good starting point.

Chatted with @geoffjentry this afternoon; he seems to agree with this approach.

Thoughts?

junjun-zhang commented 6 years ago

Standards and interoperability are the very goal of the GA4GH. Making the major bioinformatics workflow systems interoperable would open up great potential for expanded collaboration and reduce duplicated effort.

tetron commented 6 years ago

We tried to do this with Common Workflow Language and have succeeded to a degree -- there are already varying levels of CWL support among the workflow engines mentioned in your example, including Toil, Galaxy, Airflow, Nextflow, Cromwell, and Arvados.

Currently there are two main challenges:

There is also the question of what the right level of abstraction is. A shell script describing a workflow is "portable" but lacks the abstractions necessary to be scheduled across parallel nodes. It's not clear to me that there is a useful middle layer to be extracted between workflow description and actual job scheduling. However, CWL could provide a common abstraction to which more user-friendly workflow languages (like WDL) are converted. Someone just has to sit down and harmonize the semantics.

junjun-zhang commented 6 years ago

@tetron thanks for sharing your thoughts. Good, it's not all that depressing :). I feel encouraged by

Someone just has to sit down and harmonize the semantics.

My idea is actually to focus on the 'common core' among them; more specifically, the 'common core' at the level of the concrete execution plan that is compiled down from the original high-level definition that human users interact with. For example, if someone embedded JavaScript in the input of a CWL step, I don't expect that to be portable to other workflow systems. The way to address this in another workflow system is to take the concrete execution plan after all of the inputs have been resolved by a piece of CWL-aware code. This should in principle work for other languages too; for example, with Airflow, which defines workflow jobs in Python code, the code would be run once to generate the concrete execution plan, which is then completely detached from the original Python code.

The above approach should be able to address the three points you mentioned, at least to some degree.

The other point is that if some feature is so unique that it cannot be converted to the 'common execution plan', the conversion code would just raise an error to inform the user. The user could then either accept the fact that her workflow written in workflow language X cannot be converted to the common schema, or adjust the workflow to avoid using the unique feature.
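A sketch of that converter contract might look like the following. The step shape and field names are hypothetical, and `$(...)` merely mimics CWL's expression syntax:

```python
# Hypothetical converter contract: resolve every embedded expression
# up front, and fail loudly on features the common schema can't express.
def resolve_step(step, evaluate):
    """`evaluate` is a language-specific callback, e.g. a CWL-aware
    JavaScript evaluator. The emitted plan only ever sees plain values."""
    resolved = {}
    for name, value in step["inputs"].items():
        if isinstance(value, str) and value.startswith("$("):
            resolved[name] = evaluate(value)   # evaluated once, here
        else:
            resolved[name] = value
    # Stand-in check for a feature the common schema can't express
    # (e.g. an engine-specific requirement hook):
    if step.get("requirements"):
        raise NotImplementedError(
            f"step {step['id']}: feature not expressible in the common schema")
    return {**step, "inputs": resolved}
```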

tetron commented 6 years ago

Here's a really simplistic script that compiles simple CWL to shell script, as an example of breaking down a workflow into concrete steps. (It is just a proof of concept, and I don't think it's even been updated for CWL v1.0.)

https://github.com/common-workflow-language/cwl2script

In principle you could create a script of operations that execute linearly and use simple synchronization primitives (e.g. condition variables), but you need to pick your primitives carefully. You need operations that describe how to work with files and directories, launch tasks, wait for task completion, and translate the output of one step into input for the next. For many workflows, there are dynamic elements that can't be fully evaluated until the upstream dependencies have executed, so a workflow execution plan has to be able to evaluate (and branch on) those dynamic elements as well.
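For instance, a linear plan built from those operations might look like this minimal sketch, where thread events stand in for the condition variables; the op names are made up and none of this is cwl2script's actual output:

```python
# Sketch of a primitive set for a linear plan with synchronization.
import subprocess
import threading

class LinearPlan:
    def __init__(self):
        self.done = {}                       # step id -> threading.Event

    def launch(self, step_id, argv):
        """Launch a task; completion is signalled via an event."""
        self.done[step_id] = threading.Event()
        def run():
            subprocess.run(argv, check=True)
            self.done[step_id].set()
        threading.Thread(target=run).start()

    def wait_for(self, step_id):
        """Synchronization primitive: block until a task completes."""
        self.done[step_id].wait()

# Usage: launch two independent steps, then a step that needs both.
plan = LinearPlan()
plan.launch("a", ["echo", "step a"])
plan.launch("b", ["echo", "step b"])
plan.wait_for("a"); plan.wait_for("b")
plan.launch("c", ["echo", "step c"])     # a dynamic element would be
                                         # evaluated (and branched on) here
```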

buchanae commented 6 years ago

Thanks for writing this up. I've been pondering similar ideas for a while. It seems like most engines already convert workflow syntax into a lower-level abstraction (i.e. a workflow object model). There could be value in surveying the models underlying existing workflow engines and presenting a schema that captures the commonality between them. I think this lower-level abstraction could end up being much more portable than the higher-level workflow languages, and might free workflow language authors to focus on features rather than portability and execution. It's sort of like C/Assembly vs Python/Ruby/etc.

I've also thought that TES tasks could be the core primitive for this. Tasks could be linked together into a DAG by input/output URLs. I have done something like this for Galaxy workflows, and we've done it more indirectly by adding TES backends to Bunny and Cromwell.

I think making dynamic (not statically evaluated) workflow language expressions (e.g. JavaScript, WDL expressions) portable is possible. If evaluating the expression is just another task in the DAG, it might fit in well.
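A toy sketch of the URL-linking idea: the task dicts below only loosely follow the TES task shape, and the URLs and names are made up.

```python
# Sketch: derive DAG edges between TES-like tasks by matching the
# output URLs of one task to the input URLs of another.
align = {
    "name": "align",
    "inputs":  [{"url": "gs://bkt/sample.fastq", "path": "/data/sample.fastq"}],
    "outputs": [{"url": "gs://bkt/sample.bam",   "path": "/data/sample.bam"}],
}
stats = {
    "name": "stats",
    "inputs":  [{"url": "gs://bkt/sample.bam",   "path": "/data/sample.bam"}],
    "outputs": [{"url": "gs://bkt/sample.stats", "path": "/data/sample.stats"}],
}

def edges(tasks):
    """Yield (upstream, downstream) pairs wherever URLs line up."""
    for up in tasks:
        produced = {o["url"] for o in up["outputs"]}
        for down in tasks:
            if down is not up and produced & {i["url"] for i in down["inputs"]}:
                yield up["name"], down["name"]

print(list(edges([align, stats])))       # [('align', 'stats')]
# An expression-evaluation step would just be another task in this
# list, whose output URL downstream tasks (or the engine) consume.
```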

junjun-zhang commented 6 years ago

Thanks for chiming in, @buchanae. It does feel like the various programming languages based on the JVM, such as Scala, Clojure, and Kotlin: they are all compiled down to Java bytecode, which can then run anywhere a JVM is available.

As for dynamic evaluation, this is exactly what I had in mind. It could actually be done via a web service offered by the other workflow language. For example, when an execution engine needs just-in-time evaluation to generate the execution DAG from the other language, it just calls a web service. I am writing a new workflow engine, and this is the plan I have in mind to support workflows written in other languages.

By web service, I mean something like this: https://view.commonwl.org/workflows/github.com/genome/arvados_trial/blob/master/unaligned_bam_to_bqsr/align.cwl, which, by the way, I like a lot!

geoffjentry commented 6 years ago

@buchanae It came up a few times at GA4GH (including from @junjun-zhang) that people asked whether we (Cromwell) were intending to push our "WOM" as a standard. It is an intriguing idea and one I hadn't considered previously. My response in all cases was that I'd want to wait until our CWL project is complete and see where we're at. It's possible that some serious crimes against humanity were committed for the sake of getting things done. Maybe it's perfection. Don't know. But it's a thought.

buchanae commented 6 years ago

@junjun-zhang I'm interested to see how your approach to dynamic evaluation turns out, in particular a couple of the trickier cases:

  1. Expression might access the header of a large BAM file.
  2. Expression might result in a list of new tasks, in order to express a scatter.

I guess these could be implemented using webhooks, where the low-level engine executes the expression (possibly defined as a docker container) and calls the webhook with the result (the webhook would be workflow-language specific). Or maybe the ability to generate new tasks is a special case that should be handled by this low-level system. I dunno. Not sure I've wrapped my mind around that part :)
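Something like this rough sketch, where the image name, endpoint, and payload shape are all made up for illustration:

```python
# Rough sketch of the webhook flow just described.
import json
import subprocess
import urllib.request

def evaluate_and_callback(expr_image, expr, webhook_url):
    # Run the expression inside its language-specific container.
    result = subprocess.run(
        ["docker", "run", "--rm", expr_image, expr],
        capture_output=True, text=True, check=True).stdout.strip()

    # POST the value back to the workflow-language layer, which may
    # respond by scheduling new tasks (the scatter case above).
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"value": result}).encode(),
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)
```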

@geoffjentry Makes sense. Hindsight will be 20/20 :) Looking forward to it. This is why I was poking around the CromWOM model.

junjun-zhang commented 6 years ago

Being able to generate tasks on the fly is the basic idea behind dynamic evaluation. In JTracker (the system I am working on), scatter tasks are generated at execution time, and the number of scatter tasks can be determined by the output of a previous step. In complicated cases such as reading a BAM header, it's possible to define a dedicated task that performs the complex logic and then generates a sub-DAG that gets incorporated back into the original DAG. JTracker does not support calling sub-workflows yet; the plan is to have a special task that gathers the needed inputs and then generates a sub-DAG for the sub-workflow. This approach would even make it possible to perform recursive task execution (the same task calls itself until a certain condition is met), which is often needed by machine learning algorithms.
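As an illustration of the fan-out, here is a minimal sketch; the names are hypothetical, not JTracker's actual API:

```python
# Hypothetical sketch of execution-time scatter: one abstract task
# fans out into N concrete tasks once a split step's output is known.
def expand_scatter(abstract_task, chunks):
    """`chunks`: e.g. the list of files a previous split step produced."""
    return [
        {
            "id": f"{abstract_task['id']}-{i}",
            "command": abstract_task["command"] + [chunk],
            "depends_on": abstract_task["depends_on"],
        }
        for i, chunk in enumerate(chunks)
    ]

# A split step emitted three chunks, so the abstract task becomes three:
tasks = expand_scatter(
    {"id": "bqsr", "command": ["run_bqsr"], "depends_on": ["split"]},
    ["chunk0.bam", "chunk1.bam", "chunk2.bam"])
print([t["id"] for t in tasks])          # ['bqsr-0', 'bqsr-1', 'bqsr-2']
```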

This is a fun discussion, but it seems to have diverged from the original topic: can we come up with a new miniaturized standard for the concrete execution plan?

tetron commented 6 years ago

@junjun-zhang I think the point is that a design for a concrete execution plan needs to accommodate dynamic elements. That might mean generative steps (steps which can generate additional steps).

junjun-zhang commented 6 years ago

I guess you are right. For an execution engine that aims to run workflows written in different languages, it's important to support scatter tasks that are generated from an abstract task. And yes, there may be other types of generative tasks. More to think about.

If done well, this work will help pave the way toward another ambitious goal: the ability to compose workflows from tools and/or sub-workflows written in different languages by others.

jaeddy commented 5 years ago

I'm glad that this discussion will be archived in GitHub for posterity, but it feels outside the scope of WES at this point.