common-workflow-language / common-workflow-language

Repository for the CWL standards. Use https://cwl.discourse.group/ for support 😊
https://www.commonwl.org
Apache License 2.0
1.45k stars 198 forks source link

Workflow validation and connectedness #378

Open StarvingMarvin opened 7 years ago

StarvingMarvin commented 7 years ago

There is an underspecified part of the spec regarding what constitutes a valid workflow. Here is what the spec says:

A workflow is a process characterized by multiple subprocess steps, where step outputs are connected to the inputs of downstream steps to form a directed acylic graph

Here are two examples:

                    +-----------+           +---------+
                    |           |           |         |
                +--->   Step A  +----------->  Out A  |
                |   |           |           |         |
                |   +-----------+           +---------+
+---------+     |
|         +-----+
|  Input  |
|         +-----+    +----------+           +---------+
+---------+     |    |          |           |         |
                +---->  Step B  +----------->  Out B  |
                     |          |           |         |
                     +----------+           +---------+
 +---------+          +-----------+           +---------+
 |         |          |           |           |         |
 |  Input  +---------->   Step A  +----------->  Out A  |
 |         |          |           |           |         |
 +---------+          +-----------+           +---------+

+-----------+
|           |
|  Input 2  |
|           |
+-----------+

If we apply wording of the spec literally, it would mean that the first workflow isn't valid, because steps don't create a DAG.

If we say, sure intention of the spec is to include workflow inputs into deciding if something is a DAG, then the second one isn't valid because there is an unconnected input.

I'm actually fine with the second interpretation, except there are workflows such as bcbio that are accepted by cwltool, but have some unconnected inputs. So I'm really unsure what is the rule for verifying correctness of workflow graphs.

LourensVeen commented 7 years ago

Both these examples have nodes, directed edges, and no cycles, so I'd say they're both perfectly valid DAGs. I read that line as saying that there can be connections between steps, not that the DAG has to be a connected DAG for it to be a valid workflow.

mayacoda commented 7 years ago

What would be the utility of having a disconnected DAG as a valid workflow? From what I can tell, the dangling input in the second example would serve no purpose in workflow execution.

EDIT: What I mean to ask is does it make sense to have n independent DAGs in the scope of a single workflow?

pgrosu commented 7 years ago

Luka, I would keep it simple by just implementing a topological sort, and if the number of edges is equal or greater than the number of nodes, then the graph is not a valid DAG indicating possible cycles.

A single node is a valid DAG, which can be viewed as a source of potentially streaming data, and you can have many parallel DAGs processed together since there is no dependency between them.

StarvingMarvin commented 7 years ago

@pgrosu It's true, but it was still unexpected, at least for me. I always assumed that workflow consists of one connected component. Drawing parallel to programming, if there is a function like this:

f(a, b, c, d) -> x, y {
    return g(a, b), h(c, d);
}

Even though I'm not aware of a language that would dissalow this, I would never write a function that does two completely independent things. Whether it should be actively forbidden? Maybe not, but I'm certain nothing good can come from it :)

Anyway, as multi-component workflow interpretation, doesn't seem to cause any problems, let's go with it, but I'll remain feeling suspicious about it...

pgrosu commented 7 years ago

@StarvingMarvin Pattern matching very common that it is allowed in many languages - here's a list:

https://rosettacode.org/wiki/Return_multiple_values

Though usually it's written more like this:

  x y = f(a, b, c, d) {
    g(a, b), h(c, d)
  }

Though parallelism provides several benefits via commutative disjoint sequence of steps - guaranteeing that we never encounter conflicts in evaluation - it also possible to restrict pattern matching to singleton returns usually constrained via the specification of the grammar of the language, together with the associated evaluation (substitution/reduction) rules:

Grammar: expr ::= value | expr

Evaluation rules:

       expr -> expr'
  _______________________

   E[expr] ->* E[expr']

This is read as the evaluation of expression will take a multi-step evaluation (->*), that will simplify at each step by a simple evaluation (E) of the expr to its reduced version expr'. Since the grammar requires a base-case of value, and not tuple or other multi-value constructs, it will enforce the equivalence to preserved. If the grammar and/or the evaluation rules are not pattern-match-satisfied at any point by an input or output - through each and every evaluation step - it will fail the validation automatically.

Here an example:

Below are the evaluation pattern-matching rules:

   (λx.expr) value -> e[v/x] which means substitute in expr, x with v

   if true expr_1 expr_2 -> expr_1
   if false expr_1 expr_2 -> expr_2

So with an expression:

   ( if ((λx.x) true) e1 e2 )

This will evaluate in very specific steps using call-by-value evaluation E ::= [.] | E e2 | v1 E, which just means that replace only one part of the program via substitution-and-elimination using [.] and evaluate via E. All this means to step left-to-right and evaluate as much as necessary until reaching a value, and proceed with v1 E to pattern match to until all is evaluated as much possible.

Thus using the pattern-matching rules we match to (λx.expr) value, and thus we get the following first simplification step:

  ( if true e1 e2 )

Then we pattern match again, and we match to if true expr_1 expr_2 and thus we result in e1 as our answer. Thus by providing the grammar and evaluation rules, we can always guarantee how we evaluate if we want to be precise about it.

Hope it helps, `p

StarvingMarvin commented 7 years ago

@pgrosu I'm not commenting on if multiple return values are possible. I'm commenting on the fact that each of those values are completely independent: that a and b are only used while calculating x and c and d are used for y. So point I'm trying to make is that there is no language I'm aware of that would forbid you from making a function that is disjoint in such way, but that it's bad programming practice non the less. So to translate this back to the workflow issue: I can't really find a solid reason to disallow unconnected workflows, other than my discomfort with such idea.

pgrosu commented 7 years ago

@StarvingMarvin Ah I see, so it's more of a philosophical point. Well, I agree that usually things should be as concise as possible in the implementation, but different languages have different design goals - and some are more expressive than others. Since CWL organizes a flow of work, and we compile these concepts utilizing other languages through their definitional facilities. The commutative nature of CWL has similarities to batch programming models (i.e. MapReduce, DataFlow, DryadLINQ, etc.), since we sometimes need to satify the definitions of the language, maximize load balancing, etc. Think of an operating system, we have many disjoint processes but the CPU needs to hyperthread in order to maximize for responsiveness and performance. The same is true for a webserver processing connections in parallel, which might not share common information, but strives to make availability a priority.

Different languages have different goals, but reshaping the goals of one languages is possible by specifying grammar and evaluation rules, especially when implementing a language through another language.

mayacoda commented 7 years ago

@pgrosu @StarvingMarvin, How would you expect this behavior to be described in the context of a CWL graph editor? This question came up in the Rabix team because we've always kind of assumed that Workflows should be connected DAGs and if one isn't, then it also isn't valid. Thus, we added that validation into the Cottontail editor, only to find that bcbio didn't conform to it.

Though I'm completely fine with Workflows being valid either way (and I understand the reasons for both), I'm interested in what would be the most useful type of validation in this case. Should connectedness even be considered during validation? Should disconnectedness generate a warning?

StarvingMarvin commented 7 years ago

@mayacoda Well, hopefully apps will have their required inputs marked as required, so at least it can be checked if required inputs are connected. I would also warn on unconnected inputs, outputs, or tools who's outputs aren't connected to anything.

pgrosu commented 7 years ago

@mayacoda So this gets into the area of deciding between well-formedness and permissiveness of the language. The former ensures resources are not wasted with purposeful steps - including consistency in building up via saved workflows - while the latter would allow a whole workflow to be become the input to query and dynamically build up from. I feel in the next decade things will move more towards the latter as distributed computing resources will become cheaper with the need to connect arrays of dynamic workflows to continuously compare/analyze ever-increasing amounts data. The former I feel would require implementations to keep each other in check, which might be harder to enforce, but would allow for expectations to remain consistent for cross-compatibility. Is there an example from bcbio where you encountered this issue?

Ideally we want consistency in the implementation of the language, but like Luka mentioned having a polite warning would nudge the user to ensure only what is required is part of the implementation. This warning system could initially have the following three options:

The first two would would still allow the workflow to run, while the last would ensure the Rabix-interpretation of CWL is respected. This validation implementation can itself be a pluggable CWL file, allowing users to customize how they prefer to validate and what they would like to emphasize.

mayacoda commented 7 years ago

@pgrosu I agree with your interpretation of the warning/error system, we will likely implement something very similar. Of course, the editor in no way prevents the user from creating workflows that conform to the spec as it is written and agreed upon by implementers. The only thing it can do is suggest best practices, which any good tooling system ought to. The reason this came up at all was our (false) assumption of what constitutes a valid workflow, as Luka said bcbio has unconnected inputs which a basic graph validation picked up as a problem. Otherwise, I see nothing inherently wrong with it, so to speak.