cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0

Should there be inputs in correspondence to outputs in Cylc? #2764

Open TomekTrzeciak opened 5 years ago

TomekTrzeciak commented 5 years ago

Some time ago I came across this post arguing that workflow engines should allow not only task dependencies but also data flow to be modelled explicitly, as a first-class concept. A similar argument is made in this paper:

Workflow systems that lack explicit constructs for managing collections of data often lead to "messy" workflows containing either many connections between actors [...]; or many data assembly and disassembly actors; or both. The consequence of such ad hoc approaches [...] is that the modeling of workflows and the modeling of data become inextricably intertwined. This leads to situations in which the structure of the data processed by a workflow is itself encoded implicitly in the workflow specification---and nowhere else.

The recent write-up on Vision for Cylc beyond 2018/2019 Priorities lists the data-flow requirement as one of the key issues, with dependency calculation mentioned as the motivation, but I think there could potentially be many other benefits to having data flow explicitly represented in Cylc (e.g., workflow definition reuse and composition). I couldn't find any prior issue about this, so I've created this one to discuss further how data flow could fit in Cylc.

hjoliver commented 5 years ago

In the early days of Cylc I decided a pure dependency engine was needed because:

However, I can see we're going to have to reconsider this. By the way, by "modeling data flow" presumably you do mean actually moving data around, right? Otherwise it is just dependencies in disguise... (I'll read your references ... I have seen the paper before, somewhere in the distant past...)

hjoliver commented 5 years ago

I guess part of the point is that data flow may be more complicated than dependencies - i.e. that multiple inputs and multiple outputs may collapse into a single dependency, so we lose that information in a dependency graph. But are you also talking about automatically moving the data around according to specified input/outputs?

TomekTrzeciak commented 5 years ago

@hjoliver, data-flow modelling doesn't have to mean data movement. IMO it's more important to explicitly capture data dependencies within the workflow than to deal with the actual data movement. It might be sufficient to:

  1. have named inputs and outputs that get bound to some kind of data references (like file names)
  2. tag input/output with a type (local file, s3 object, ...) and provide a plugin system for handling different types (similar to how file installation is handled in Rose); local files (the most common case) could be assumed accessible as is and without any data movement

This way you could leave it up to the users to define their own data handlers, which could take care of data movement and whatever else is needed. The most useful and generic of these plugins could be then bundled with Cylc but still kept out of the engine core.
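The handler-plugin idea above could look something like the following sketch. All names here (`DataHandler`, `register_handler`, `resolve`) are hypothetical, invented for illustration; this is not a real Cylc API:

```python
# Hypothetical sketch of a data-handler plugin registry: input/output
# "type" tags map to user-supplied handler classes.
from abc import ABC, abstractmethod


class DataHandler(ABC):
    """Resolves a data reference of a given type to a usable local path."""

    @abstractmethod
    def retrieve(self, ref: str) -> str:
        ...


class LocalFileHandler(DataHandler):
    # Local files (the most common case) are assumed accessible as-is:
    # "retrieving" one requires no data movement at all.
    def retrieve(self, ref: str) -> str:
        return ref


HANDLERS: dict[str, DataHandler] = {"file": LocalFileHandler()}


def register_handler(type_tag: str, handler: DataHandler) -> None:
    """User plugins (s3, http, ...) register themselves here."""
    HANDLERS[type_tag] = handler


def resolve(type_tag: str, ref: str) -> str:
    """Dispatch a tagged data reference to its handler."""
    return HANDLERS[type_tag].retrieve(ref)
```

With this shape, the most useful handlers could ship with Cylc while staying out of the engine core, as suggested above.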

Now, about the importance of data-flow modelling. Data dependencies can be considered task dependencies in disguise, as you say, but only if you look from the workflow execution perspective. If you switch the point of view to system building, however, they become much more important. After all, it is the data produced by the workflow rather than the workflow execution itself that we are primarily interested in (well, depends a bit who you ask, I guess, but bear with me). I would compare workflows without explicit data flow to programming using subroutines without arguments and return values. Instead, all input and output values would be passed implicitly through global variables and only call dependencies would be explicit. Not impossible to work with, but for building large systems y = f(x) proved much more expressive in programming.

If you put execution at the nodes of the workflow graph, then the data flow naturally forms the edges of that graph. In order to decompose a large workflow into smaller, self-contained units, you need to cut through some of the edges, and output/input pairs naturally form the joining points. Conversely, when composing larger workflows from smaller units, inputs and outputs that are not consumed internally naturally aggregate at the workflow interface, so that the composite workflow can look the same as an atomic task - a black box with inputs and outputs. This enables modularization and hierarchical system building without explicit knowledge of the internal structure of workflows, as is currently necessary, e.g. for inter-suite triggers (either via built-in syntax or polling tasks) - this is what I alluded to above as workflow definition reuse and composition.
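The "black box" interface described above falls out mechanically from named inputs/outputs: the composite workflow's external inputs are those no internal task produces, and its external outputs are those no internal task consumes. A minimal sketch, with made-up task names and a toy data model:

```python
# Illustrative only: derive a composite workflow's external interface
# from its tasks' declared inputs and outputs.
tasks = {
    "preproc": {"in": {"obs"}, "out": {"clean_obs"}},
    "model": {"in": {"clean_obs"}, "out": {"forecast"}},
}


def interface(tasks):
    """External interface = inputs not produced internally,
    outputs not consumed internally."""
    produced = set().union(*(t["out"] for t in tasks.values()))
    consumed = set().union(*(t["in"] for t in tasks.values()))
    return {"in": consumed - produced, "out": produced - consumed}


# The two-task workflow looks like an atomic task: obs in, forecast out.
print(interface(tasks))  # {'in': {'obs'}, 'out': {'forecast'}}
```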

hjoliver commented 5 years ago

@TomekTrzeciak - nice explanation, thanks. I generally agree, but I would still push back a little: Task dependencies obviously are data dependencies really, it's just that the actual data is not described at the workflow level (you have to look at the task configuration). And your analogy with subroutine arguments is nice, but explicit data dependencies amount to nothing more than clearer documentation of task interfaces (relative to task dependencies) unless the data is passed around automatically (where necessary) - whether by built-in means or user-supplied plugins - which is what was vexing me before. To push the point, explicit subroutine arguments that inside the subroutine do not actually reference the data that they represent would only be as good as execution dependence plus documentation. But now that you have explicitly made the connection to data movement (and that user-supplied plugins or similar would be needed at least some of the time) I'm on board - this would indeed be better than plain task dependencies :boom:

hjoliver commented 5 years ago

BTW, worth reading the comments at the bottom of this post as well as the post itself.

oliver-sanders commented 5 years ago

@TomekTrzeciak Really good explanation! @hjoliver Good find with that post, I think the comment sums up the problem very well.

I've not yet written anything about dependencies in https://github.com/cylc/cylc/wiki/Possible-Cylc-Futures. Workflows may have a few kinds of dependency:

1) Data dependencies
2) Scheduling dependencies
3) Resource dependencies
4) Others? (trigger dependency, e.g. on-demand workflow)

I have thought about how we could make Cylc more "multi paradigm" when it comes to dependencies but haven't come to any real conclusions. In a way we already tick these boxes with:

1) Custom outputs
2) Regular dependencies
3) Queues

But these solutions are partial at best. Ideally we would express them all in the same syntax.

See also:

Going back to inputs and outputs, I guess it all depends how far you want to go with it. We could just use input/output information to generate the required dependencies and populate a couple of environment variables. Cylc doesn't have to check or manage outputs, just pass the required messages on to the tasks. Besides, outputs might not refer to resources anyway; they might be messages, or even the resource itself!

hjoliver commented 5 years ago

This is the kind of discussion that would be better had on Slack or similar (coming soon, hopefully).

... Cylc doesn't have to check or manage outputs, just pass the required messages onto the tasks ...

True, but remember this input/output information is necessarily already encoded in the suite definition, albeit in a user-defined form in task definitions rather than in the graph. So, if the new "data flow" model is not doing anything more with the specified inputs/outputs than deriving dependencies and setting environment variables, then what have we gained beyond a more exposed documentation of what the dependencies mean?? (And at the cost of a less transparent graph configuration ... although I suppose we'll lose the simple graph string representation with a Python configuration API anyway).

How about a consistent URL format for input/output resources of various kinds, and handlers (built-in or user-defined) to be used (automatically, or perhaps deliberately in task scripting) to retrieve and potentially validate inputs (where "retrieve" could mean just resolving to a local file path, for local files)?

Then it is more than just documentation, if the inputs/outputs are URLs that actually point to the associated resources and can be used to retrieve them for use.

(Maybe this is what @oliver-sanders and @TomekTrzeciak were thinking all along?)
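The URL idea could be sketched as scheme-based dispatch. Everything here is hypothetical (the function names, the registry, the example path); it just shows that for local files "retrieve" reduces to resolving a path:

```python
# Sketch: dispatch an input/output URL to a retrieval handler by scheme.
from urllib.parse import urlparse


def resolve_local(url: str) -> str:
    # For file:// URLs, "retrieving" just means resolving to a path.
    return urlparse(url).path


# s3://, http://, ... handlers would be added by built-ins or plugins.
RESOLVERS = {"file": resolve_local}


def retrieve(url: str) -> str:
    scheme = urlparse(url).scheme or "file"
    try:
        return RESOLVERS[scheme](url)
    except KeyError:
        raise ValueError(f"no handler for scheme {scheme!r}")


print(retrieve("file:///share/cycle/input.nc"))
# /share/cycle/input.nc
```

A handler could also validate the resource after retrieval, which is where this becomes more than documentation.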

hjoliver commented 5 years ago

Further, if inputs can be referenced in a generic way - i.e. you don't have to know what your inputs are called or who generated them, in order to retrieve them for use, that would be very helpful for distributed systems ... of sub-suites or similar. The data handlers could take name transformation arguments, to rename the retrieved resource to whatever is expected by the consumer.

oliver-sanders commented 5 years ago

what have we gained beyond a more exposed documentation

A workflow which is auto-assembling from a pool of tasks, not a different way of running workflows just a different way of defining dependencies.

although I suppose we'll lose the simple graph string representation with a Python configuration API anyway

We can continue to write dependencies as we currently do (e.g. by overriding the `>>` bitshift, `&` binary-and and `|` binary-or operators):

graph = cylc.graph(
    foo >> bar & baz >> pub,
    pub | fab >> qux
)

https://github.com/cylc/cylc/wiki/Possible-Cylc-Futures#graphs
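A minimal sketch of how that operator overloading could work (not the actual cylc API; `Task` and `EDGES` are invented for illustration):

```python
# Record dependency edges via an overloaded ">>" operator, so graph
# definitions read like the current graph-string notation.
EDGES = []


class Task:
    def __init__(self, name: str):
        self.name = name

    def __rshift__(self, other: "Task") -> "Task":
        # foo >> bar means "bar depends on foo"; returning `other`
        # lets chains like foo >> bar >> baz work left to right.
        EDGES.append((self.name, other.name))
        return other


foo, bar, baz = Task("foo"), Task("bar"), Task("baz")
foo >> bar >> baz
print(EDGES)  # [('foo', 'bar'), ('bar', 'baz')]
```

Grouping operators like `&` and `|` would need similar dunder methods, minding that in Python `>>` binds tighter than `&`, so `foo >> (bar & baz)` may need explicit parentheses.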

if the inputs/outputs are URLs

Interesting, we could merge this with Cylc message?

hjoliver commented 5 years ago

A workflow which is auto-assembling from a pool of tasks, not a different way of running workflows just a different way of defining dependencies.

Are the two halves of that sentence meant to be together (wondering if it's a typo!)?

A workflow which is auto-assembling from a pool of tasks,

How does specifying inputs/outputs result in a "self-assembling workflow" any more than specifying dependencies does?

not a different way of running workflows just a different way of defining dependencies.

I agree with that, but not sure I'm interpreting your words correctly in light of the "auto-assembling" bit. Using inputs and outputs is (seems to me) merely a better way of defining dependencies (better in that it documents what the dependencies mean) unless we can use the information for more than that, as I'm advocating above.

We can continue to write dependencies as we currently do (e.g. by overriding the bitshift ...

Indeed, but not if we switch to inputs/outputs instead of dependencies (at least it would get a lot messier, perhaps to the point that an intuitive-looking notation would not be of any benefit over supplying function args)

oliver-sanders commented 5 years ago

How does specifying inputs/outputs result in a "self-assembling workflow" any more than specifying dependencies does?

My use of the term self-assembling is probably confusing, and I'm about to make a bad job of explaining this as I'm not involved in writing these sorts of suites; @TomekTrzeciak can probably do a better job.

Imagine we have a large programmatic suite where various bits are turned on or off based on logic. In different configurations the same data could come from different tasks and the graph as a whole could have a different structure. Without the ability to perform introspection in the suite definition API it would require a vast amount of logic to keep track of which tasks and dependencies have been included.

This is the sort of thing we see quite regularly, examples like this can exist at a much greater scale as in Tomek's case(s):

            graph = """
{% if A %}
    {% if B %}
                foo => bar
        {% if D %}
                foo => pub
        {% endif %}
    {% elif C %}
                get_foo => bar
        {% if D %}
                get_foo => pub
        {% endif %}
    {% else %}
               bar
    {% endif %}
                bar => baz
{% endif %}
             """

[runtime]
{% if D %}
    [[pub]]
        [[[env]]]
    {% if A and B %}
            INFILE = 1
    {% elif A and C %}
            INFILE = 2
    {% endif %}
{% endif %}

Expressed using dependencies, we can't easily add and remove bits of the workflow: the suite writer has to include masses of logic to cater explicitly for each combination of options, resulting in the mess of Jinja2 above.

Expressing the requirement in terms of data makes things simpler. Tasks are included or not based on much flatter logic, their inputs and outputs are then used to auto-generate the graph. For instance:

foo:
    in:
    out: a

bar:
    in: a
    out: b

baz:
    in: b
    out: c

The graph auto-assembles to foo => bar => baz. This is what I mean by auto-assemble: it's not necessary to manually string tasks together into chains.

If we turned on option D as in the first example:

pub:
    in: a
    out:

The graph now auto-assembles to foo => bar => baz; foo => pub.

No need to specify that a is coming from foo. The user doesn't have to manually specify the appropriate logic; they only have to ensure that there is a matching output for every input (otherwise the suite would raise an error at validation).
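The matching step described here could be sketched as follows (a toy model, not a proposed implementation): build a map from each output name to its producing task, then derive one edge per input, raising a validation error for any input nothing produces.

```python
# Auto-assemble a dependency graph from declared inputs/outputs,
# using the foo/bar/baz/pub example above (with option D on).
tasks = {
    "foo": {"in": [], "out": ["a"]},
    "bar": {"in": ["a"], "out": ["b"]},
    "baz": {"in": ["b"], "out": ["c"]},
    "pub": {"in": ["a"], "out": []},
}


def assemble(tasks):
    # Map each output name to the task that produces it.
    producers = {}
    for name, spec in tasks.items():
        for out in spec["out"]:
            producers[out] = name
    # One edge per consumed input; unmatched inputs fail validation.
    edges = []
    for name, spec in tasks.items():
        for inp in spec["in"]:
            if inp not in producers:
                raise ValueError(f"no task produces input {inp!r}")
            edges.append((producers[inp], name))
    return edges


print(assemble(tasks))
# [('foo', 'bar'), ('bar', 'baz'), ('foo', 'pub')]
```

Dropping the `pub` entry from `tasks` removes the `foo => pub` edge with no other changes, which is the "much flatter logic" point above.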

It's a perspective thing, as Tomek said it's about whether you are designing from the perspective of the graph or the tasks:

... if you look from the workflow execution perspective. If you switch the point of view to system building, however, they become much more important.

hjoliver commented 5 years ago

@oliver-sanders - thanks, excellent explanation!

On reflection, I can see this also goes back to what @TomekTrzeciak was saying in our in-person meeting - which I agreed with at the time - but for some reason (perhaps the aforementioned blog post and its Galaxy screenshot) I had got it in my head that you guys were now advocating explicitly "drawing the graph" - i.e. making connections between tasks - but in terms of individual inputs and outputs instead of dependencies. (No doubt a conversation in front of a white board would have resolved this in minutes ... the perils of distributed teams).

Ironically what you describe is more or less exactly my original vision for Cylc, how it was originally implemented (separate task definitions that specified their "prerequisites" and "outputs", albeit in separate files), and how it still works internally (matching prerequisites and outputs). The graph interface is merely a shorthand for defining the most commonly used prerequisites and outputs that I found people were using in the early days (namely "task succeeded" messages), plus a reaction to complaints that it was too hard to understand the structure of the workflow (but in the early days I had no static graph visualization capability either, which I suppose might have been sufficient). Oh well... back to the future...

oliver-sanders commented 5 years ago

more or less exactly my original vision for Cylc

Absolutely!

(image: taskpool)

matthewrmshin commented 4 years ago

See also #3304.