cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0
328 stars 93 forks source link

Cycle-exclusive family triggers #3381

Open sadielbartholomew opened 4 years ago

sadielbartholomew commented 4 years ago

Describe exactly what you would like to see in an upcoming release

A user has requested a succinct way to trigger off of a sub-set of tasks in a family, where the sub-set contains only those family members which are otherwise defined within the current cycle.

As far as I can tell, there is no way to do this at the moment, except for systematically writing out all those tasks in a logical triggering statement, without utilising their family grouping. What is being requested therefore is a new syntactical element to specify this as qualifiers to existing family trigger syntax.

The example below should illustrate the desired concept.

Extended scope? [Now excluded as this emerged not to be a good idea, as conveyed in https://github.com/cylc/cylc-flow/issues/3381#issuecomment-535290798 & https://github.com/cylc/cylc-flow/issues/3381#issuecomment-535291558 ]

I've made the Issue title more general, however, as a natural extension that may be a useful feature is to also provide syntax for triggering off of a sub-set of tasks in a family, where the sub-set is those members which are defined within another cycle. This becomes effectively inter-cycle triggers with whole namespaces. See the extension to the below example for clarification.

Illustrative example

With the following as a (minimal representative) suite.rc, [-] is used as an example of a new syntactical qualifier:

[scheduling]
    initial cycle point = 20000000T01
    [[dependencies]]
        [[[R1]]]
            graph = "red => yellow"
        [[[T12]]]
            graph = """
            green => car
            blue => car
            COLOURS:succeed-all[-] => lorry  # this is the new aspect, explained below
            """
[runtime]
    # Create two families, VEHICLES & COLOURS, and some tasks within each:
    [[VEHICLES]]
        # settings...
    [[car,train]]
        inherit = VEHICLES

    [[COLOURS]]
        # settings ...
    [[red,blue,green,yellow]]
        inherit = COLOURS

Without [-], as we currently support, the cycles created every T12 will contain all members of the COLOURS family:

$ cylc validate -v <suite-name>
...
Graph line substitutions occurred:
  IN: COLOURS:succeed-all => lorry
  OUT: red:succeed & blue:succeed & green:succeed & yellow:succeed=> lorry
...

but the user would benefit from something as demonstrated here with [-] which instead expands out logically to:

  OUT: blue:succeed & green:succeed => lorry

since those are the only members of COLOURS already defined in those cycles. The red & yellow tasks are, in this new scheme, not run for those cycles.

Extended-scope example

For example keeping with the current syntax for family triggers & for inter-cycle dependencies, you could replace COLOURS:succeed-all[-] => lorry in the example above with:

COLOURS:succeed-all[^] => lorry

which would instead expand out to run only tasks in the COLOURS family which are already defined for the initial cycle point (note the ICP is at T01 so only the R1 recurrence provides tasks to the ICP), namely:

  OUT: red:succeed & yellow:succeed => lorry

Additional context

It may be that this is not possible with the current parsing, or indeed scheduling &/or task pool, logic, but I will register it as a user feature request regardless, as it may become possible in the future.

Pull requests welcome!

matthewrmshin commented 4 years ago

See also #2452.

oliver-sanders commented 4 years ago

Is the intention that this:

            green => car
            blue => car
            COLOURS:succeed-all[-] => lorry  # this is the new aspect, explained below

would be the same as this:

            green => car & lorry
            blue => car & lorry

?

The subset is going to have to be constructed once somewhere, could be done in the parameters section to avoid repetition:

[cylc]
    [[parameters]]
        subset = green, blue
[scheduling]
    [[dependencies]]
        [[[T12H]]]
            graph = """
                <subset> => car
                <subset> => lorry
             """

?

sadielbartholomew commented 4 years ago

@oliver-sanders I think you are missing the point of the request, as the user wants to a new syntax that will allow them to specify triggers based on families. I am sure they could do what is required for this deliberately-trivial example easily in another way, but in their case they had a very large suite with a lot of programmatic generation by task & family numbers where they wanted a short, simple way not prone to user error to specify tasks within a family & already known to the cycle without having to carefully interpret all the tasks in the family that were already there & set something equivalent via existing means.

oliver-sanders commented 4 years ago

Need a compelling use case.

sadielbartholomew commented 4 years ago

Need a compelling use case.

Fair enough, fortunately "the user" (who I will continue to refer to mysteriously as such, for anonymity, until they reveal themselves) has told me they may comment here. Hopefully they can flesh out their motivations for wanting this feature.

hjoliver commented 4 years ago

As it happens, I have a good use case, related to a recent question on the Discourse forum: how to trigger a task when all other tasks in a cycle point have finished?

If you have a single graph sequence in a workflow, you can just do this:

[scheduling]
   [[graph]]
        P1D = """
    foo => bar => baz  # etc.
    root:finish-all => cycle_done  # OK (single graph cycle only)
                  """

But it won't work if you have any other sequence (even an R1 at start-up) with different dependencies in it.

[scheduling]
   [[graph]]
        R1 = "prep => foo"
        P1D = """
    foo => bar => baz  # etc.
    root:finish-all => cycle_done  # ARRGGHH!
                  """

Now prep => foo appears in all cycle points, not just at start-up.

(For the uninitiated: family membership is defined by run-time inheritance - regardless of cycling - and use of a family name in the graph is just shorthand for "all member tasks"... so here, using the root family in a P1D sequence gives all defined tasks a P1D sequence).

dwsutherland commented 4 years ago

Can't we just do this:

    [[COLOURS]]
        # settings ...
    [[SUBSET]]
    [[red,yellow]]
        inherit = COLOURS
    [[blue,green]]
        inherit = COLOURS, SUBSET

Then with

SUBSET[-P1D]:succeed-all => lorry

We get

(flow) sutherlander@cortex-vbox:~$ wsgraphql subset nodesEdges lorry 1
{
  "request_string": "
query {
  nodesEdges(workflows: [\"subset\"], ids: [\"lorry\"], distance: 1, sort: {keys: [\"state\", \"id\"], reverse: false}) {
    nodes {
      id
      state
    }
    edges {
      id
    }
  }
}",
"variables": null
}
{
    "nodesEdges": {
        "nodes": [
            {
                "id": "sutherlander|subset|20000101T12|blue",
                "state": "succeeded"
            },
            {
                "id": "sutherlander|subset|20000101T12|green",
                "state": "succeeded"
            },
            {
                "id": "sutherlander|subset|20000101T12|lorry",
                "state": "succeeded"
            },
            {
                "id": "sutherlander|subset|20000102T12|blue",
                "state": "succeeded"
            },
            {
                "id": "sutherlander|subset|20000102T12|green",
                "state": "succeeded"
            },
            {
                "id": "sutherlander|subset|20000102T12|lorry",
                "state": "succeeded"
            },
            {
                "id": "sutherlander|subset|20000103T12|lorry",
                "state": "waiting"
            }
        ],
        "edges": [
            {
                "id": "sutherlander|subset|blue.20000101T12|lorry.20000102T12"
            },
            {
                "id": "sutherlander|subset|blue.20000102T12|lorry.20000103T12"
            },
            {
                "id": "sutherlander|subset|green.20000101T12|lorry.20000102T12"
            },
            {
                "id": "sutherlander|subset|green.20000102T12|lorry.20000103T12"
            }
        ]
    }
}

Maybe I misunderstood the issue...?

AdamVoysey commented 4 years ago

Hello – I now reveal myself as the mysterious “user”!

The motivating case comes from the LFRic test suite. Whilst @sadielbartholomew’s above example captures the feature suggestion well, it doesn’t shed much light on the problems in the original context by virtue of being simplified. The suite contains multiple NRUN-CRUN tests for different model configurations – all of which may be turned on or off for a given run. Each CRUN has a different number of repeats and length(s). (In short: each cycle has different tasks.) In addition to this, some CRUN configurations have extra (optional) post-processing tasks which may be done. All configurations which could be post-processed belong to a special family.

The first naïve attempt to implement this was to add a family trigger for the post-processing tasks protected by a simple jinja2 switch… something like:

{%- if post_process %} 
PROCESS:succeed-all => postprocess
{%- endif %} 

…but similarly to @hjoliver’s case, this has the effect of adding all possible post-processable tasks to every cycle. Hence unwanted dangling tasks with no pre-requisites that rapidly fail.

AdamVoysey commented 4 years ago

@oliver-sanders

The subset is going to have to be constructed once somewhere, could be done in the parameters section to avoid repetition:

[cylc]
    [[parameters]]
        subset = green, blue
[scheduling]
    [[dependencies]]
        [[[T12H]]]
            graph = """
                <subset> => car
                <subset> => lorry
             """

?

The problem is such a set is not known beforehand (is options dependent) and varies from cycle-to-cycle.

Technically, I suppose one could generate a (different) set of parameters for each cycle with jinja2. However such a solution would be:

And therefore will create quite a high maintenance burden.

hjoliver commented 4 years ago

@dwsutherland - yes you can manually define family groupings that only include the intended tasks in each cycling sequence. In fact that was my suggestion on the forum. However, I presume our "mysterious user" @AdamVoysey will think that is equally painful and fragile etc. in a complex suite. (If you get it wrong, suddenly your family name in the graph is adding unintended tasks into the cycle).

hjoliver commented 4 years ago

So I think there is a good case for family triggers that refer to members previously defined as belonging to the cycle, but do not add new members to the cycle.

I'm not convinced on @sadielbartholomew's extended syntax suggestion though, because the inter-cycle triggers are no longer explicit (and readable at a glance). Plus, what if the family has members in several other cycle sequences, not just one other?

hjoliver commented 4 years ago

This would require a new family trigger syntax, to distinguish from the normal (member-adding) case. We can't just change semantics wholesale to non-member-adding, as that would disable one of the main reasons for having family triggers in the first place (simplified graph, even for defining dependencies, not just for referring to existing dependencies)

sadielbartholomew commented 4 years ago

So I think there is a good case for family triggers that refer to members previously defined as belonging to the cycle, but do not add new members to the cycle.

I'm not convinced on @sadielbartholomew's extended syntax suggestion though

That is understandable from the outlined reasons. I'm happy to amend the title and nature of the Issue in that way, but I'll wait to hear from @AdamVoysey first (I'll ask offline or he might comment himself) to check that something along those lines would meet his requirements.

oliver-sanders commented 4 years ago

I think I understand the issue but I'm not convinced that introspection is the way to solve it. A lot of the introspective workflow definition requirements I've heard are more about hacking around the current static configuration system and can be solved more easily without introspection when we move to programmatic workflow generation. (though I have seen one compelling use case for introspection from Tomek).

Implicit "magic" has many downsides, I'm sure the Python API will support graph introspection but the explicit solution may be more compelling. The Python API should support sub-graphs which can be composed into a larger graph allowing more modular definition. Hopefully sub-graphs can provide a solution to this problem:

with Graph() as process:
    if opt_a:
        Foo
        Bar
    if opt_b:
        Baz >> Pub

with Graph() as post_process:
    Post_a
    if opt_a & b:
        Post_a >> Post_b
    if opt_c:
        Post_a >> Post_c

with Graph() as flow:
    process >> post_process

cylc.run(flow)

Note that the sub-graphs are Python objects which can be passed around and modified something like:

process.append(Qux)
hjoliver commented 4 years ago

@oliver-sanders - I certainly agree with you for the long run, but it may be a while before we get there. For the moment, it is definitely the case that users are sometimes surprised by their family triggers putting unexpected tasks into a cycle (of course it should not be "unexpected" if you understand how families are defined, and what a family name means in the graph ... but mistakes happen, or assumptions get made) - and that may be quite easy to fix. (But if not easy, I will also vote to wait for the Python API).

sadielbartholomew commented 4 years ago

@AdamVoysey has let me know offline that he is happy with @hjoliver's thoughts on what is sensible, i.e. the standard scope outlined in the opening comment, & what is not, i.e. the extended scope there, as we interpreted from the comments above (https://github.com/cylc/cylc-flow/issues/3381#issuecomment-535290798 & https://github.com/cylc/cylc-flow/issues/3381#issuecomment-535291558). In light of that I have amended the title & "crossed out" the description of the latter to reflect the updated nature of the feature request.

Leaving the milestone at Cylc-9 when (according to the current plan) we will have a Python API seems sensible. If anyone feels it can be tackled sooner they can always assign someone appropriate & revise the milestone tag.