cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0
332 stars 93 forks source link

trigger: generalisation of triggering approaches #4686

Closed oliver-sanders closed 2 years ago

oliver-sanders commented 2 years ago

Related Issues:

If agreed this issue should supersede:

After a long chat with @dpmatthews (who proposed yet another triggering approach 😁) I think we can generalise the trigger problem into two dimensions:

Note: From the internal implementation these two dimensions may appear flip-sides of the same coin since they both boil down to the flow_nums, however, considering them from a user standpoint I think it's fair to prise them apart.

Note: Purposefully using new terminology to avoid conflation with existing terms, we may want to workshop "continue" and "overrun" a touch.

[1]: The quoted "merge" above relates to the interaction between two tasks with different flow_nums in general and not to the more specific concept of "flow merging" in the pool exclusively.

Combing these we get four spaces:

Continue Don't Continue
Overrun (1) Reflow (as currently implemented) (3) No Flow (current default trigger behaviour)
No Overrun (2) Continue (@dpmatthews new proposed implementation) (4) No Flow (@oliver-sanders proposed implementation)

Going through the four spaces in detail:

1) Reflow (implemented)

Equivalent to cylc trigger --flow=<new-flow-number>.

Continue: Yes Overrun: Yes

The use case is for re-running over tasks which have been previously run e.g. change configuration and re-run a sub-graph.

2) Continue (proposed)

Equivalent to cylc trigger --flow=<all-flow-numbers>,<new-flow-number>.

Continue: Yes Overrun: No

This approach feels quite "natural". The use cases are setting off another bit of the same flow where you don't want tasks to be overrun.

3) No Flow (implemented)

Equivalent to cylc trigger --flow -1.

I am using a negative flow number rather than None to distinguish the two no-flow approaches. Internally we can still maintain the same no-flow logic as present but would need to change the marker.

Continue: No Overrun: Yes

Useful for running one-off tasks that you do not want to impact the workflow in any way (i.e. cylc submit type uses).

4) No Flow (proposed)

Equivalent to cylc trigger --flow -2.

I am using a negative flow number rather than None to distinguish the two no-flow approaches. Internally we can still maintain the same no-flow logic as present.

Continue: No Overrun: No

Use case is for manually intervening in graph execution by ignoring dependencies or runahead limit and skipping ahead to a task which you want to be considered a part of the approaching flow front.

Interface

The internals to handle the four cases are already in-place, flow_nums, DB lookups etc, so it mostly boils down to an interface / documentation issue.

I think all four methods could be exposed via a single --flow argument, however, it is sensible to provide defaults for the different behaviours. I think it would be good to document the --flow equivalents as they may help users to understand their function.

Note that --reflow currently determines the new flow number server rather than client side which is sensible.

1) Enable behaviours explicitly

If we are happy with the continue/overrun model (after workshopping the terms) we could expose it directly something like:

# 1) reflow
cylc trigger --continue --overrun

# 2) continue
cylc trigger --continue

# 3) no-flow (implemented)
cylc trigger --overrun

# 4) no-flow (proposed)
cylc trigger

This is quite nice as you have to explicitly opt in to each behaviour separately reducing the scope for unintended results and accidents.

2) Single --flow argument

if we don't like the continue/overrun model we could move the presets into the flow argument something like:

# 1) reflow
cylc trigger --flow=new

# 2) continue
cylc trigger --flow=any

# 3) no-flow (implemented)
cylc trigger --flow=none

# 4) no-flow (proposed)
cylc trigger --flow=next

It's less behaviour driven so we would need to explain each option separately.

3) Separate flag for each approach

An alternative to (2) would be to could come up with three/four different flags:

# 1) reflow
cylc trigger --reflow

# 2) continue
cylc trigger --flow

# 3) no-flow (implemented)
cylc trigger --rerun

# 4) no-flow (proposed)
cylc trigger  # --run

Default

I think no-continue & no-overrun is the safest, sanest default because:

But I'm biased. I think the default is less important than the clear separation of behaviours.

hjoliver commented 2 years ago

Nice write-up @oliver-sanders :+1:

I generally agree on the matrix of cases we need to support, in terms of the functional capability.

I'm not so sure on the conceptual description, bearing in mind that as you said, the terminology needs work-shopping.

Continue (yes/no).

  • After I trigger the task will the flow continue from that point immediately.
  • Or does it only continue if/when a flow front catches up with it.
  • I.E. Should the triggered tasks spawn children on completion or after "merge".

I find this use of "continue" a bit brain-twisting. continue=no means continue if/when a flow catches up? So Case 3 is classified as overrun=yes, continue=no which has to be understood like this: don't continue immediately; but do continue if/when a flow catches up and merges; BUT that won't happen because the triggered task can be overrun by any following flow?

Overrun (yes/no).

  • Should the "merge" [1] condition be based on the pool or the DB?
  • I.E. Should triggered tasks overrun previous runs of tasks?

The "no" case of this is meaningless isn't it? If you trigger a task anywhere behind any flow, including the original flow which typically starts at the start of the graph, but it can't overrun previous runs, then the triggered task itself can't even run, no? Or do we need to specify which flow(s) it can't overrun?

  • I.E. Should the following flow overrun the triggered tasks?

I think it's intuitive to say "my newly-triggered flow can overrun previously-run flows". I'm not so sure that "my newly-triggered task/flow cannot be overrun by upcoming flows" is the best way to describe that concept though. Better to think of it as "my newly triggered flow is part of the upcoming flow(s)" (i.e. I am "pre-running" bits of the upcoming flow(s)).

hjoliver commented 2 years ago

In other words, I think we should describe what happens in terms of the task or flow being triggered:

I am triggering this task,

(with the n=0 task pool merge proviso in all cases)

hjoliver commented 2 years ago

A new trigger approach proposed by @dpmatthews.

I like the idea of adding a new flow number to triggered tasks (that aren't actually starting a new flow) so that they can be targeted with the CLI.

I'm not so sure about the "add all existing flow numbers" bit. How is that better than just merging with the first flow that comes along, whatever its number? Do you want avoid merging with other flows triggered after the first triggering event?

This is intended for the sort of use cases we would expect --flow=1 to be used for, but has been generalised to be reflow compatible.

Not sure I follow this. How is --flow=N not reflow-compatible?

it will not be overrun by any of the flows which exist at the time of the trigger.

This seems to suggest that the triggered task will merge with every subsequent flow that catches up with it? (or at least, every flow that was present when the task was triggered).

I would think we want to support pre-running a task for a particular flow (or perhaps for the first one that catches it), but the first merge neutralises it so that further flows can overrun it.

E.g. consider pre-running a task for the current main flow; then much later start an entirely new flow to re-run that part of the graph - it would be surprising if some hidden long-ago triggered task was treated as already-ran and merged with the new flow? I don't think we can tell the new flow not to merge with the old task, because that would preclude later decisions to pre-run tasks in the new flow.

[Update: your CLI examples seem to suggest you don't mean every subsequent flow should merge with the triggered task :+1: ]

Still considering the CLI options you've laid out ... at first glance I prefer something like 2.

I don't mind the default being as you describe, for triggering tasks in front of a flow. (Just not keen on the description in terms of "continue" and "overrun", for reasons above).

oliver-sanders commented 2 years ago

Overrun (yes/no).

The "no" case of this is meaningless isn't it?

Just not keen on the description in terms of "continue" and "overrun", for reasons above

I think the continue/overrun model is valid, I don't think the "overrun" part has been communicated.

I'm describing behaviour from the user's perspective not the implementation perspective.

Here's an example to show how the overrun=yes/no cases differ and hopefully clarify the 'continue' bit.

a => b => c => d => e => f

Lets start with task "a" running in the original flow, then trigger task "d" using each of the four methods.

The following examples show what the workflow would look like after it has finished running.

1) Reflow

cylc trigger --flow=2

The two flows never meet in the pool so never merge.

  • Does the flow "continue" from task "d" immediately after it completes: yes
  • Does the task "d" get "overrun": yes

2) Continue

cylc trigger --flow=1,2

The two flows never meet in the pool, however, they do meet in the DB. Because the triggered tasks belong to flow=1 they will not be overrun.

  • Does the flow "continue" from task "d" immediately after it completes: yes
  • Does the task "d" get "overrun": no

3) No-flow (implemented)

cylc trigger --flow=-1

The triggered task leaves the pool. As it is not considered to belong to any flow it will be overrun.

  • Does the flow "continue" from task "d" immediately after it completes: no
  • Does the task "d" get "overrun": yes

4) N-flow (proposed)

cylc trigger --flow=-2

The triggered task leaves the pool. As it is considered to belong to any flow it will be overrun.

  • Does the flow "continue" from task "d" immediately after it completes: no
  • Does the task "d" get "overrun": no
oliver-sanders commented 2 years ago

A new trigger approach proposed by @dpmatthews.

I'm not so sure about the "add all existing flow numbers" bit. How is that better than just merging with the first flow that comes along, whatever its number? Do you want avoid merging with other flows triggered after the first triggering event?

Approaches (2) and (4) are both trying to solve the overrun=no problem but are doing it in different ways.

Approach (4) uses a special marker to say "merge me with anything", approach (2) explicitly lists all of the flows tasks should be considered to belong to.

The reason we are not using the special marker to implement (2) is because that would prohibit a reflow from being able to overrun the tasks.

This is intended for the sort of use cases we would expect --flow=1 to be used for, but has been generalised to be reflow compatible.

If there are multiple flows (e.g. 1 & 2) active at the time the trigger is performed, --flow=1 will not necessarily do the job. What if the triggered task is caught first by flow=2? It would merge if in the pool but be overrun if no longer in the pool.

I would think we want to support pre-running a task for a particular flow (or perhaps for the first one that catches it), but the first merge neutralises it so that further flows can overrun it.

This is the use case for trigger type (4) [continue:no, overun:no].

The task will merge with the next flow that catches it, subsequent flows can overrun it.

E.g. consider pre-running a task for the current main flow; then much later start an entirely new flow to re-run that part of the graph

[Update: your CLI examples seem to suggest you don't mean every subsequent flow should merge with the triggered task 👍 ]

Because we explicitly listed all of the flows the triggered task belongs to when we issued the trigger, any new flow numbered that are subsequently added (by reflow) will be able to overrun the task.

Still considering the CLI options you've laid out ... at first glance I prefer something like 2.

If we can come up with good --flow descriptors this could work quite nicely.

wxtim commented 2 years ago

I'm going to want animated GIFs if I have to explain this to users.

dpmatthews commented 2 years ago

Still considering the CLI options you've laid out ... at first glance I prefer something like 2.

If we can come up with good --flow descriptors this could work quite nicely.

Some suggestions ...

cylc trigger --flow=new

Seems fine - this is a new flow which is separate from any existing flow (unless it merges).

cylc trigger --flow=any

How about --flow=current - this is part of the current flow(s) (hence no overrun).

cylc trigger --flow=none

Seems fine - this isn't part of any flow.

cylc trigger --flow=next

How about --flow=wait - wait for another flow to merge before continuing.

hjoliver commented 2 years ago

Coming back to this ...

I understand the "continue" and "overrun" terminology, but IMO it's more intuitive to think of it like this:

The triggered flow can:

[1] by "say it belongs to an upcoming flow" I mean any one of:

The main point of confusion of course is that continue=no means - by what I think is the normal semantics of the word! - "continue later" (after a merge). (I suppose you might say the triggered task does not continue, but the flow that merges with it does, but IMO are they are really one and the same).

Finally, "overrun" makes more sense, but it's kind of intuitively obvious in the following sense:

dpmatthews commented 2 years ago

@hjoliver It sounds like we're all happy with the 4 options - the issue is to how best describe them? I'm not clear what you're suggesting. What do you think of my suggestion? I'd avoided using the words "continue" and "overrun".

I'm still reasonable happy with --flow=new, --flow=none and --flow=wait as described above.

How about --flow=current - this is part of the current flow(s) (hence no overrun).

This seems the most difficult to describe. Other suggestions: --flow=join, --flow=combine, --flow=old, --flow=inherit These are all just different ways of saying become part of the existing flows(s).

oliver-sanders commented 2 years ago

The main point of confusion of course is that continue=no means - by what I think is the normal semantics of the word!

As stated in the OP I am not proposing the words "continue" and "overrun" and am fully aware of the issue with "continue".

Note: Purposefully using new terminology to avoid conflation with existing terms, we may want to workshop "continue" and "overrun" a touch.

If we are happy with the continue/overrun model (after workshopping the terms) we could expose it directly something like:

However, I am suggesting that this model for understanding the different methods might make for an intuitive interface given sensible substitutions for these words as it would help users to understand the ways in which the different trigger methods are similar and the ways in which they are different.

hjoliver commented 2 years ago

As stated in the OP I am not proposing the words "continue" and "overrun" and am fully aware of the issue with "continue".

Well your case 1 above does suggest those words as CLI options; and you did describe the whole thing in those terms, which I initially struggled to understand because the of the "continue" issue, and which I assumed (maybe wrongly) might be how you'd want to describe it to users too.

(Maybe I should have moved on from this discussion already, but I came back to it after a few days and it didn't seem fully resolved).

I'm not clear what you're suggesting.

Sorry, I wasn't suggesting anything there, just explaining why I don't think the conceptual description in terms of continue/overrun above is very helpful, compared to simply describing the triggered task belonging to an upcoming flow or not (and if it does belong, when the flow encounters it, it knows the task has already been run and so won't run it again).

What do you think of my suggestion? I'd avoided using the words "continue" and "overrun".

That's the next step, which I meant to follow up on the other day but ran out of time ...

hjoliver commented 2 years ago

OK, on to the real point ... CLI

I like the --flow=blah approach.

It seems pretty intuitive, and it remains compatible with giving actual flow numbers if that's ever needed. E.g. trigger a task in front of multiple flows, but only one of them should merge with it (I'm not sure what the use case is for that, but I'm also not sure what the use case is for having any one of several upcoming flows merge with it ... the reality is, I don't really see people using multiple upcoming flows in close proximity like this until we can support entirely independent flows).

How about a small variation on the --flow= suggestions above: if you trigger a task, expect it to continue (flow on) as normal unless --flow=none (one-off task) or --wait (wait for catch and merge before continuing).

a) trigger a one-off task that does not affect other flows:

$ cylc trigger --flow=none

b) trigger a new flow, that is not part of any existing flow:

$ cylc trigger --flow=new

c) pre-run parts of the graph ahead of an existing flow or flows:

# trigger and continue as normal
$ cylc trigger --flow=any  # any current flow can catch up and merge; others can overrun
$ cylc trigger --flow=1,2  # flows 1 or 2 can catch up and merge; others can overrun

# trigger as above but wait for merge before continuing
$ cylc trigger --wait --flow=any
$ cylc trigger --wait --flow=1,2

I like this because waiting is singled out as unusual behaviour. For a spawn-on-demand scheduler, the default should be to spawn downstream children as normal (i.e. as outputs are generated).

Also, it makes it clear that continue-now vs continue-later are essentially the same thing as far as the flows are concerned, just with a delay in the wait case.

Finally, maybe --flow= should be a required option, to force users to think about what they want, because getting it wrong could be costly.

wxtim commented 2 years ago

Can I confirm that my sketch matches the four Options @oliver-sanders ?

IMG_20220222_073945.jpg

In which case I like the terms,

And suggest that limiting the length of the bridge / size of island to 1 is just a special case - i.e.

  1. Long Bridge
  2. Long Island
  3. Length 1 Island
  4. Length 1 Bridge

The odd aspect of this "conceit" for describing reflow is that you don't know how long a bridge will be when you start building it.

When it comes to documentation I'm guessing I won't be the only person who needs pictures to understand this.

wxtim commented 2 years ago

Alt visualization:

IMG_20220222_080429.jpg

dpmatthews commented 2 years ago

Finally, maybe --flow= should be a required option, to force users to think about what they want

We have to remember that this discussion only applies to tasks not currently in the pool (*). If you, for example, trigger a failed task that should just run that task as part of the current flow and shouldn't require you to specify any additional options (and, if you do specify any options they should be ignored).

(*) Not sure what terminology to use. This certainly applies to a task with incomplete outputs. Also to a) a queued task, b) a task waiting for a clock trigger or an xtrigger, c) what else? We need to make sure this is clearly documented. We also need to describe what happens if you trigger a submitted or running task (the trigger is ignored).

because getting it wrong could be costly

I think this is why @oliver-sanders wants the default to be the least dangerous option (a one-one task?) - you shouldn't trigger a new flow without choosing the required behaviour.

hjoliver commented 2 years ago

@wxtim -

When it comes to documentation I'm guessing I won't be the only person who needs pictures to understand this.

Yes we'll definitely need diagrams!

I'm not keen on the "bridge" and "island" terminology for users (although I see what you're getting at).

hjoliver commented 2 years ago

We have to remember that this discussion only applies to tasks not currently in the pool (*).

Yes, I was taking that as a given at this point.

I think this is why @oliver-sanders wants the default to be the least dangerous option (a one-[OFF] task?) - you shouldn't trigger a new flow without choosing the required behaviour.

Yes, but my latest suggestion is even safer than having a "safest default": there is no default, and you can't trigger a new flow without using --flow=new. I'm not entirely averse to having the safest option as a default, but even that is arguably dangerous.

For example, users who blithely do a cylc trigger without considering what that means might not find out for quite some time that the upcoming flow ignores their triggered task.

hjoliver commented 2 years ago

If you, for example, trigger a failed task that should just run that task as part of the current flow and shouldn't require you to specify any additional options (and, if you do specify any options they should be ignored).

We can document that as If a task already belongs to a flow, triggering it will just run it sooner as part of its own flow.

Then we just have to make the "already belongs to a flow" bit obvious in the UI.

As for "shouldn't have to specify any options" (if already in the pool): that requires some thought (for the CLI at least).

oliver-sanders commented 2 years ago

Can I confirm that my sketch matches the four Options

Not sure about the first one, however, your "Alt visualizations" are spot on :100: (and exactly how I would want to present this to users). We should be able to do this in ASCII for the CLI --help.

Island & Bridge

Kinda get what you're going for there but Cylc "islands" are highly unstable:

Perhaps "sub-marine sea mounts"?

Well your case 1 above does suggest those words as CLI options; and you did describe the whole thing in those terms

Note: Purposefully using new terminology to avoid conflation with existing terms, we may want to workshop "continue" and "overrun" a touch.

I picked these awkward terms because otherwise we would have gone back around the terminology loop of the previous week again, please consider them as abstract concepts for which we can choose new labels.

Irrespective of how the CLI ends up, explaining these four trigger spaces will ultimately involve explaining the behaviours which form the matrix (since the matrix defines the fundamental differences) so we can do with hashing out options.

How about a small variation on the --flow= suggestions ... or --wait (wait for catch and merge before continuing).

Yes so we should define the interface from a description of its behaviour from the user's perspective (your --wait being the opposite of --continue).

If we add another option to reflect the "overrun" dimension, say --join then we are back to CLI option (1).

(I had considered this opposite arrangement, however, it makes reflow the default which is definitely not a good idea!)

CLI options (1) & (2) are both viable, however, I don't see a good reason to mix them.

Yes, but my latest suggestion is even safer than having a "safest default": there is no default

I don't think this is a good idea, especially from the UI perspective where it would be confusing as heck. The best option is to come up with a consistent behaviour that can apply across the whole range.

I think that's option (4) (i.e. continue=no, rerun=no) (i.e. --wait --join with the syntax above).

The full range of options should only need to be known to the very small proportion of users who would actually want to use them (i.e. keep the complexity away from the general case).

oliver-sanders commented 2 years ago

CLI option (2) single --flow argument.

# 1) reflow
cylc trigger --flow=new

# 2) continue
cylc trigger --flow=<TBC>

# 3) no-flow (implemented)
cylc trigger --flow=none

# 4) no-flow (proposed)
cylc trigger --flow=wait

CLI option (4) wait/join

Use the opposites of continue/rerun.

wait Wait for catch and merge before continuing.

join Tasks are considered to "belong" to other flows.

There are two options for the defaults:

a) Yes (wait and join default to "yes" so must be disabled manually) b) No (wait and join default to "no" so must be enabled manually)

# 1) reflow
a) cylc trigger --wait=no --join=no  # or --no-wait --no-join
b) cylc trigger

# 2) continue
a) cylc trigger --wait=no  # or --no-wait
b) cylc trigger --join

# 3) no flow (implemented)
a) cylc trigger --join=no  # or --no-join
b) cylc trigger --wait

# 4) no-flow (proposed)
a) cylc trigger
b) cylc trigger --wait --join 
hjoliver commented 2 years ago

@wxtim - "alt visualization", very nice :+1:

hjoliver commented 2 years ago

@oliver-sanders -

I picked these awkward terms because otherwise we would have gone back around the terminology loop of the previous week again, please consider them as abstract concepts for which we can choose new labels.

Yes I do understand that now, it's just that I (naturally IMO!) initially assumed you meant we might need to workshop better names (in our context) for the concept "continue" interpreted in the normal sense of the word. But nevermind, we are now on the same page on that one :grin:

Yes, but my latest suggestion is even safer than having a "safest default": there is no default

I don't think this is a good idea, especially from the UI perspective where it would be confusing as heck. The best option is to come up with a consistent behaviour that can apply across the whole range.

I agree that having no default is not ideal, especially for the (non-CLI) UI. But I'm not convinced yet that the any of the options constitute a safe default. Triggering tasks with or without reflow and with or without merging is more complex and more consequential than most (any?) other action. Taking cylc stop for example, the default makes perfect sense because you can change your mind at any point after issuing the command if you decide you really wanted a quick shutdown. You can't necessarily bail out like that if you've just triggered the wrong kind of action in terms of tasks or flows running.

The safest default is to trigger a one-off task, because that doesn't cause reflow and doesn't mess with existing flows. BUT that probably isn't what users will want most of the time.

hjoliver commented 2 years ago

CLI options (1) & (2) are both viable, however, I don't see a good reason to mix them.

I have a different spin on what you're calling mixing the options, and the "behaviour-driven" nature of option 1 (overrun/continue) and the four-way matrix.

I think a description in terms of the higher level "flow" concept explains those behaviours and in fact unifies two of them. And a good conceptual framework should always be easier to understand than a bunch of separate behaviours.


  1. A flow is a single run of the workflow propagating through the graph and Cylc 8 can support multiple flows

  2. Flows do not interact (unless they clash in n=0) so: flows overrun other flows, and they can be overrun by other flows

  3. When you manually trigger a task (outside of n=0) you can:

    1. run it as a one-off task, independent of any flow so: no continue, can be overrun
    2. or start a new flow so ("flows do not interact"): continue, can overrun, can be overrun
    3. or pre-run part of an existing flow or flows and you can choose to continue now or on catch-up, but either way it is the same flow so: existing flows will not overrun these tasks, because they already ran in this flow

The bits in bold are all that users need to understand this, and they are pretty clean and intuitive concepts.

So from a flow-centric perspective, under 3., there are only three fundamental options, not four as in your table; with a choice of continue-now or continue-on-catchup in the pre-run case.

A flow-centric perspective should not be problematic for users only interested in running one flow. They may still need (i) and (iii) and so they need to understand the difference between triggering a future task "as part of my flow" or "independent of my flow".

So this is obviously pretty close to your CLI option (2) single --flow argument, but there's one problem:

# 4) no-flow (proposed)
cylc trigger --flow=wait

We need to be to able to combine "wait" with different flow options: --flow=any or --flow=1,2 etc. Perhaps most of the time any can be assumed (or could be the default), but disallowing the flow-specific option would be a bold call.

oliver-sanders commented 2 years ago

Yes, but my latest suggestion is even safer than having a "safest default": there is no default

I don't think this is a good idea, especially from the UI perspective where it would be confusing as heck. The best option is to come up with a consistent behaviour that can apply across the whole range.

I agree that having no default is not ideal, especially for the (non-CLI) UI. But I'm not convinced yet that the any of the options constitute a safe default

I think no-flow option (4) is fine.

Triggering a task is expected to have consequences, same for n=1 same for n=21, same for Cylc 7, same for Cylc 8. This consequence is obvious and easily understood, if you tell something to run, it will run, the fact it ran will persist.

hjoliver commented 2 years ago

I think no-flow option (4) is fine.

OK, I agree with the following:

BUT:

If re-running is the more common use case, that suggests "no-flow option (3)" as the natural default. With the downside, however, that the default triggering behaviour would not be consistent in and out of n=0 (which albeit only affects pre-running).

So, in proposing no-flow (4) as the default, are you disagreeing on which use case is the more common, or do you just think n=0 consistency is more important? And/or that multiple flows will be uncommon, so inadvertent merging with a previously re-run task is not something that most users will have to worry about?

Note I'm not entirely against going with (4) as the default - it is only a default after all - just want to make sure we're on the same page regarding the cons as well as the pros. And it does seem kinda wrong to me to have a default that causes problems with future flows in the re-run case, particularly if I'm right that that is the more common use case.

Possible solution?

I guess we could largely mitigate the re-run problem (for option 4) by only merging with flows that already exist at the time of triggering. It's a reasonable assumption that pre-running would only be done for existing flows, and most re-running will probably not be done with other flows coming up behind. This would be like Dave's option (2) but with continue-later instead of continue-now. This still amounts to a default assumption that all triggering is for pre-running tasks in the next flow, but at least it will ignore flows that haven't been started yet.

hjoliver commented 2 years ago

By the way, the consistency argument is somewhat in the eye of the beholder . If you trigger a task once behind the flow, it runs twice unless we perversely default to not running it. That's not consistent with running a task only once if triggered once in front of a flow.

Of course if you trigger a task that already ran, you ought to know that it already ran and are choosing to trigger it anyway ... but still, it doesn't really equate to consistent behaviour wherever you trigger a task, unless you assume all triggering is for pre-running and not re-running, which hardly seems justified.

(However, "no-flow option (4)" in front of a flow does have the advantage of hiding the task pool better.)

hjoliver commented 2 years ago

Another thing in the back of mind, on the "flow centric" perspective: once we support entirely independent non-merging flows, different flows may well be doing different things, such as processing a different batch of data through the same graph. Then a) having multiple flows at once may become common; and b) any manual triggering may need to be flow-specific, at least as an option.

oliver-sanders commented 2 years ago

Summary

I think we are largely in agreement, if we implement a separate field for the "wait" status of a task (rather than using flow numbers to communicate this) then your modified option (4) is could work and I think would be pretty nice.

I had considered adding this field as too difficult because it would require:

  • Changes in the database (to persist the wait=yes/no status).
  • Changes to the Protobuf and GraphQL schema (to provide this information to the UI).
  • A way to visualise this info (cannot be a badge like the pause icon because we need that info too, perhaps we just colour code it like other flows and list it separately in the key with "(wait)" or something like that to differentiate it, will have a think)

But it is do-able.

are you disagreeing on which use case is the more common

That wasn't my point, however, I think it's debatable which is likely to be more common. In development triggering historical tasks as part of an edit/reinstall/reload/trigger loop might bump that use case up the list. Under normal circumstances I would expect it to be less common.

Here are the main cases in the order of importance I would expect:

  1. Triggering tasks with incomplete outputs - n=0
    • (mostly failed & submit-failed tasks).
    • This is the main use case by a country mile.
  2. Skipping queues / runahead limit - n=0
  3. Skipping dependencies - n>1
    • (pre-running in your terminology)
  4. Re-running arbitrary historical tasks - n>1
    • (i.e. succeeded tasks where succeeded is a compulsory output or failed tasks where failed is an optional output)
    • Most likely as part of an edit/reinstall/reload/trigger loop in order to test changes (but where the task didn't fail).
    • Otherwise more likely to be done as a reflow since re-running a single task probably won't change its result?
  5. Re-running historical "switch tasks" - n<1
    • (where n<0 is the n>0 window projected backwards in time)
    • (i.e. tasks with optional outputs which cause graph branching, likely to use custom task outputs)
    • Much more likely to be done as a reflow since the user is trying to overrun the graph?

(would rather not haggle over the exact order of these use cases, different users, different uses, different orders)

or do you just think n=0 consistency is more important?

Yes, I think consistency is critical.

The bulk of user's trigger experience will be with the n=0 window, trigger needs to behave the same outside of that window otherwise it's going to get very confusing.

And/or that multiple flows will be uncommon, so inadvertent merging with a previously re-run task is not something that most users will have to worry about?

Yes!

We expect reflow to be a rarely used feature which most users won't be aware of so we would expect this to be a niche situation that only a handful of the the most advanced users are likely to encounter.

Most of the complication in this issue is a result of the complexity of supporting reflows, which is fine, however, I think that additional complexity should be opt-in I.E. users who don't use reflows should not be impacted by the consequences of this new Cylc feature. I think we should tuck this stuff behind well documented options to keep the general case simple.

just want to make sure we're on the same page regarding the cons as well as the pros.

I think so.

I think we have one issue which is where an n<0 task is triggered before a reflow trigger is started from an earlier task.

I think this is a very niche set of circumstance. Reflows are an advanced feature so I am not too worried about this.

(if the triggers are run the other way around it's no problem because the n<0 task can now be considered n>0 from the perspective of the new flow).

I agree the approach of the continue trigger (type 2) should resolve this which would definitely be nice.

Possible solution?

I guess we could largely mitigate the re-run problem (for option 4) by only merging with flows that already exist at the time of triggering.

Would be nice. The reason I didn't go down this route for option (4) was because specifying flow numbers would infer wait=false.

It's a reasonable assumption that pre-running would only be done for existing flows, and most re-running will probably not be done with other flows coming up behind.

Very reasonable I think.

By the way, the consistency argument is somewhat in the eye of the beholder

I don't agree with this argument and it is equally true of the n=0 window as the n<0 one.

Trigger is an action, what is the consequence of that action:

(where n<0 is the n>0 window projected backwards in time)

The square brackets around the [more] is not an inconsistency. The number of times this task runs will increase by one.

(going back to an older comment about bailing out of a trigger you didn't mean to issue)

You can't necessarily bail out like that if you've just triggered the wrong kind of action in terms of tasks or flows running.

Quick thought, probably foolish, how about expanding the scope of cylc remove to n>0 tasks? If insert and trigger can work beyond n=0 why not remove too?

hjoliver commented 2 years ago

OK, I think I can see exactly where/why we sort-of-disagree now.

Here are the main cases in the order of importance I would expect:

I certainly agree that your first two cases are ahead by a country mile, but I was taking that as a given because the result of triggering those is fixed. So I've been talking about the other cases, where we have a choice between continue/wait and flow/no-flow. So the only question here is whether or not pre-running (/skipping dependencies) will be more common than re-running, and relatedly whether or not that should affect what we take as the default behaviour. And neither of us has hard data on that. So I guess I should go with the consensus - you and Dave outvote me :grimacing:

For the record, a few things that I don't entirely agree with, although I see where you're coming from:

I.E. users who don't use reflows should not be impacted by the consequences of this new Cylc feature.

I don't think reflow is a difficult concept that most users need to be protected from. Basically it's as simple as this:

(Also talk of this "complex new feature" being foisted upon users rather glosses over the fact that Cylc 7 does have some aspects of this behaviour, only in a vastly less consistent and comprehensible way than any of the options on the table here. Some "reflow" would occur automatically if there happened to be waiting tasks in the pool downstream - so anyone triggering a task had to be aware of this - and occasionally users had to go to great lengths - with cylc insert and cylc remove - to actually make that happen, or to prevent it).

Also, I've been less concerned about consistency between n=0 and n>0 (in the sense that you mean) because the distinction has a straightforward conceptual explanation: waiting tasks in n=0 are ready to run according to the task graph and thus already assigned to the flow (when spawned by their parent tasks) so triggering just makes them run a bit earlier. Tasks in n>0 are just abstract graph nodes so you naturally have more options there (trigger a new flow, or a one-off task, or assign to an existing flow). Further, we will soon (I hope!) be able to easily identify n=0 tasks, including incomplete tasks, in the UI. (It's already not too bad in the UI: e.g. runahead and queue limiting are identified (and those are n=0 features).

hjoliver commented 2 years ago

[UPDATE: agreed in project meeting, for implementation]

That (previous comment) being said, having discussed the pros and cons up the wahzoo, we do need a default and there's no perfect option, so I guess it might as well be your preferred one.

If waiting to merge with the next flow will be the default, we should probably reverse the switch, to flag not-waiting instead.

Combine that with the --flow= approach which is:

And we get:

# DEFAULT: run task but wait for next (any) existing flow before continuing:
# [UPDATE: trigger task gets all existing flow numbers]
cylc trigger 

# wait for flow 1 or 2 before continuing (other flows can overrun):
cylc trigger --flow=1,2 

# run a flow-independent one-off task:
cylc trigger --flow=none 

# start a new flow:
cylc trigger --flow=new

# NO-wait MERGE OPTION

# continue immediately will all *existing* flow numbers (other newer flows can overrun)
cylc trigger --no-wait

# continue immediately with existing flow numbers 1 and 2 (other flows can overrun)
cylc trigger --no-wait --flow=1,2

AND:

# DOCUMENT the hell out of this: to RE-RUN a task (or a subgraph) without
# interfering with subsequent flows, use `--flow=none` (or `--flow=new`)

I quite like singling out --no-wait as special, because triggering (essentially) a new flow that is part of an existing flow seems like the most extreme thing to do.

hjoliver commented 2 years ago

(Deleted two off-piste comments on cylc remove)

dwsutherland commented 2 years ago

Finally up to speed with this...

With reference to the issue description types: type-2 - Continue-Yes, Overwrite-No. type-4 - Continue-No, Overwrite-No. Both to be part of w/e flow comes next for future n>0, having all existing flow-nums (+extra for both(?)).

A few comments on the defaults:

Here are the main cases in the order of importance I would expect:

  1. Triggering tasks with incomplete outputs - n=0

    • (mostly failed & submit-failed tasks).
    • This is the main use case by a country mile.
  2. Skipping queues / runahead limit - n=0

For n=0 I would expect trigger to behave in a type-2 fashion by default, but here's where type-2 and type-4 converge (so there's no difference).. As it doesn't need to wait for anything. Technically - This is obviously adopting the flow number of, and acting on, the task in the active pool (no need to adopt wider flow numbers).

  1. Skipping dependencies - n>1

    • (pre-running in your terminology)

Would this be for tasks of a flow id with no prior submit? or tasks with no submits? I would say:

I would be happy with either type-2 or type-4 here... I actually think type-2 would feel more natural, even if it's more dangerous... With a --wait flag to change to type-4, being a n>0 task it would (like type-2) have all existing flow numbers (and the new?).. (makes more sense than the -2/None flow number, as it needs to be recognized by only the current flows)

Would the type-4 task spawn it's children held? (if any) Or will the task just disappear? (meaning we'll need to try load from the DB for every task (to check for old instances), or check the data-store)

  1. Re-running arbitrary historical tasks - n>1

    • (i.e. succeeded tasks where succeeded is a compulsory output or failed tasks where failed is an optional output)
    • Most likely as part of an edit/reinstall/reload/trigger loop in order to test changes (but where the task didn't fail).
    • Otherwise more likely to be done as a reflow since re-running a single task probably won't change its result?
  2. Re-running historical "switch tasks" - n<1

    • (where n<0 is the n>0 window projected backwards in time)
    • (i.e. tasks with optional outputs which cause graph branching, likely to use custom task outputs)
    • Much more likely to be done as a reflow since the user is trying to overrun the graph?

By default these should pick up the flow number(s) of the latest submit (?)

I think type-2 is fine as default for all cases, because if the future tasks of this flow id have already run then it will stop there (shouldn't continue anything).

It's probably just the flow number management/assignment that will differ between future n>0, n=0, and n<0/historical-n>0..

hjoliver commented 2 years ago

Summary of offline chat with @dwsutherland - he's largely on board with the agreed approach but would personally prefer to default to re-run (and therefore no merge) when triggering tasks behind a flow.. but agrees that both ways are valid choices and only the default ... feel free to correct me if I'm wrong David.

dwsutherland commented 2 years ago

Summary of offline chat with @dwsutherland - he's largely on board with the agreed approach but would personally prefer to default to re-run (and therefore no merge) when triggering tasks behind a flow.. but agrees that both ways are valid choices and only the default ... feel free to correct me if I'm wrong David.

Yes, the next submit of the most recent flow. Which would be intuitive, cover alternate paths (switches), and be consistent with a rerun of a failed task.

Users savvy with reflows can trigger with a specified flow number for historical (submit number > 0) tasks ahead of oncoming reflow.

This means users don't need to know about reflows to trigger historical, current, or future tasks.

(I would also vote for --no-wait to be the default and have a --wait option... As it seems kind of artificial to hold it for no reason, but I'm happy either way)

hjoliver commented 2 years ago

(I would also vote for --no-wait to be the default and have a --wait option... As it seems kind of artificial to hold it for no reason, but I'm happy either way)

@oliver-sanders @dpmatthews - on this point, can you just confirm that you want to triggered tasks to wait for merge by default? That was not my initial preference, and looking back I think I've based it on a single comment from Oliver: I think no-flow option (4) is fine, in response to my comment that none of the available options is ideal as a default.

And, presumably the reason is it will be less likely to scare users, c.f. flowing on immediately (but still merging on catch-up)?

If so, I don't think that's entirely unreasonable (it's just the default) ... but it is somewhat at odds with the Cylc philosophy (no barrier between cycles etc.) that all tasks can run as soon as they're able. On those grounds, as @dwsutherland suggests, no wait should be the default.

oliver-sanders commented 2 years ago

on this point, can you just confirm that you want to triggered tasks to wait for merge by default

Yes we would prefer (type 4), although we do see the argument for not waiting (type 2) also which I don't think is entirely unreasonable either.

I would consider the "wait" option to be safer because the scope of its consequences are much less and is more in line with the user experience in the n<0 realm. This way the additional behaviours must be explicitly requested, i.e. reflow and no-wait.

The --flow=all default is important, the --wait/--no-wait is less important. I'm ok with making --no-wait the default providing there is a way to recall the trigger if it was not intended (e.g. by adding a new flow number OR with cylc remove OR by another method).

oliver-sanders commented 2 years ago

Would this be for tasks of a flow id with no prior submit? or tasks with no submits? I would say: ... If there are prior submits then treat as historical (assign latest submit flow id).

The past of one flow is the future of another, consequently this approach would not be reflow compatible.

$ cylc trigger <workflow>//1/a
$ sleep 60   # let it run on a few cycles
$ cylc trigger --flow=new <workflow>//1/a  # start a reflow
$ cylc trigger <workflow>//3/a  # "pre-run" a task in the new flow

The --flow=all approach provides consistency here as the behaviour is the same whether or not the task was run before and weather or not there is a subsequent reflow running.

oliver-sanders commented 2 years ago

We are open to a --no-wait default, it is quite natural and matches the Cylc 7 insert + trigger behaviour.

It is nicer for historical "switch tasks", e.g. task failed last time around triggering a failure recovery branch, user fixes the issue and re-triggers the task in an attempt to activate the success branch.

hjoliver commented 2 years ago

Great, I think we'll switch back to a no-wait default then. That also lessens the impact of (in effect) assuming that by default all triggering is for pre-running rather than re-running (if triggering to re-run, we would not want to wait for an upcoming flow to merge before continuing).

hjoliver commented 2 years ago

So, to be implemented:

# Trigger task; continue immediately; any upcoming flow can merge on catch-up (other flows can overrun)
cylc trigger  # DEFAULT

# Trigger task; continue immediately; flows 1 and 2 can merge on catch-up (other flows can overrun)
cylc trigger --flow=1,2

# Trigger task; wait for merge with any existing flow before continuing (other flows can overrun):
cylc trigger -- wait

# Trigger task; wait for merge with flow 1 or 2 before continuing (other flows can overrun):
cylc trigger --wait --flow=1,2 

# Run a one-off flow-independent task:
cylc trigger --flow=none 

# Start a new flow (will overrun other flows):
cylc trigger --flow=new
oliver-sanders commented 2 years ago

Looks good, one issue with the above, the type (3) trigger:

# Run a one-off flow-independent task:
cylc trigger --flow=none 

Is missing the --wait argument:

# Run a one-off flow-independent task:
cylc trigger --flow=none --wait

Although I suppose we could infer it?

Relating this back to the OP

Re-rendering of Tim's diagrams with the proposed CLI options (defaults in square brackets):

rect11933

(These four trigger names are not a part of the proposal and should probably not make it past this issue)

oliver-sanders commented 2 years ago

Example 1 (n>0)

Re-producing the worked example from https://github.com/cylc/cylc-flow/issues/4686#issuecomment-1040063362

a => b => c => d => e => f

Lets start with task "a" running in the original flow, then trigger task "d" using each of the four methods.

1) Reflow

cylc trigger --flow=new

The two flows never meet in the pool so never merge.

2) continue

cylc trigger

The two flows never meet in the pool, however, they do meet in the DB. Because the triggered tasks belong to flow=1 they will not be overrun.

Note: I have not added a new flow number for control purposes inline with above comments, instead the user must be able to use cylc remove (or equivalent) to prevent the propagation of the flow if necessary.

3) no-flow implemented

cylc trigger --flow=none --wait

The triggered task leaves the pool. As it is not considered to belong to any flow it will be overrun.

4) no-flow proposed

cylc trigger --wait

Note: I have not added a new flow number for control purposes inline with above comments, instead the user must be able to use cylc remove (or equivalent) to prevent the propagation of the flow if necessary.

The triggered task leaves the pool. As it is considered to belong to all flows running at the time of the trigger it will not be overrun.

oliver-sanders commented 2 years ago

Example 2 (n<0)

a => b => c => d

Lets start with task "c" running in the original flow, then trigger task "a" using each of the four methods.

The following show how the flows would look at the end of the graph execution:

1) Reflow

A new flow is started which overruns the previous flow.

2) Continue

The task "a" will get re-run by the trigger, however, the graph will not run on from there.

3) No Flow (implemented)

The task "a" will get re-run in the flow "none", this will not merge with the flow "1" so it should still appear as part of the flow "none" in the DB.

4) No Flow (proposed)

Trigger type (4) behaves exactly like type (2) for historical tasks.

oliver-sanders commented 2 years ago

Example 3 (n<0 switch)

a:x? => b => c
a:y? => d => e

Lets start with the "x" pathway completed and re-trigger "a".

We will assume that the second run of "a" yields the output "y".

The following show how the flows would look at the end of the graph execution:

1) Reflow

Task "a" is re-run, the workflow runs on from there following the "y" pathway.

2) Continue

Task "a" is re-run, the workflow runs on from there following the "y" pathway.

3) No-flow (implemented)

The triggered task runs, however, due to the --wait constraint, the graph does not run on from there.

4) No-flow (proposed)

The triggered task runs, however, due to the --wait constraint, the graph does not run on from there.

oliver-sanders commented 2 years ago

See what you make of the three examples above, if agreed, happy to turn them into test cases.

Clarification of the three flow "words":

There are two missing parts of the proposal to be covered in other issues.

  1. How to "stop" a --no-wait flow (possibly, cylc remove).
    • https://github.com/cylc/cylc-flow/issues/4728
    • We had considered adding a new flow number when triggering with --flow=all for control purposes.
    • However, this would pollute the flow history with a new number for each trigger.
    • Nice for provenance but really ugly.
  2. How to "resume" a --wait flow (possibly cylc set-outputs).
dpmatthews commented 2 years ago

cylc trigger --flow=none --wait Although I suppose we could infer it

I would prefer we infer --wait. In fact, I'm not convinced "wait" makes any sense in this case. You're not waiting for anything - you're just not continuing because there is no "flow" to continue.

See what you make of the three examples above, if agreed, happy to turn them into test cases.

They make sense to me - thanks for defining these.

Definition of all: I think using the set of all flow numbers in n=0 should be sufficient and it avoids reviving historical flow numbers.

hjoliver commented 2 years ago

Looks good, one issue with the above, the type (3) trigger: ... Is missing the --wait argument: .. cylc trigger --flow=none --wait

Although I suppose we could infer it?

I guess this reflects that you're thinking of "wait" as "don't continue immediately"? Like Dave, I'm thinking of it as waiting for another flow to catch up and merge before continuing, in which case the option simply doesn't make sense in the --flow=none case.

hjoliver commented 2 years ago

Relating this back to the OP

That's helpful, thanks.

I'm still not sure if your conitinued use of the the original terminology is suggesting we stick with that to dexcribe and document the behaviour though? [UPDATE: :rofl: I just saw this below the image: These four trigger names are not a part of the proposal and should probably not make it past this issue :rofl:]

In case you are, I think the terms reflow, continue and no-flow are too easily misinterpreted and need reworking.

  1. reflow: --flow=new So far, I've mostly been using "reflow" to refer to the general capability to flow onward from a manually triggered task. We/I probably should not use it that way, AND we should not say that a particular flow is or is not "a reflow" either. Instead:

    • Cylc supports multiple flows in the same graph
    • You can start new flows by manually triggering tasks
    • "reflow" is what happens if and when a flow traverses parts of the graph that already ran in a previous flow
    • A new flow is not exclusively "a reflow" or not. E.g. it might start as a reflow but then move out ahead into clean graph; and a flow might be "a reflow" with respect to one flow, but "a preflow" with respect to another.
  2. continue: [--flow=all --no-wait] (As previously discussed) "Continue" here means "continue immediately", but the implied opposite (to "not continue") could reasonably be expected to be (3) (--flow=none), not (4) which is better explained as "continue later" (on merge).

  3. no-flow: --flow=none Fine, a one-off task that does not start a flow

  4. no-flow: [--flow=all] --wait (As previously discussed) This should be described as triggering part of an existing flow early and continuing to flow on later (after catch-up and merge), and "no-flow" does not sound like a good name for that.

hjoliver commented 2 years ago

Clarification of the three flow "words":

Agreed. And n=0 flow numbers should do for --flow=all

See what you make of the three examples above, if agreed, happy to turn them into test cases.

Great way to illustrate what happens, in combination with @wxtim's diagrams.

One minor quibble on your description of case (3) ... related to "the missing --wait argument":

The triggered task runs, however, due to the --wait constraint, the graph does not run on from there.

I would say, the graph does not run on from there because --flow=none (it's a one-off task, not a flow).

And (sorry :grimacing: !) possibly a more fundamental issue with your proposed n<0 behaviour (which I now see is different to what I had thought you meant before)

Firstly, as an aside, the n=0 window is well defined, but n>0 and n<0 are not really, because the n=0 window is not tied to a contiguous section of the graph. Even with only a single active flow (e.g. after a trigger-no-wait operation) there can be triggerable tasks that have n=0 tasks in both directions in the graph from them.

Secondly, this (what you've labelled as) n<0 behaviour privileges the very first flow over all others, because the result (c.f. n>0) is determined by whether or not the triggered task ever ran before (in any flow). I don't like that much. Seems to me we should be able to make stuff happen somewhere in the graph without caring whether or not a flow went through it some time in the past.

2) Continue The task "a" will get re-run by the trigger, however, the graph will not run on from there.

flow:1 a#1 (the naturally triggered run) a#2 (the manually triggered run) b c d

So if "a" had never run before, the default trigger (2 Continue) will cause "a" to flow on as part of flow 1.

But if "a" had ever run before, the default trigger (which is supposed to "continue") will behave exactly like --flow=none except that the flow label is different (1 vs none).

Instead, I think we should have consistent behaviour in terms of does the triggered task/flow merge with upcoming flows and continue or not regardless of where we trigger in the graph (still merging with existing flows by default, btw).

NOTE this is why my "to be implemented" post above was short and did not distinguish between "n<0" and "n>0" - according to my view of how this should work, that's not necessary. I had not gone down to the specifics of flow numbers there, but here's what I was thinking...

hjoliver commented 2 years ago

This case is straightforward (one-off task, no flow-on, no merge with existing flows):

cylc trigger --flow=none

And this case is straightforward (new flow, no merge with existing flows):

cylc trigger --flow=new  

But we differ here:

cylc trigger  # with or without --wait

The triggered flow should get:

That's it, and the default-trigger behaviour and result is the same before and after any flow.


A few comments:

If we have flow=1 and default-trigger a task in front of it:

If we have flow=1 and default-trigger a task behind it:

Then wherever we default-trigger a task (outside of n=0 of course):

Other pros:

(Note that adding a new flow number isn't strictly necessary except when there are no existing flows that have not run the task already ... in which case the user is definitely doing a re-run, and there is no following flow to merge with).