trigger: generalisation of triggering approaches

oliver-sanders commented 2 years ago

Related Issues:

If agreed this issue should supersede:

https://github.com/cylc/cylc-flow/issues/4657

https://github.com/cylc/cylc-flow/issues/4653

After a long chat with @dpmatthews (who proposed yet another triggering approach 😁) I think we can generalise the trigger problem into two dimensions:

Continue (yes/no).
- After I trigger the task will the flow continue from that point immediately.
- Or does it only continue if/when a flow front catches up with it.
- I.E. Should the triggered tasks spawn children on completion or after "merge".
Overrun (yes/no).
- Should the "merge" [1] condition be based on the pool or the DB?
- I.E. Should triggered tasks overrun previous runs of tasks?
- I.E. Should the following flow overrun the triggered tasks?

Note: From the internal implementation these two dimensions may appear flip-sides of the same coin since they both boil down to the flow_nums, however, considering them from a user standpoint I think it's fair to prise them apart.

Note: Purposefully using new terminology to avoid conflation with existing terms, we may want to workshop "continue" and "overrun" a touch.

[1]: The quoted "merge" above relates to the interaction between two tasks with different flow_nums in general and not to the more specific concept of "flow merging" in the pool exclusively.

Combining these we get four spaces:

| | Continue | Don't Continue |
|---|---|---|
| Overrun | (1) Reflow (as currently implemented) | (3) No Flow (current default trigger behaviour) |
| No Overrun | (2) Continue (@dpmatthews new proposed implementation) | (4) No Flow (@oliver-sanders proposed implementation) |
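A minimal sketch of the two-dimensional model, in case it helps to see the four spaces as data. The names `TriggerMode`, `TRIGGER_MODES` and `describe` are illustrative only and are not part of cylc-flow.

```python
from typing import NamedTuple


class TriggerMode(NamedTuple):
    """One cell of the continue/overrun table (illustrative names)."""
    label: str
    flow_on: bool  # "continue": spawn children as soon as the task completes
    overrun: bool  # "overrun": merge check uses the pool only, not the DB


# Keyed by (continue, overrun), numbered as in the table above.
TRIGGER_MODES = {
    (True, True): TriggerMode("1) reflow", True, True),
    (True, False): TriggerMode("2) continue", True, False),
    (False, True): TriggerMode("3) no-flow (implemented)", False, True),
    (False, False): TriggerMode("4) no-flow (proposed)", False, False),
}


def describe(continue_: bool, overrun: bool) -> str:
    """Look up the name of the space for a continue/overrun combination."""
    return TRIGGER_MODES[(continue_, overrun)].label
```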

The bad news is it looks like we have use cases for all four.
Dave & I think the no-overrun cases are more important than the overrun ones.
The good news is that they can coexist and the mechanism for supporting all four is already implemented, so it's mostly an interface problem.

Going through the four spaces in detail:

1) Reflow (implemented)

Equivalent to cylc trigger --flow=<new-flow-number>.

Continue: Yes Overrun: Yes

Tasks are triggered with a new flow number.
The reflow can overrun previous flows.
The reflow will merge if it collides with another flow in the pool (and only in the pool i.e. overrun).

The use case is for re-running over tasks which have been previously run e.g. change configuration and re-run a sub-graph.

2) Continue (proposed)

Equivalent to cylc trigger --flow=<all-flow-numbers>,<new-flow-number>.

Continue: Yes Overrun: No

A new trigger approach proposed by @dpmatthews.
Tasks are triggered with all existing flow numbers plus a new flow number (which we added purely so the new flow can still be targeted by CLI tools).
Because this flow contains all existing flow numbers it will not be overrun by any of the flows which exist at the time of the trigger.
This is intended for the sort of use cases we would expect --flow=1 to be used for, but has been generalised to be reflow compatible.

This approach feels quite "natural". The use cases are setting off another bit of the same flow where you don't want tasks to be overrun.
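To illustrate why a "continue" trigger cannot be overrun, here is a rough sketch. It assumes flow numbers are plain integers, that the new number is simply one more than the largest existing one, and that a flow only re-runs a task it shares no flow number with. The function names are hypothetical.

```python
from __future__ import annotations


def continue_flow_nums(existing: set[int]) -> set[int]:
    """All existing flow numbers plus one new number (assumed: max + 1)."""
    new_flow = max(existing, default=0) + 1
    return existing | {new_flow}


def can_overrun(incoming: set[int], task_flow_nums: set[int]) -> bool:
    """Assumed rule: a flow overruns a task only if they share no flow number."""
    return incoming.isdisjoint(task_flow_nums)


existing = {1, 2}
triggered = continue_flow_nums(existing)

# Neither existing flow can overrun the triggered task ...
blocked = [can_overrun({n}, triggered) for n in sorted(existing)]
# ... but a flow started later with an unrelated number still could.
later_flow_overruns = can_overrun({4}, triggered)
```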

3) No Flow (implemented)

Equivalent to cylc trigger --flow -1.

I am using a negative flow number rather than None to distinguish the two no-flow approaches. Internally we can still maintain the same no-flow logic as present but would need to change the marker.

Continue: No Overrun: Yes

Useful for running one-off tasks that you do not want to impact the workflow in any way (i.e. cylc submit type uses).

4) No Flow (proposed)

Equivalent to cylc trigger --flow -2.

I am using a negative flow number rather than None to distinguish the two no-flow approaches. Internally we can still maintain the same no-flow logic as present.

Continue: No Overrun: No

Use case is for manually intervening in graph execution by ignoring dependencies or runahead limit and skipping ahead to a task which you want to be considered a part of the approaching flow front.

Interface

The internals to handle the four cases are already in-place, flow_nums, DB lookups etc, so it mostly boils down to an interface / documentation issue.

I think all four methods could be exposed via a single --flow argument, however, it is sensible to provide defaults for the different behaviours. I think it would be good to document the --flow equivalents as they may help users to understand their function.

Note that --reflow currently determines the new flow number on the server side rather than the client side, which is sensible.

1) Enable behaviours explicitly

If we are happy with the continue/overrun model (after workshopping the terms) we could expose it directly something like:

# 1) reflow
cylc trigger --continue --overrun

# 2) continue
cylc trigger --continue

# 3) no-flow (implemented)
cylc trigger --overrun

# 4) no-flow (proposed)
cylc trigger

This is quite nice as you have to explicitly opt in to each behaviour separately reducing the scope for unintended results and accidents.
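As a rough idea of how the explicit opt-in flags might be parsed, here is an `argparse` sketch. It is not the real `cylc trigger` option parser; the program name, the `MODES` table and `mode_for` are made up for illustration.

```python
from __future__ import annotations

import argparse

# Hypothetical stand-in for the trigger command's option parser.
parser = argparse.ArgumentParser(prog="trigger-sketch")
# "continue" is a Python keyword, so store the flag under another name.
parser.add_argument("--continue", dest="flow_on", action="store_true")
parser.add_argument("--overrun", action="store_true")

# (continue, overrun) -> behaviour, as in the four commands above.
MODES = {
    (True, True): "reflow",
    (True, False): "continue",
    (False, True): "no-flow (implemented)",
    (False, False): "no-flow (proposed)",
}


def mode_for(argv: list[str]) -> str:
    """Return the behaviour selected by the given command line options."""
    args = parser.parse_args(argv)
    return MODES[(args.flow_on, args.overrun)]
```

With no options given this resolves to the no-continue, no-overrun behaviour, so each of the other behaviours has to be requested by name.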

2) Single `--flow` argument

If we don't like the continue/overrun model we could move the presets into the flow argument, something like:

# 1) reflow
cylc trigger --flow=new

# 2) continue
cylc trigger --flow=any

# 3) no-flow (implemented)
cylc trigger --flow=none

# 4) no-flow (proposed)
cylc trigger --flow=next

It's less behaviour driven so we would need to explain each option separately.

3) Separate flag for each approach

An alternative to (2) would be to come up with three/four different flags:

# 1) reflow
cylc trigger --reflow

# 2) continue
cylc trigger --flow

# 3) no-flow (implemented)
cylc trigger --rerun

# 4) no-flow (proposed)
cylc trigger  # --run

Default

I think no-continue & no-overrun is the safest, sanest default because:

The minimum set of behaviours is the simplest.
The "Continue" cases have a dramatic impact on the workflow execution and are hard to revoke.
The "Re-run" cases are quite advanced and require additional knowledge to operate.

But I'm biased. I think the default is less important than the clear separation of behaviours.

oliver-sanders commented 2 years ago

@hjoliver it is not clear what you are proposing, please could you fill out the above examples with your desired behaviour and highlight where they differ.

You seem to be suggesting the rules for what flow numbers are provided by --flow=all differ depending on whether the task has run before or not in contradiction with:

Agreed. And n=0 flow numbers should do for --flow=all

hjoliver commented 2 years ago

You seem to be suggesting the rules for what flow numbers are provided by --flow=all differ depending on whether the task has run before or not in contradiction with:

Agreed. And n=0 flow numbers should do for --flow=all

Not really, I'm saying current active flows (i.e. those in n=0) should be sufficient, c.f. all flows recorded in the DB.

With the small caveat (which is probably what caused the confusion here, sorry) that we should exclude flow numbers of flows that have already passed through the triggered task. That is what allows the default trigger to re-run a sub-graph (say) behind a flow (because the triggered task will not take the flow number of the flow that we are re-running, even if that flow number still exists in n=0).

please could you fill out the above examples with your desired behaviour and highlight where they differ.

OK, I'll try to do that now, since we desperately need to lay this one to rest. I wonder if this is gonna end up the longest single issue page on the project :-)

hjoliver commented 2 years ago

suggesting the rules for what flow numbers are provided by --flow=all differ depending on whether the task has run before or not

Also, I'd say the rules are exactly the same in both cases, it's just that in the never-ran-before case there is no previous flow number to exclude.

oliver-sanders commented 2 years ago

we should exclude flow numbers of flows that have already passed through the triggered task

So if there is only one flow in the workflow the task will not run at all.

If there are multiple flows in the workflow the "continue" trigger will result in a reflow irrespective of whether the other flow(s) are ahead or behind of the original?

Examples would be great.

hjoliver commented 2 years ago

So if there is only one flow in the workflow the task will not run at all.

No, see this comment:

and a new flow number (in case there are no existing flows that have not used the task already)
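For concreteness, one possible reading of this rule as code. It assumes the new flow number is always added and is one more than any number seen so far; `proposed_trigger_flow_nums` is a hypothetical helper, not cylc-flow code.

```python
from __future__ import annotations


def proposed_trigger_flow_nums(
    active: set[int], already_ran_in: set[int]
) -> set[int]:
    """Active (n=0) flows that have not yet used the task, plus a new flow.

    active: flow numbers currently present in n=0.
    already_ran_in: flow numbers the triggered task has already run in.
    """
    candidates = active - already_ran_in
    # Assumption: the new number is one more than any number seen so far.
    new_flow = max(active | already_ran_in, default=0) + 1
    return candidates | {new_flow}


# Task ahead of the only flow: it takes that flow number plus a new one.
ahead = proposed_trigger_flow_nums({1}, set())
# Task behind the only flow: just a new flow number remains.
behind = proposed_trigger_flow_nums({1}, {1})
# Task between two flows: it keeps only the flow that has not reached it.
between = proposed_trigger_flow_nums({1, 2}, {1})
```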

oliver-sanders commented 2 years ago

Ok, so this effectively changes the default to reflow for historical tasks.

I would much prefer for reflows to require users to opt in in all cases because the consequences of reflow on users' data are quite dangerous and reflow (and multiple flows in general) are way beyond what we can expect of the working knowledge of the vast majority of users.

hjoliver commented 2 years ago

If there are multiple flows in the workflow the "continue" trigger will result in a reflow irrespective of whether the other flow(s) are ahead or behind of the original?

(See my terminology comments above on what exactly "reflow" means)

So I think "the continue trigger" should, by definition, "continue", which means a flow should carry on from the triggered task.

The main thing, which we agreed on, is that by default that continuing flow should not get overrun by any existing flows (and I'm not arguing with that).

hjoliver commented 2 years ago

Ok, so this effectively changes the default to reflow for historical tasks.

Meh, sort of. My way is simpler from a consistency perspective (same behaviour on triggering a task, whether or not it ever ran before), and I think what matters and is easier to understand is whether the triggered task flows on or not. The fact that flowing on after triggering an n>0 task is not technically a "reflow" will be lost on most users. It will look like a new flow to them (now we have the original flow, and this new one from where I triggered a task) ... the fact that it happens to have the right flow numbers so that the original flow won't overrun it on catch-up, or that it is "not a reflow" because those tasks never ran before, is secondary.

hjoliver commented 2 years ago

And my other related point is that if you are triggering a past task to re-run it, you are just as likely to want it to flow on (the regenerate some products use case), as opposed to running a single task.

The re-run a single task case seems to me to be best expressed by the non-default --flow=none option. For two reasons: 1) you want to trigger a single task, not a flow; and 2) my "flow integrity" argument above: a flow is a self-perpetuating run through the graph, and the previous flow already passed by ... so why should the re-triggered task have the same flow number?

hjoliver commented 2 years ago

I would much prefer for reflows to require users to opt in in all cases because the consequences of reflow on users' data are quite dangerous and reflow (and multiple flows in general) are way beyond what we can expect of the working knowledge of the vast majority of users.

I don't disagree that "reflow is dangerous" in the sense that it re-runs tasks and that will probably overwrite existing data. However:

re-running a single task with no flow-on does that too; if you re-run anything you have to be aware of that consequence
the graph shows what is supposed to happen downstream of any task, so it should not be very surprising if that happens unless you tell it not to. It is not so uncommon for Cylc 7 users to expect it to happen and then to struggle to understand how to make it happen via the nightmare of cylc inserting multiple waiting tasks in the right order.
I don't think we should significantly complicate the conceptual flow model by going to lengths to avoid reflow

At least I think we probably both understand where the other is coming from now.

Because I was focused more on consistent triggering behaviour, when you agreed to go back to the no-wait default I thought that applied equally to future and past tasks. i.e. no-wait in front of flow=1 means "flow on now" (with all current flow numbers that could catch up and merge); and no-wait behind flow=1 means exactly the same thing.

Both generate a new flow front. The fact that one case involves re-running past tasks should be blindingly obvious to users because they deliberately triggered a task that already ran.

hjoliver commented 2 years ago

If you're not coming around to my perspective (which again, makes for simpler, consistent triggering behaviour and does not treat flow=1 as ~~magic~~ [SPECIAL]) then I suppose one way out of this bind is to revert to "wait" as a default. I'd rather not do that because a) it artificially constrains the workflow; and b) if it behaves as you want for re-running tasks, it makes the "wait" concept harder to understand (easy: wait for existing flows to catch up before continuing; weird: if only flow=1 exists and we trigger behind it, what are we "waiting" for??)

hjoliver commented 2 years ago

Example 1 (`n>0`)

(SAME RESULT in all cases)

Example 2 (`n<0`)

1) Reflow

SAME RESULT (A new flow is started which overruns the previous flow.)

2) Continue

DIFFERENT RESULT: same as 1) Reflow

The task "a" will get re-run by the trigger, and the graph WILL run on from there (that's what "continue" and "no wait" means)

3) No Flow (implemented)

SAME RESULT

4) No Flow (proposed)

DIFFERENT RESULT: still same as type (2), but now that is the same as reflow rather than no-flow

oliver-sanders commented 2 years ago

If you're not coming around to my perspective (which again, makes for simpler, consistent triggering behaviour and does not treat flow=1 as magic)

Disagree on "simpler", "consistent" and "magic" 😁.

You're not winning me over I'm afraid. I see your points, but I don't agree with them. Since the start I've maintained that defaulting to reflow is dangerous and that all reflow functionality (and all its complex consequences e.g. no-flow) should be opt-in.

You are proposing that --flow=all can actually mean all flows, OR all flows and a new one minus an existing one, OR just a new flow, which isn't especially consistent.

If I understand correctly what you are proposing does not add any new functionality, it just changes the default. If so my interpretation covers all bases, but if you want a reflow you must manually say so.

hjoliver commented 2 years ago

You are proposing that --flow=all can actually mean, all flows OR all flows and a new one minus an existing one

That's kind of a misrepresentation because it ignores the definition of flow. A flow is a self-consistent self-perpetuating run through the graph. If a flow has passed by a task, retriggering it should be considered a new flow (or a one-off no-flow), because by definition that task has already run in that flow. You are saying, give the task the same flow number it had before but run it anyway, even though it has already run in that flow.

OR just a new flow, which isn't especially consistent.

My consistency is at the conceptual level. When you trigger a task, any task, does it flow on or not. This supposed inconsistency is down at the level of flow numbers which is really an implementation detail that we use to make the required behaviours work.

hjoliver commented 2 years ago

If I understand correctly what you are proposing does not add any new functionality, it just changes the default. If so my interpretation covers all bases, but if you want a reflow you must manually say so.

That's right, but we are coming from two different flow models (in a sense). By my conceptual model (which I'm claiming is simpler) your default is different behind the first flow than it is in front of it. (And it doesn't even seem to make sense with respect to the names that you gave the options: behind flow=1 the "continue" / no-wait default does not actually continue anything.)

oliver-sanders commented 2 years ago

I don't think we are going to get anywhere with this, suggest another call.

oliver-sanders commented 2 years ago

(otherwise it's going to be another ten pages of reply, quote and response)

hjoliver commented 2 years ago

Yep, can do :+1:

hjoliver commented 2 years ago

OK, meeting done. Result: I concede defeat. :boom: Reasons, for the record:

I took "historical task" above to mean "ran previously in any flow", which makes flow=1 special (i.e. once flow=1 passes by, a task instantly becomes "historical" and stays that way) ... BUT we will only consider active flows at trigger time
I will give up on the absolute sanctity of a "flow" as (once triggered) a self-perpetuating run through the graph, because re-triggering a task with the old flow number does seem to provide the cleanest solution for re-running to hit a different branch.
And having given up on that, the pre-run vs re-run behaviour (in terms of flowing on or not after triggering) at least makes sense in terms of the assigned old flow number
And last but not entirely least:
- this whole discussion is only about what should be the default behaviour
- and this is a reasonably democratic project :grin:

Also, on terminology:

We agreed not to describe a flow as "a reflow" (or "not a reflow") for the reasons given above. The term may still be OK in this limited sense: if/when a flow overruns another flow it could be said to be "reflowing" that particular region of the graph.
We need a good way to describe the two different flow merge concepts: 1) flows with a common flow number will not overrun the same graph nodes; and 2) any flows will merge if they "collide" in n=0 ... (maybe "collision" is a good term for that).
I wasn't keen on --wait (or --no-wait) as an option name or concept because when re-triggering behind a flow you would almost never want to wait for an upcoming flow to merge and then continue. However:
- that objection no longer applies if we re-use the old flow number (which prevents reflow without requiring --wait)
- and even if it's not relevant most of the time, "wait for merge" does at least describe what will happen (by default) if an existing flow does catch up to a triggered task
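The two merge concepts could be sketched roughly as follows. This assumes a merged task carries the union of both sets of flow numbers, and that "already ran" is decided by a DB lookup of the flow numbers the task ran with. `on_spawn` and its return values are hypothetical.

```python
from __future__ import annotations

from typing import Optional


def on_spawn(
    incoming: set[int],
    pool_flow_nums: Optional[set[int]],
    db_flow_nums: set[int],
) -> tuple[str, set[int]]:
    """Decide what happens when a flow tries to spawn a task.

    pool_flow_nums: flow numbers of the task if it is in n=0, else None.
    db_flow_nums: flow numbers the task has previously run with.
    """
    if pool_flow_nums is not None:
        # Concept 2 ("collision"): any two flows meeting in n=0 merge.
        return "merge", incoming | pool_flow_nums
    if not incoming.isdisjoint(db_flow_nums):
        # Concept 1: a common flow number means no overrun of this node.
        return "skip", incoming
    # No shared flow number and no collision: the task is (re)run.
    return "run", incoming
```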

hjoliver commented 2 years ago

The final result then, for implementation.

(@oliver-sanders' explicit examples above are all valid and useful, and should be made into tests, but I think we can ditch the four-way categorization at this point).

Trigger Active Flows

cylc trigger [--wait]

The triggered task runs with the set of all active (n=0) flow numbers, A

it will flow on if its children have not already been spawned in any member of A
- default: immediate flow-on
- --wait: flow on if/when members of A catch up and merge with it
- the triggered flow will merge with any member of A that catches it, or that it catches
- it will not merge with other flows (unless it is in the active set for a subsequent trigger event)
otherwise (if children already spawned in any member of A) it will not flow on
- (in which case --wait is meaningless)

(It gets a bit gnarly to list exactly what happens when triggering ahead of all flows, behind all flows, and between flows ... but we don't need to do that here as it's all derivable from the above).
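A minimal sketch of the flow-on rule for the active-flows trigger, assuming we can look up which flow numbers the task's children have already been spawned in. `flow_on` and its return strings are illustrative, not real cylc-flow API.

```python
from __future__ import annotations


def flow_on(
    flow_nums: set[int], children_spawned_in: set[int], wait: bool = False
) -> str:
    """Describe what a triggered task does once it completes.

    flow_nums: the set A (or S) that the triggered task runs with.
    children_spawned_in: flow numbers the task's children already exist in.
    """
    if not flow_nums.isdisjoint(children_spawned_in):
        # Children already spawned in a member of A: --wait is meaningless.
        return "no flow-on"
    if wait:
        return "flow on after merge"
    return "flow on immediately"
```

The same function covers the specific-flows trigger by passing S in place of A.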

Trigger Specific Flows

cylc trigger --flow=1,2 [--wait]

The triggered task runs with the specified set of flow numbers, S = {1,2}

(As for active flows, with A replaced by S)
(Niche power tool for experts, if needed)

Trigger a New Flow

cylc trigger --flow=new

The triggered task runs with a new flow number, not in the set of active flows A (or any previous flow in fact).

it flows on immediately
it will not merge with any member of A that catches it, or that it catches
it will not merge with other flows (unless it is in the active set for a subsequent trigger event)
(--wait is meaningless)

Trigger No Flow

cylc trigger --flow=none

The triggered task runs with a "none" flow number.

one-off task run, no flow-on
it will not merge with any flow, ever
(--wait is meaningless)
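Pulling the four forms together, the `--flow` value might resolve to a set of flow numbers along these lines. Representing the "none" flow as an empty set and passing the next free flow number in as `next_new` are both assumptions made for this sketch, and `resolve_flow` is a hypothetical name.

```python
from __future__ import annotations

from typing import Optional


def resolve_flow(
    value: Optional[str], active: set[int], next_new: int
) -> set[int]:
    """Map a --flow option value to the flow numbers a triggered task gets.

    value: None (option not given), "new", "none", or e.g. "1,2".
    active: the set A of flow numbers currently in n=0.
    next_new: the next unused flow number (assumed to be chosen server side).
    """
    if value is None:
        return set(active)  # trigger active flows
    if value == "new":
        return {next_new}  # trigger a new flow
    if value == "none":
        return set()  # trigger no flow (assumed encoding)
    # Trigger specific flows, e.g. --flow=1,2
    return {int(num) for num in value.split(",")}
```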
