cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0

trigger: generalisation of triggering approaches #4686

Closed oliver-sanders closed 2 years ago

oliver-sanders commented 2 years ago

Related Issues:

If agreed this issue should supersede:

After a long chat with @dpmatthews (who proposed yet another triggering approach 😁) I think we can generalise the trigger problem into two dimensions:

Note: From the perspective of the internal implementation these two dimensions may appear to be flip sides of the same coin, since they both boil down to the flow_nums. However, considering them from a user standpoint, I think it's fair to prise them apart.

Note: Purposefully using new terminology to avoid conflation with existing terms, we may want to workshop "continue" and "overrun" a touch.

[1]: The quoted "merge" above relates to the interaction between two tasks with different flow_nums in general and not to the more specific concept of "flow merging" in the pool exclusively.

Combining these we get four spaces:

| | Continue | Don't Continue |
| --- | --- | --- |
| Overrun | (1) Reflow (as currently implemented) | (3) No Flow (current default trigger behaviour) |
| No Overrun | (2) Continue (@dpmatthews new proposed implementation) | (4) No Flow (@oliver-sanders proposed implementation) |

Going through the four spaces in detail:

1) Reflow (implemented)

Equivalent to cylc trigger --flow=<new-flow-number>.

Continue: Yes Overrun: Yes

The use case is for re-running over tasks which have been previously run e.g. change configuration and re-run a sub-graph.

2) Continue (proposed)

Equivalent to cylc trigger --flow=<all-flow-numbers>,<new-flow-number>.

Continue: Yes Overrun: No

This approach feels quite "natural". The use cases are setting off another bit of the same flow where you don't want tasks to be overrun.

3) No Flow (implemented)

Equivalent to cylc trigger --flow -1.

I am using a negative flow number rather than None to distinguish the two no-flow approaches. Internally we can still maintain the same no-flow logic as present but would need to change the marker.

Continue: No Overrun: Yes

Useful for running one-off tasks that you do not want to impact the workflow in any way (i.e. cylc submit type uses).

4) No Flow (proposed)

Equivalent to cylc trigger --flow -2.

I am using a negative flow number rather than None to distinguish the two no-flow approaches. Internally we can still maintain the same no-flow logic as present.

Continue: No Overrun: No

Use case is for manually intervening in graph execution by ignoring dependencies or runahead limit and skipping ahead to a task which you want to be considered a part of the approaching flow front.
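The four spaces above can be summarised as a lookup keyed on the two dimensions. This is only an illustrative sketch: `TriggerMode`, `TRIGGER_MODES` and `classify` are hypothetical names, not part of cylc-flow, and the `--flow` equivalents are copied verbatim from the descriptions above.

```python
from typing import NamedTuple


class TriggerMode(NamedTuple):
    """One cell of the continue/overrun grid (hypothetical helper type)."""
    number: int
    label: str
    flow_equivalent: str


# Keys are (continue, overrun) pairs; values restate the four cases above.
TRIGGER_MODES = {
    (True, True): TriggerMode(
        1, "Reflow (implemented)", "--flow=<new-flow-number>"),
    (True, False): TriggerMode(
        2, "Continue (proposed)",
        "--flow=<all-flow-numbers>,<new-flow-number>"),
    (False, True): TriggerMode(
        3, "No Flow (implemented)", "--flow -1"),
    (False, False): TriggerMode(
        4, "No Flow (proposed)", "--flow -2"),
}


def classify(continue_: bool, overrun: bool) -> TriggerMode:
    """Return the grid cell for a given combination of the two dimensions."""
    return TRIGGER_MODES[(continue_, overrun)]


if __name__ == "__main__":
    for key, mode in TRIGGER_MODES.items():
        print(key, mode.number, mode.label, mode.flow_equivalent)
```

Keeping the two booleans as the key mirrors the point that the dimensions are independent from a user standpoint, even if both end up as flow_nums internally.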

Interface

The internals to handle the four cases are already in-place, flow_nums, DB lookups etc, so it mostly boils down to an interface / documentation issue.

I think all four methods could be exposed via a single --flow argument, however, it is sensible to provide defaults for the different behaviours. I think it would be good to document the --flow equivalents as they may help users to understand their function.

Note that --reflow currently determines the new flow number on the server side rather than the client side, which is sensible.

1) Enable behaviours explicitly

If we are happy with the continue/overrun model (after workshopping the terms) we could expose it directly something like:

# 1) reflow
cylc trigger --continue --overrun

# 2) continue
cylc trigger --continue

# 3) no-flow (implemented)
cylc trigger --overrun

# 4) no-flow (proposed)
cylc trigger

This is quite nice as you have to explicitly opt in to each behaviour separately reducing the scope for unintended results and accidents.
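To make the opt-in idea concrete, here is a rough standalone sketch using Python's `argparse`. It assumes the proposed `--continue` and `--overrun` flags, which are only suggestions in this issue and not existing `cylc trigger` options; `build_parser` and `describe` are hypothetical names, and this is not Cylc's own option parser.

```python
import argparse

# Map (continue, overrun) to the four behaviours listed above.
BEHAVIOURS = {
    (True, True): "reflow",
    (True, False): "continue",
    (False, True): "no-flow (implemented)",
    (False, False): "no-flow (proposed)",
}


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with two independent opt-in flags."""
    parser = argparse.ArgumentParser(prog="trigger-sketch")
    # "continue" is a Python keyword, so store it under a different name.
    parser.add_argument(
        "--continue", dest="continue_", action="store_true",
        help="let the flow carry on from the triggered task")
    parser.add_argument(
        "--overrun", action="store_true",
        help="allow tasks that have already run to be run again")
    return parser


def describe(argv: list) -> str:
    """Return the behaviour name selected by the given arguments."""
    opts = build_parser().parse_args(argv)
    return BEHAVIOURS[(opts.continue_, opts.overrun)]


if __name__ == "__main__":
    print(describe(["--continue", "--overrun"]))
    print(describe([]))
```

Because both flags default to off, the bare command lands on case (4), which matches the "explicitly opt in" argument made above.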

2) Single --flow argument

If we don't like the continue/overrun model we could move the presets into the flow argument, something like:

# 1) reflow
cylc trigger --flow=new

# 2) continue
cylc trigger --flow=any

# 3) no-flow (implemented)
cylc trigger --flow=none

# 4) no-flow (proposed)
cylc trigger --flow=next

It's less behaviour driven so we would need to explain each option separately.

3) Separate flag for each approach

An alternative to (2) would be to come up with three or four different flags:

# 1) reflow
cylc trigger --reflow

# 2) continue
cylc trigger --flow

# 3) no-flow (implemented)
cylc trigger --rerun

# 4) no-flow (proposed)
cylc trigger  # --run

Default

I think no-continue & no-overrun is the safest, sanest default because:

But I'm biased. I think the default is less important than the clear separation of behaviours.

oliver-sanders commented 2 years ago

@hjoliver it is not clear what you are proposing, please could you fill out the above examples with your desired behaviour and highlight where they differ.

You seem to be suggesting the rules for what flow numbers are provided by --flow=all differ depending on whether the task has run before or not in contradiction with:

Agreed. And n=0 flow numbers should do for --flow=all

hjoliver commented 2 years ago

You seem to be suggesting the rules for what flow numbers are provided by --flow=all differ depending on whether the task has run before or not in contradiction with:

Agreed. And n=0 flow numbers should do for --flow=all

Not really, I'm saying current active flows (i.e. those in n=0) should be sufficient, c.f. all flows recorded in the DB.

With the small caveat (which is probably what caused the confusion here, sorry) that we should exclude flow numbers of flows that have already passed through the triggered task. That is what allows the default trigger to re-run a sub-graph (say) behind a flow (because the triggered task will not take the flow number of the flow that we are re-running, even if that flow number still exists in n=0).

please could you fill out the above examples with your desired behaviour and highlight where they differ.

OK, I'll try to do that now, since we desperately need to lay this one to rest. I wonder if this is gonna end up the longest single issue page on the project :-)

hjoliver commented 2 years ago

suggesting the rules for what flow numbers are provided by --flow=all differ depending on whether the task has run before or not

Also, I'd say the rules are exactly the same in both cases, it's just that in the never-ran-before case there is no previous flow number to exclude.

oliver-sanders commented 2 years ago

we should exclude flow numbers of flows that have already passed through the triggered task

So if there is only one flow in the workflow the task will not run at all.

If there are multiple flows in the workflow the "continue" trigger will result in a reflow irrespective of whether the other flow(s) are ahead of or behind the original?

Examples would be great.

hjoliver commented 2 years ago

So if there is only one flow in the workflow the task will not run at all.

No, see this comment:

and a new flow number (in case there are no existing flows that have not used the task already)
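One possible reading of the rule being proposed at this point in the thread, sketched in Python. The function name and arguments are hypothetical, and the exact fallback is an assumption: here a new flow number is used only when no active flow remains after the exclusion, which is one interpretation of the quoted comment.

```python
def proposed_flow_nums(active: set, already_ran: set, next_new: int) -> set:
    """Flow numbers for a triggered task under the rule discussed above.

    active      -- flow numbers currently present in n=0
    already_ran -- flow numbers that have already passed through the task
    next_new    -- an unused flow number (assumed to be supplied by the
                   scheduler; how it is chosen is outside this sketch)
    """
    # Exclude flows that have already used this task.
    candidates = set(active) - set(already_ran)
    # Assumption: fall back to a new flow only if nothing is left.
    if not candidates:
        candidates = {next_new}
    return candidates


if __name__ == "__main__":
    # Task never ran: it takes the active flow number.
    print(proposed_flow_nums({1}, set(), 2))
    # Task already ran in the only flow: it gets a new flow number.
    print(proposed_flow_nums({1}, {1}, 2))
```

Under this reading the single-flow case still runs the task, but with a new flow number, which is why it looks like a reflow for historical tasks.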

oliver-sanders commented 2 years ago

Ok, so this effectively changes the default to reflow for historical tasks.

I would much prefer for reflows to require users to opt in in all cases because the consequences of reflow on users' data are quite dangerous and reflow (and multiple flows in general) are way beyond what we can expect of the working knowledge of the vast majority of users.

hjoliver commented 2 years ago

If there are multiple flows in the workflow the "continue" trigger will result in a reflow irrespective of whether the other flow(s) are ahead or behind of the original?

(See my terminology comments above on what exactly "reflow" means)

So I think "the continue trigger" should, by definition, "continue", which means a flow should carry on from the triggered task.

The main thing, which we agreed on, is that by default that continuing flow should not get overrun by any existing flows (and I'm not arguing with that).

hjoliver commented 2 years ago

Ok, so this effectively changes the default to reflow for historical tasks.

Meh, sort of. My way is simpler from a consistency perspective (same behaviour on triggering a task, whether or not it ever ran before), and I think what matters and is easier to understand is whether the triggered task flows on or not. The fact that flowing on after triggering an n>0 task is not technically a "reflow" will be lost on most users. It will look like a new flow to them (now we have the original flow, and this new one from where I triggered a task) ... the fact that it happens to have the right flow numbers so that the original flow won't overrun it on catch-up, or that it is "not a reflow" because those tasks never ran before, is secondary.

hjoliver commented 2 years ago

And my other related point is that if you are triggering a past task to re-run it, you are just as likely to want it to flow on (the regenerate some products use case), as opposed to running a single task.

The re-run-a-single-task case seems to me to be best expressed by the non-default --flow=none option. For two reasons: 1) you want to trigger a single task, not a flow; and 2) my "flow integrity" argument above: a flow is a self-perpetuating run through the graph, and the previous flow already passed by ... so why should the re-triggered task have the same flow number?

hjoliver commented 2 years ago

I would much prefer for reflows to require users to opt in in all cases because the consequences of reflow on users' data are quite dangerous and reflow (and multiple flows in general) are way beyond what we can expect of the working knowledge of the vast majority of users.

I don't disagree that "reflow is dangerous" in the sense that it re-runs tasks and that will probably overwrite existing data. However:

  1. re-running a single task with no flow-on does that too; if you re-run anything you have to be aware of that consequence
  2. the graph shows what is supposed to happen downstream of any task, so it should not be very surprising if that happens unless you tell it not to. It is not so uncommon for Cylc 7 users to expect it to happen and then to struggle to understand how to make it happen via the nightmare of cylc inserting multiple waiting tasks in the right order.
  3. I don't think we should significantly complicate the conceptual flow model by going to lengths to avoid reflow

At least I think we probably both understand where the other is coming from now.

Because I was focused more on consistent triggering behaviour, when you agreed to go back to the no-wait default I thought that applied equally to future and past tasks. i.e. no-wait in front of flow=1 means "flow on now" (with all current flow numbers that could catch up and merge); and no-wait behind flow=1 means exactly the same thing.

Both generate a new flow front. The fact that one case involves re-running past tasks should be blindingly obvious to users because they deliberately triggered a task that already ran.

hjoliver commented 2 years ago

If you're not coming around to my perspective (which again, makes for simpler, consistent triggering behaviour and does not treat flow=1 as magic [SPECIAL]) then I suppose one way out of this bind is to revert to "wait" as a default. I'd rather not do that because a) it artificially constrains the workflow; and b) if it behaves as you want for re-running tasks, it makes the "wait" concept harder to understand (easy: wait for existing flows to catch up before continuing; weird: if only flow=1 exists and we trigger behind it, what are we "waiting" for??)

hjoliver commented 2 years ago

Example 1 (n>0)

(SAME RESULT in all cases)

Example 2 (n<0)

1) Reflow

SAME RESULT (A new flow is started which overruns the previous flow.)

2) Continue

DIFFERENT RESULT: same as 1) Reflow

The task "a" will get re-run by the trigger, and the graph WILL run on from there (that's what "continue" and "no wait" mean).

3) No Flow (implemented)

SAME RESULT

4) No Flow (proposed)

DIFFERENT RESULT: still same as type (2), but now that is the same as reflow rather than no-flow

oliver-sanders commented 2 years ago

If you're not coming around to my perspective (which again, makes for simpler, consistent triggering behaviour and does not treat flow=1 as magic)

Disagree on "simpler", "consistent" and "magic" 😁.

You're not winning me over I'm afraid. I see your points, but I don't agree with them. Since the start I've maintained that defaulting to reflow is dangerous and that all reflow functionality (and all its complex consequences e.g. no-flow) should be opt-in.

You are proposing that --flow=all can actually mean all flows, OR all flows and a new one minus an existing one, OR just a new flow, which isn't especially consistent.

If I understand correctly what you are proposing does not add any new functionality, it just changes the default. If so my interpretation covers all bases, but if you want a reflow you must manually say so.

hjoliver commented 2 years ago

You are proposing that --flow=all can actually mean, all flows OR all flows and a new one minus an existing one

That's kind of a misrepresentation because it ignores the definition of flow. A flow is a self-consistent self-perpetuating run through the graph. If a flow has passed by a task, retriggering it should be considered a new flow (or a one-off no-flow), because by definition that task has already run in that flow. You are saying, give the task the same flow number it had before but run it anyway, even though it has already run in that flow.

OR just a new flow, which isn't especially consistent.

My consistency is at the conceptual level. When you trigger a task, any task, does it flow on or not. This supposed inconsistency is down at the level of flow numbers which is really an implementation detail that we use to make the required behaviours work.

hjoliver commented 2 years ago

If I understand correctly what you are proposing does not add any new functionality, it just changes the default. If so my interpretation covers all bases, but if you want a reflow you must manually say so.

That's right, but we are coming from two different flow models (in a sense). By my conceptual model (which I'm claiming is simpler) your default is different behind the first flow than it is in front of it. (And it doesn't even seem to make sense with respect to the names that you gave the options: behind flow=1 the "continue" / no-wait default does not actually continue anything.)

oliver-sanders commented 2 years ago

I don't think we are going to get anywhere with this, suggest another call.

oliver-sanders commented 2 years ago

(otherwise it's going to be another ten pages of reply, quote and response)

hjoliver commented 2 years ago

Yep, can do 👍

hjoliver commented 2 years ago

OK, meeting done. Result: I concede defeat. 💥 Reasons, for the record:

Also, on terminology:

hjoliver commented 2 years ago

The final result then, for implementation.

(@oliver-sanders' explicit examples above are all valid and useful, and should be made into tests, but I think we can ditch the four-way categorization at this point).

Trigger Active Flows

cylc trigger [--wait]

The triggered task runs with the set of all active (n=0) flow numbers, A.

(It gets a bit gnarly to list exactly what happens when triggering ahead of all flows, behind all flows, and between flows ... but we don't need to do that here as it's all derivable from the above).

Trigger Specific Flows

cylc trigger --flow=1,2 [--wait]

The triggered task runs with the specified set of flow numbers, S = {1,2}

Trigger a New Flow

cylc trigger --flow=new

The triggered task runs with a new flow number, not in the set of active flows A (or any previous flow in fact).

Trigger No Flow

cylc trigger --flow=none 

The triggered task runs with a "none" flow number.
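The four final cases can be sketched as a single resolution function. This is an illustration under stated assumptions rather than the cylc-flow implementation: `resolve_flow_nums` is a hypothetical name, the "none" flow is represented here as an empty set, and the new flow number is taken as one more than the largest known flow number. The `--wait` option is not modelled.

```python
from typing import Optional, Set


def resolve_flow_nums(
    flow_option: Optional[str],
    active: Set[int],
    all_known: Set[int],
) -> Set[int]:
    """Pick flow numbers for a triggered task from a --flow value.

    flow_option -- None (no --flow given), "new", "none", or "1,2" style
    active      -- the set A of active (n=0) flow numbers
    all_known   -- every flow number ever used (active or previous)
    """
    if flow_option is None:
        # Trigger Active Flows: run with the full active set A.
        return set(active)
    if flow_option == "none":
        # Trigger No Flow: assumption, modelled as an empty set.
        return set()
    if flow_option == "new":
        # Trigger a New Flow: a number not used by any flow so far
        # (assumption: max + 1 is how the new number is chosen).
        return {max(all_known, default=0) + 1}
    # Trigger Specific Flows: the explicit set S, e.g. "1,2" -> {1, 2}.
    return {int(part) for part in flow_option.split(",")}


if __name__ == "__main__":
    active, known = {2, 3}, {1, 2, 3}
    for option in (None, "1,2", "new", "none"):
        print(option, resolve_flow_nums(option, active, known))
```

Deriving the new number from `all_known` rather than `active` reflects the note above that the new flow number must not match any previous flow either.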