cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0

broadcast: modify runtime of a succeeded task so you can trigger it #6308

Open ColemanTom opened 2 months ago

ColemanTom commented 2 months ago

Discussion took place in https://cylc.discourse.group/t/cylc-broadcast-is-being-cleared-automatically/991/3

I'm posting this here in the hope that it doesn't get lost or forgotten.

Two of us encountered this same issue today. When you do a cylc broadcast on a task which has already run and succeeded, the broadcast is automatically cleared.

I was using CYLC_VERSION=8.3.1 and my colleague was using CYLC_VERSION=8.3.3. We both first noticed it via the WUI, but we have shown that it also happens via the CLI. I don’t know the behaviour in 8.2.

From the scheduler log:

2024-07-31T02:00:40Z INFO - Command "broadcast" received.
    broadcast(cycle_points=['20240327T0000Z'], mode=put_broadcast, namespaces=['ct_to_simpler'], settings=[{'environment': {'FORCE_RESEND': 'no'}}])
2024-07-31T02:00:40Z INFO - Broadcast set:
    + [20240327T0000Z/ct_to_simpler] [environment]FORCE_RESEND=no
2024-07-31T02:00:40Z INFO - Broadcast cancelled:
    - [20240327T0000Z/ct_to_simpler] [environment]FORCE_RESEND=no

Pausing the workflow, triggering the task, and then broadcasting allowed the broadcast to be kept.

2024-07-31T02:04:48Z INFO - Pausing the workflow
2024-07-31T02:04:48Z INFO - Command "pause" actioned. ID=0ea0b0b8-5b94-4d06-b90b-81b43962e8a9
2024-07-31T02:05:06Z INFO - Command "force_trigger_tasks" received. ID=8055b39c-10e3-4ccf-97ac-ed3a9bdd5d17
    force_trigger_tasks(flow=['all'], flow_wait=False, tasks=['20240327T0000Z/ct_to_simpler'])
2024-07-31T02:05:07Z INFO - [20240327T0000Z/ct_to_simpler:waiting(runahead)] => waiting
2024-07-31T02:05:07Z INFO - [20240327T0000Z/ct_to_simpler:waiting] => waiting(queued)
2024-07-31T02:05:07Z INFO - Command "force_trigger_tasks" actioned. ID=8055b39c-10e3-4ccf-97ac-ed3a9bdd5d17
2024-07-31T02:06:07Z INFO - Command "broadcast" received.
    broadcast(cycle_points=['20240327T0000Z'], mode=put_broadcast, namespaces=['ct_to_simpler'], settings=[{'environment': {'FORCE_RESEND': 'no'}}])
2024-07-31T02:06:07Z INFO - Broadcast set:
    + [20240327T0000Z/ct_to_simpler] [environment]FORCE_RESEND=no
2024-07-31T02:07:13Z INFO - Command "resume" received. ID=d74fcf91-6398-4317-b4e2-a251c290029c
    resume()
2024-07-31T02:07:14Z INFO - RESUMING the workflow now
2024-07-31T02:07:14Z INFO - Command "resume" actioned. ID=d74fcf91-6398-4317-b4e2-a251c290029c
2024-07-31T02:07:14Z INFO - [20240327T0000Z/ct_to_simpler:waiting(queued)] => waiting
2024-07-31T02:07:14Z INFO - [20240327T0000Z/ct_to_simpler:waiting] => preparing
2024-07-31T02:07:19Z INFO - [20240327T0000Z/ct_to_simpler/05:preparing] submitted to user-dm_calthunder_d:pbs[6195899]
2024-07-31T02:07:19Z INFO - [20240327T0000Z/ct_to_simpler/05:preparing] => submitted
2024-07-31T02:07:27Z INFO - [20240327T0000Z/ct_to_simpler/05:submitted] => running
2024-07-31T02:07:31Z INFO - [20240327T0000Z/ct_to_simpler/05:running] => succeeded
2024-07-31T02:07:31Z INFO - [20240327T0000Z/ct_archive:succeeded] already finished and completed (flows=1))
2024-07-31T02:07:32Z INFO - Broadcast cancelled:
    - [20240327T0000Z/ct_to_simpler] [environment]FORCE_RESEND=no
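
The pause, trigger, broadcast, resume sequence in the log above could be scripted roughly as follows. This is only a sketch: the helper name `rerun_with_broadcast` and the workflow ID `my-workflow` are made up for illustration, and it assumes the Cylc 8 CLI (`cylc pause`, `cylc trigger`, `cylc broadcast` with `-p`/`-n`/`-s`, and `cylc play` to resume). Check `cylc broadcast --help` for your version before relying on it. The function only builds the command lines; it does not run them.

```python
import shlex


def rerun_with_broadcast(workflow, cycle, task, env):
    """Build the CLI commands for the pause/trigger/broadcast/resume workaround.

    Hypothetical helper: nothing here is part of cylc-flow itself.
    """
    settings = []
    for name, value in env.items():
        # One "-s" option per broadcast setting.
        settings += ["-s", f"[environment]{name}={value}"]
    return [
        ["cylc", "pause", workflow],
        # Trigger first so the task is back in the pool before broadcasting.
        ["cylc", "trigger", f"{workflow}//{cycle}/{task}"],
        ["cylc", "broadcast", "-p", cycle, "-n", task, *settings, workflow],
        # Assumption: in Cylc 8 a paused workflow is resumed with "cylc play".
        ["cylc", "play", workflow],
    ]


if __name__ == "__main__":
    commands = rerun_with_broadcast(
        "my-workflow", "20240327T0000Z", "ct_to_simpler", {"FORCE_RESEND": "no"}
    )
    for cmd in commands:
        print(shlex.join(cmd))
```

Printing the commands rather than executing them makes it easy to review the sequence before pasting it into a shell.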

What were we trying to do? Say we generated some data and a later task found something wrong with it, or the disk got corrupted, and we want to rerun the task with a slightly different setting, perhaps to delete the old data. We modify an environment variable in the runtime and run the task. Because we can see the task in the WUI and can broadcast to it via the CLI, we were under the impression that we could just do the broadcast and then trigger the task, but that does not work. We can’t see a way to trigger a new task with a modified runtime, so our only option appears to be to pause the whole suite. We did not test hold -> trigger -> broadcast -> release.

Desired behaviour

We would like a way to modify the runtime of a succeeded task and trigger it without having to pause, trigger, edit runtime, unpause a workflow to do it.

oliver-sanders commented 2 months ago

I don't think that a broadcast made to a succeeded task should be automatically cleared. This may be leftover Cylc 7 logic; it really doesn't play well with the Cylc 8 concept of new flows (which may have the same broadcast requirements as the original flow).

Hopefully an easy fix.

> We would like a way to modify the runtime of a succeeded task and trigger it without having to pause, trigger, edit runtime, unpause a workflow to do it.

Note, cylc vr can achieve this.

wxtim commented 2 months ago

Currently broadcasts are "expired" (removed) when the cycles to which they refer have passed. Disabling this seems simple, but I don't think that we want to allow broadcasts to accumulate forever without housekeeping.

I'm going to propose that tasks can consume broadcasts, somewhere towards the run of a task.

Question - should the broadcast be considered cleanable on completion or success?

oliver-sanders commented 2 months ago

> Question - should the broadcast be considered cleanable on completion or success?

That's the idea.

When the task is removed from the task-pool as completed, the broadcast can be cleared.

(note task output completion and task pool removal aren't strictly equivalent)
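
To make the proposal concrete, here is a toy model of "clear the broadcast when the task is removed from the pool as completed". It is a sketch of the idea as described in this thread, not cylc-flow code: the class `ToyScheduler` and all of its methods are hypothetical, and real broadcasts also cover families and multiple cycle points, which this ignores.

```python
class ToyScheduler:
    """Hypothetical model: a broadcast persists until its target leaves the pool."""

    def __init__(self):
        self.broadcasts = {}  # (cycle, task) -> environment settings
        self.pool = set()     # (cycle, task) pairs currently in the task pool

    def broadcast(self, cycle, task, settings):
        # Stored even if the target has already succeeded and left the pool.
        self.broadcasts.setdefault((cycle, task), {}).update(settings)

    def trigger(self, cycle, task):
        # Put the task back in the pool and return the settings its job would see.
        self.pool.add((cycle, task))
        return dict(self.broadcasts.get((cycle, task), {}))

    def remove_completed(self, cycle, task):
        # Housekeeping happens on pool removal, not when the cycle point passes.
        self.pool.discard((cycle, task))
        self.broadcasts.pop((cycle, task), None)


if __name__ == "__main__":
    sched = ToyScheduler()
    target = ("20240327T0000Z", "ct_to_simpler")
    sched.broadcast(*target, {"FORCE_RESEND": "no"})  # task already succeeded
    print(sched.trigger(*target))                     # rerun sees the broadcast
    sched.remove_completed(*target)
    print(sched.broadcasts)                           # broadcast has been cleared
```

Under this model, broadcast followed by trigger works without pausing the workflow, and broadcasts still get housekept once the rerun completes.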

simonathompson commented 3 weeks ago

Just wanted to add that I've just had a chat with Ronnie in the Cylc surgery describing this issue (completely independently). My context is a regular cycling workflow with no cycle-to-cycle dependencies, where I want to rerun a past cycle. At the beginning of a cycle I issue a broadcast to the family that over-arches all the other tasks in the cycle, which effectively tells it which HPC hall to point to. If I try this in cylc 8.3.3 the broadcast just vanishes as there are no active tasks, and because it's the first thing a cycle does. Fortunately I can get it to do what I want using a new cylc flow. Either way, you end up either housekeeping broadcasts or housekeeping flows! Thanks.

hjoliver commented 3 weeks ago

> If I try this in cylc 8.3.3 the broadcast just vanishes as there are no active tasks,

Broadcast information is held in the scheduler, to be used for future tasks, which don't have to be in the active pool yet. So lack of active tasks isn't the problem here; it's that you're targeting past tasks that you want to rerun, and the broadcast clearing mechanism assumes that past tasks are done with, so the broadcast can be cleared. (Which is indeed exactly what this issue is about ... just clarifying the logic in this case).
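
The clearing logic described above can be illustrated with a small model. This is a simplification under stated assumptions, not the actual cylc-flow implementation: the function `expire_broadcasts` is hypothetical, the cutoff is assumed to be the oldest cycle point among tasks in the pool, and cycle points are compared as same-format ISO 8601 strings (which sort chronologically).

```python
def expire_broadcasts(broadcasts, pool_points):
    """Split broadcasts into (kept, expired) using a cycle-point cutoff.

    broadcasts: dict mapping (cycle_point, namespace) -> settings
    pool_points: cycle points of the tasks currently in the pool
    Assumption: anything targeting a point older than the oldest pool
    point is treated as "done with" and expired.
    """
    if not pool_points:
        # Toy choice: with nothing in the pool there is no cutoff to apply.
        return dict(broadcasts), {}
    cutoff = min(pool_points)
    kept, expired = {}, {}
    for (point, namespace), settings in broadcasts.items():
        bucket = expired if point < cutoff else kept
        bucket[(point, namespace)] = settings
    return kept, expired


if __name__ == "__main__":
    bc = {("20240327T0000Z", "ct_to_simpler"): {"FORCE_RESEND": "no"}}
    # Target cycle is in the past relative to the pool: broadcast expires.
    print(expire_broadcasts(bc, ["20240731T0000Z"]))
    # After triggering, the past task is back in the pool: broadcast is kept.
    print(expire_broadcasts(bc, ["20240327T0000Z", "20240731T0000Z"]))
```

The second call mirrors the workaround reported in this issue: triggering the task first moves the cutoff back, so a later broadcast survives until that task finishes.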