cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0

hold: unhold completed tasks with final task statuses #6141

Open oliver-sanders opened 2 months ago

oliver-sanders commented 2 months ago

If a task is manually killed, we put it into the held state.

This is an intentional feature as it suppresses automatic retries which are not likely to be desired in this case.

However, when the task reaches a final status, completes its outputs and is removed from the task pool, it remains in the held state in the data store. This is OK but a tad counterintuitive.

Suggest resetting the held state before removing a task from the pool to ensure that the task is not marked as held in the data store when it drifts into the n=1 window.

I don't think we preserve the held state once a task has left the pool (fact check this), so I think this is a more correct solution. Otherwise you might expect this task to remain held in a subsequent flow.
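A minimal sketch of the suggestion, using a toy model rather than cylc-flow's actual classes. `Task`, `DataStore`, `TaskPool` and their methods are hypothetical names invented for illustration; the only point is the ordering: clear the held flag, publish that to the data store, then remove the task.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Toy stand-in for a task (hypothetical, not a cylc-flow class)."""
    name: str
    status: str = "waiting"
    is_held: bool = False


@dataclass
class DataStore:
    """Toy data store: remembers the last held flag published per task."""
    held: dict = field(default_factory=dict)

    def publish(self, task: Task) -> None:
        self.held[task.name] = task.is_held


@dataclass
class TaskPool:
    """Toy n=0 task pool (hypothetical)."""
    store: DataStore
    tasks: dict = field(default_factory=dict)

    def add(self, task: Task) -> None:
        self.tasks[task.name] = task
        self.store.publish(task)

    def kill(self, name: str) -> None:
        # A manual kill fails the task and holds it to suppress retries.
        task = self.tasks[name]
        task.status = "failed"
        task.is_held = True
        self.store.publish(task)

    def remove(self, name: str) -> Task:
        # Proposed ordering: release the hold and publish it *before*
        # the task leaves the pool, so the data store is not left stale.
        task = self.tasks.pop(name)
        if task.is_held:
            task.is_held = False
            self.store.publish(task)
        return task


store = DataStore()
pool = TaskPool(store)
pool.add(Task("a"))
pool.kill("a")
held_while_in_pool = store.held["a"]
pool.remove("a")
print(held_while_in_pool, store.held["a"])  # True False
```

Without the release step in `remove`, the last value published for the task would stay `True`, which is the stale "held" marker described above.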

dwsutherland commented 1 month ago

> I don't think we preserve the held state once a task has left the pool (fact check this)

I think we do. Take this one:

[scheduler]
    UTC mode = True
    allow implicit tasks = True
[scheduling]
    initial cycle point = 20240101T00
    [[graph]]
        T-00 = """
a[-PT1H] => a => b?
b? => c
b:failed? => d
"""
[runtime]
    [[root]]
        script = sleep $((5 + $RANDOM % 10))
    [[b]]
        post-script = """
if ((1 + $RANDOM % 10  < 6 )); then
    false
fi
"""

If you hold both c and d, one of them will disappear depending on the output of b, but if you trigger b again until the other path is taken, the respective task will reappear held.

This is arguably the correct behaviour.

Interestingly, the other task sticks around even when the output of b changes. This is to be expected, but possibly not desirable.

Perhaps we should test for retries before putting a killed task on hold?

hjoliver commented 1 month ago

I'm not sure your example applies to the issue, @dwsutherland?

> I don't think we preserve the held state once a task has left the pool (fact check this) [*]

If you hold c and d, they get added to the scheduler's "future hold" list, to be held if/when they get spawned into n=0.

Retriggering b until the other output is generated will spawn the other downstream task into n=0, after which both of them will be in n=0 and held. Neither of them leaves the task pool, which is what would be needed to test the statement [*].

Having both c and d in the hold list despite them being on mutually exclusive branches is correct, because the user asked for it (and of course retriggering can cause the other branch to spawn).

hjoliver commented 1 month ago

Extending your scenario a bit (beyond what you described at least):

So the behaviour is consistent with what the datastore shows.

But we should consider releasing the hold when a task is removed from n=0 so that the hold only applies to the next flow (and not all subsequent flows after that). Would users really want a task to remain held even if a new flow comes along later?
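The "future hold" behaviour and the proposed release on removal can be modelled roughly as follows. This is a toy sketch, not cylc-flow's implementation: the `Scheduler` class, its attributes, and the `release` flag are all hypothetical names chosen for illustration.

```python
class Scheduler:
    """Toy scheduler (hypothetical) that models only hold bookkeeping."""

    def __init__(self):
        self.pool = {}          # n=0 window: task name -> is_held
        self.hold_list = set()  # tasks to hold if/when they are spawned

    def hold(self, name):
        # Remember the request; apply it now if the task is already in n=0.
        self.hold_list.add(name)
        if name in self.pool:
            self.pool[name] = True

    def spawn(self, name):
        # A task spawned into n=0 is held if the user asked for that earlier.
        self.pool[name] = name in self.hold_list

    def remove(self, name, release=False):
        # release=True models the proposal: forget the hold when the task
        # leaves n=0, so it does not carry over to later flows.
        self.pool.pop(name, None)
        if release:
            self.hold_list.discard(name)


sched = Scheduler()
sched.hold("c")
sched.hold("d")
sched.spawn("c")  # b succeeded
sched.spawn("d")  # b was retriggered and failed
print(sched.pool)  # {'c': True, 'd': True}

sched.remove("d", release=True)
sched.spawn("d")  # a later flow reaches d again
print(sched.pool["d"])  # False
```

With `release=False`, the second `spawn("d")` would come back held, which is the current behaviour as described in this thread.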