If a task fails due to exceeding the wallclock limit, it can return to the running state if it is polled before it has exited the queue.
The following workflow illustrates the problem:
```
[scheduling]
    [[graph]]
        R1 = "timeout:started => poll"
[runtime]
    [[timeout]]
        script = "sleep 600"
        err-script = "sleep 180"
        platform = remote-hpc
        execution time limit = PT30S
    [[poll]]
        # Poll the task after it has failed but before it exits the queue
        script = "sleep 50; cylc poll $CYLC_SUITE_NAME timeout.1"
```
(Note: `err-script` keeps the job in the queue until the workload manager kills it.)

I performed this test on our HPC using PBS, but the same result should occur with, e.g., Slurm.
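The race can be seen without an HPC at all. The following is a purely illustrative sketch (not Cylc's actual code; the function name is invented) of why a poller that trusts "job still present in the queue" over the already-received failure message resurrects the task. In the workflow above, the job fails at roughly t=30s, the `err-script` keeps it in the queue until about t=210s, and the poll at t=50s lands in between:

```python
def poll_result(job_in_queue: bool, last_message: str) -> str:
    """Illustrative only: a poller that maps 'job visible in the queue'
    straight to 'running' overrides a failure message already received."""
    if job_in_queue:
        return "running"  # failed task resurrected: the bug reported here
    return last_message

# t=50s: err-script ("sleep 180") is still sleeping, so qstat shows the job
assert poll_result(job_in_queue=True, last_message="failed") == "running"

# Once the job has left the queue, the poll agrees with the message
assert poll_result(job_in_queue=False, last_message="failed") == "failed"
```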
The ability for a failed task to be returned to 'submitted' or 'running' as a result of polling was part of #1792.
However, I'm struggling to see why we want to allow this.
It can never be safe, given that the task has already entered the failed state and may already have triggered other tasks as a result.
If the idea is to support tasks which might be rerun by the workload manager, then I think we would need to modify Cylc not to change state after receiving the failure message until the failure has been confirmed by a subsequent poll. This might cause undesirable delays, so I think we would need to (re)introduce an "allow resurrection" setting if we want to support this.
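To make the suggestion concrete, here is a minimal sketch of that "confirm before committing" idea. This is not Cylc's API (the class and method names are invented): a failure message is held as pending rather than triggering downstream `:fail` outputs, and only a subsequent poll decides whether to commit the failure or treat the job as rerun by the workload manager:

```python
class TaskState:
    """Illustrative sketch: hold a failure message until a poll confirms it."""

    def __init__(self):
        self.state = "running"
        self.pending_failure = False

    def on_message(self, message: str):
        if message == "failed":
            # Don't change state or trigger downstream tasks yet;
            # wait for a poll to confirm the job has really gone.
            self.pending_failure = True
        else:
            self.state = message

    def on_poll(self, job_in_queue: bool):
        if self.pending_failure:
            if job_in_queue:
                # Workload manager may rerun the job ("resurrection"):
                # stay running and discard the unconfirmed failure.
                self.pending_failure = False
            else:
                # Confirmed: now it is safe to commit the failed state.
                self.state = "failed"
```

The delay between message and poll is exactly the "undesirable delay" mentioned above, which is why this behaviour would presumably sit behind an opt-in setting.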
Alternatively, if supporting resurrection isn't considered important (for the moment at least) then we should disable/remove the relevant code.
I'm confused about the changes made in #2396 (to close #1792).
The documentation change implies you need to use `cylc reset` to handle preempted tasks.
In that case, why was the code changed to allow any task to return from the failed state?
And why allow polling of succeeded and failed tasks?
We need to clarify what we're trying to support before agreeing how to address this bug.
This is clearly a bug.
I think this is the relevant code which causes the problem: https://github.com/cylc/cylc-flow/blob/8.0b3/cylc/flow/task_job_mgr.py#L846. Note that this only occurs for jobs which exit with TERM (not ERR or EXIT).
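As I read it, the behaviour amounts to something like the following toy sketch (heavily simplified and illustrative only; the function name is invented and this is not the code at the link above): a poll of a job still visible in the queue whose failure was recorded as a TERM-signal kill is mapped back to running, whereas ERR/EXIT failures are not:

```python
def state_after_poll(failure_signal: str, job_in_queue: bool) -> str:
    """Illustrative only: sketch of the observed TERM-specific behaviour.

    failure_signal: the signal recorded when the job failed
                    ("TERM", "ERR" or "EXIT").
    job_in_queue:   whether the workload manager still lists the job.
    """
    if job_in_queue and failure_signal == "TERM":
        return "running"  # the resurrection path this issue reports
    return "failed"
```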
Note that this issue affects Cylc 7 as well.