cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0
334 stars 94 forks source link

Polling can incorrectly return a failed task to the running state #4513

Open dpmatthews opened 3 years ago

dpmatthews commented 3 years ago

If a task fails due to exceeding the wallclock limit it can return to the running state if it is polled before it has exited the queue. The following workflow illustrates the problem:

[scheduling]
    [[graph]]
        R1 = "timeout:started => poll"
[runtime]
    [[timeout]]
        script = "sleep 600"
        err-script = "sleep 180"
        platform = remote-hpc
        execution time limit = PT30S
    [[poll]]
        # Poll the task after it has failed but before it exits the queue
        script = "sleep 50; cylc poll $CYLC_SUITE_NAME timeout.1"

(note: err-script keeps the job in the queue until the workflow manager kills it) I performed this test with our HPC using PBS but the same result should occur with, e.g Slurm.

Relevant log messages:

2021-11-15T19:15:37Z CRITICAL - [timeout.1 running job:01 flows:1] (received)failed/TERM at 2021-11-15T19:15:36Z
2021-11-15T19:15:37Z INFO - [timeout.1 running job:01 flows:1] => failed
2021-11-15T19:15:46Z DEBUG - [jobs-poll cmd] ssh -oBatchMode=yes -oConnectTimeout=8 -oStrictHostKeyChecking=no hpc env CYLC_VERSION=8.0b3.dev CYLC_ENV_NAME=cylc-8.0b3.dev bash --login -c ''"'"'exec "$0" "$@"'"'"'' cylc jobs-poll --debug -- '$HOME/cylc-run/bug.polling3.c8/run8/log/job' 1/timeout/01
    [jobs-poll ret_code] 0
    [jobs-poll out] [TASK JOB SUMMARY]2021-11-15T19:15:46Z|1/timeout/01|{"job_runner_name": "pbs", "job_id": "5698928", "job_runner_exit_polled": 0, "run_status": 1, "run_signal": "TERM", "time_submit_exit": "2021-11-15T19:14:34Z", "time_run": "2021-11-15T19:14:42Z", "time_run_exit": "2021-11-15T19:15:36Z"}
2021-11-15T19:15:46Z INFO - [timeout.1 failed job:01 flows:1] (polled)started at 2021-11-15T19:14:42Z
2021-11-15T19:15:46Z INFO - [timeout.1 failed job:01 flows:1] => running

This is clearly a bug.

I think this is the relevant code which causes the problem: https://github.com/cylc/cylc-flow/blob/8.0b3/cylc/flow/task_job_mgr.py#L846 Note that this only occurs for jobs which exit with TERM (not ERR or EXIT).

The ability for a failed task to be returned to 'submitted' or 'running' as a result of polling was part of #1792. However, I'm struggling to see why we want to allow this. It can never be safe given that the task has already entered the failed state and potentially triggered other tasks as a result.

If the idea is to support tasks which might be rerun by the workload manager then I think we would need to modify Cylc to not change state after receiving the failure message until the failure has been confirmed by a subsequent poll. This might cause undesirable delays so I think we would need to (re)introduce an "allow resurrection" setting if we want to support this.

Alternatively, if supporting resurrection isn't considered important (for the moment at least) then we should disable/remove the relevant code.

I'm confused about the changes made in #2396 (to close #1792). The documentation change implies you need to use "cylc reset" to handle preempted tasks. In that case, why was the code changed to allow any task to return from the failed state? And why allow polling of succeeded and failed tasks?

We need to clarify what we're trying to support before agreeing how to address this bug.

Note that this issue affects Cylc 7 as well.

hjoliver commented 3 years ago

Ouch, yes better fix this.

oliver-sanders commented 2 years ago

Duplicate of https://github.com/cylc/cylc-flow/issues/4516

Keeping both copies as they are tagged against different Cylc versions.

dpmatthews commented 2 years ago

Duplicate of #4516

No - that was a different bug