cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0
326 stars 93 forks source link

Handle restart with changed sequence better #6154

Open MetRonnie opened 2 months ago

MetRonnie commented 2 months ago

Description

Restarting a workflow with a waiting or orphaned task that is no longer on sequence (due to flow.cylc having been changed) can result in traceback.

Reproducible Example

Run this workflow:

[scheduling]
    cycling mode = integer
    initial cycle point = 1
    [[graph]]
        P1 = foo[-P1] => foo
[runtime]
    [[foo]]
        script = """
            if [[ $CYLC_TASK_CYCLE_POINT == 2 ]]; then
                cylc stop "$CYLC_WORKFLOW_ID" --now;
            fi
        """

When it shuts down, edit flow.cylc:

     [[graph]]
-        P1 = foo[-P1] => foo
+        R1 = foo[-P1] => foo
 [runtime]

Then restart the workflow to get:

$ cylc vr <id>
$ cylc cat <id>
ERROR - list index out of range
    Traceback (most recent call last):
      File "~/github/cylc-flow/cylc/flow/scheduler.py", line 742, in start
        await self.configure(params)
      File "~/github/cylc-flow/cylc/flow/scheduler.py", line 490, in configure
        self._load_pool_from_db()
      File "~/github/cylc-flow/cylc/flow/scheduler.py", line 792, in _load_pool_from_db
        self.workflow_db_mgr.pri_dao.select_task_pool_for_restart(
      File "~/github/cylc-flow/cylc/flow/rundb.py", line 932, in select_task_pool_for_restart
        platform_error = callback(row_idx, list(row))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "~/github/cylc-flow/cylc/flow/task_pool.py", line 597, in load_db_task_pool_for_restart
        self.compute_runahead()
      File "~/github/cylc-flow/cylc/flow/task_pool.py", line 400, in compute_runahead
        limit_point = sorted(sequence_points)[:ilimit + 1][-1]
                      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^
    IndexError: list index out of range
CRITICAL - Workflow shutting down - list index out of range

Expected Behaviour

No traceback; more helpful error message.

Need to handle sequence_points being empty here: https://github.com/cylc/cylc-flow/blob/3962f3d4df171f7ffde4f233e8859abedb3ad451/cylc/flow/task_pool.py#L400

oliver-sanders commented 1 month ago

@MetRonnie, I've not been able to reproduce this one, have tried 8.3.x, 8.3.0 and 8.2.3.

Might need a tweak to the example.

It's possible that this PR might help: https://github.com/cylc/cylc-flow/pull/6229

MetRonnie commented 1 month ago

I can still reproduce on 8.3.2 and 8.3.x as of dc9f01b72ecb022e12d024aaf7c0c0a32a5c6625, from copying the example above.

oliver-sanders commented 1 month ago

Ach, blindingly obvious, I didn't reinstall! #6229 does not appear to fix this :(

oliver-sanders commented 1 month ago

This is very similar to #6229, the compute_runahead algorithm is (correctly) erroring due to a complete lack of tasks.

This latter case should be extremely rare and is an error case in the first place so fairly low priority.

Easy fix however!

With this diff the workflow will shutdown gracefully on restart/reload:

diff --git a/cylc/flow/task_pool.py b/cylc/flow/task_pool.py
index f1a6942f0..c992e880b 100644
--- a/cylc/flow/task_pool.py
+++ b/cylc/flow/task_pool.py
@@ -398,6 +398,9 @@ class TaskPool:
             self._prev_runahead_sequence_points = sequence_points
             self._prev_runahead_base_point = base_point

+        if len(sequence_points) == 0:
+            # no cycles to schedule
+            return False
         if count_cycles:
             # (len(list) may be less than ilimit due to sequence end)
             limit_point = sorted(sequence_points)[:ilimit + 1][-1]

However, a reload that essentially empties the task pool is sure to be a mistake, so going ahead with the restart/reload and wiping out the pool making it difficult to recover the workflow is not necessarily the best move.

We could consider raising an error here (examples of this form are the only confirmed way to hit this bug). This would cause the restart/reload to fail with an informative message:

diff --git a/cylc/flow/task_pool.py b/cylc/flow/task_pool.py
index f1a6942f0..647353378 100644
--- a/cylc/flow/task_pool.py
+++ b/cylc/flow/task_pool.py
@@ -398,6 +398,9 @@ class TaskPool:
             self._prev_runahead_sequence_points = sequence_points
             self._prev_runahead_base_point = base_point

+        if len(sequence_points) == 0:
+            # no cycles to schedule
+            raise WorkflowConfigError('No tasks scheduled to run')
         if count_cycles:
             # (len(list) may be less than ilimit due to sequence end)
             limit_point = sorted(sequence_points)[:ilimit + 1][-1]

Which will it be?

1) Do what the user asked, log something helpful. 2) Prevent the operation from being actioned, raise a helpful error.

MetRonnie commented 1 month ago

Option 2 is no worse than the current behaviour so could at least be implemented as a quick address of the issue.

Edit: however, with the option 2 patch, I've just seen that reloading a workflow after changing

     [[graph]]
-        P1 = foo[-P1] => foo
+        R1 = foo[-P1] => foo

causes the workflow to get stuck in the reloading state after the WorkflowConfigError is raised. (This is pretty identical to the current behaviour).

oliver-sanders commented 1 month ago

Errors raised during reload should cause the reload to be aborted. The error should be logged and the workflow should continue with the original config.

If that's not happening, we should fix it. I think the reload aborts correctly on master (due to the error in compute_runahead).

MetRonnie commented 1 month ago

It aborts, however the workflow status gets stuck as reloading