cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0

task reset from succeeded to running #6315

Open oliver-sanders opened 3 weeks ago

oliver-sanders commented 3 weeks ago

There seems to be some circumstance where succeeded tasks can go back to running (apparently in the same main-loop iteration).

This log is typical of the issue:

$ grep 20190401T0000Z/archive_logs log/scheduler/log
2024-07-29T22:59:59Z INFO - [20190401T0000Z/archive_logs:waiting(runahead)] => waiting
2024-07-30T13:17:30Z INFO - [20190401T0000Z/archive_logs:waiting] => waiting(queued)
2024-07-30T13:17:30Z INFO - [20190401T0000Z/archive_logs:waiting(queued)] => waiting
2024-07-30T13:17:31Z INFO - [20190401T0000Z/archive_logs:waiting] => preparing
2024-07-30T13:17:43Z INFO - [20190401T0000Z/archive_logs/01:preparing] submitted to xce:pbs[2492608]
2024-07-30T13:17:44Z INFO - [20190401T0000Z/archive_logs/01:preparing] => submitted
2024-07-30T13:18:05Z INFO - [20190401T0000Z/archive_logs/01:submitted] => running
2024-07-30T13:18:39Z INFO - [20190401T0000Z/archive_logs/01:running] => succeeded
2024-07-30T13:18:39Z INFO - [20190401T0000Z/archive_logs/01:running(runahead)] => running

Every time we see the confusing running(runahead) => running transition. There are no obvious exacerbating circumstances.
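
A minimal sketch of how a scheduler log could be scanned for this pattern, assuming the line format shown in the log above. The `find_resets` helper and the regular expression are illustrative and are not part of Cylc. It flags any state change logged for a job after that same job has already been logged as succeeded:

```python
import re

# Matches state-change lines in the format shown in the log above, e.g.
# 2024-07-30T13:18:39Z INFO - [20190401T0000Z/archive_logs/01:running] => succeeded
# (assumption: other log lines, such as "submitted to ...", are ignored).
TRANSITION = re.compile(
    r"^(?P<time>\S+) \w+ - "
    r"\[(?P<cycle>[^/\]]+)/(?P<task>[^/:\]]+)(?:/(?P<job>\d+))?"
    r":(?P<old>[^\]]+)\] => (?P<new>.+)$"
)


def base_state(status):
    """Strip a modifier such as "(runahead)" or "(queued)" from a status."""
    return status.split("(", 1)[0]


def find_resets(lines):
    """Yield (task_id, time, old, new) for transitions logged after success.

    task_id includes the job number when the log line has one, so a later
    job of the same task is tracked separately from an earlier one.
    """
    succeeded = set()
    for line in lines:
        match = TRANSITION.match(line.strip())
        if match is None:
            continue
        task_id = "/".join(
            part
            for part in (match["cycle"], match["task"], match["job"])
            if part
        )
        new = match["new"]
        if task_id in succeeded and base_state(new) != "succeeded":
            yield task_id, match["time"], match["old"], new
        if base_state(new) == "succeeded":
            succeeded.add(task_id)
```

Passing the lines of log/scheduler/log to `find_resets` should list candidate occurrences across all tasks rather than one grep per task. Anything it reports is only a candidate and still needs checking against the surrounding log.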

The bug is intermittent. We appear to have one example workflow that hits it relatively regularly, though I have not been able to reproduce it myself yet.

hjoliver commented 3 weeks ago

It might be useful to see the whole log, not just lines associated with the affected task. Maybe there was some kind of manual intervention that affected runahead tasks, e.g. triggering a runahead task? I'm not sure that we're properly removing task attribute labels like "queued" and "runahead" every time we (e.g.) force a runahead-limited task to run.

oliver-sanders commented 3 weeks ago

[edit] crossed wires with another bug report

> It might be useful to see the whole log, not just lines associated with the affected task.

The logs are rather long!

There were a couple of indiscriminate triggers, targeting all tasks in the affected cycle, in the run-up to the issue. I don't think these included any runahead tasks, as the cycle was inside the runahead limit at that point.

There were also a couple of kill commands, but these were more targeted and did not affect the task in question.

oliver-sanders commented 2 weeks ago

I haven't managed to make head or tail of this one yet. One user's workflows have encountered this a few times, but they are on leave at the moment. Hopefully, when they return, we can get them to run these workflows in debug mode, which might give us a better chance of debugging.