pool: update DB after removing a task

oliver-sanders commented 1 month ago

Closes https://github.com/cylc/cylc-flow/issues/6315
If the DB is not updated after a task is removed, the task can be respawned in its previous state as a result of upstream output completion.

This fixes the "running(runahead) => running" bug that could cause tasks to get stuck in the running state indefinitely.
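As a rough illustration of the failure mode described above, here is a toy model, not cylc-flow's actual code. The `task_pool` table layout and the `set_status`, `remove` and `spawn` helpers are all hypothetical. It assumes that spawning restores whatever state the DB last recorded for a task.

```python
import sqlite3

# Toy model only: the schema and function names are illustrative
# assumptions, not cylc-flow's real schema or API.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE task_pool (task TEXT PRIMARY KEY, status TEXT)")

pool = {}  # in-memory pool: task id -> status


def set_status(task, status):
    """Update the in-memory pool and persist the new state."""
    pool[task] = status
    db.execute("INSERT OR REPLACE INTO task_pool VALUES (?, ?)", (task, status))
    db.commit()


def remove(task, update_db):
    """Remove a task from the pool, optionally recording the removal in the DB."""
    pool.pop(task)
    if update_db:
        db.execute("DELETE FROM task_pool WHERE task = ?", (task,))
        db.commit()


def spawn(task):
    """Spawn a task, restoring its last recorded state if the DB has one."""
    row = db.execute(
        "SELECT status FROM task_pool WHERE task = ?", (task,)
    ).fetchone()
    pool[task] = row[0] if row else "waiting"
    return pool[task]


set_status("1/foo", "running")

# Without the DB update, the stale "running" row is picked up on respawn.
remove("1/foo", update_db=False)
stale = spawn("1/foo")  # "running"

# With the DB update, the respawned task starts from a clean state.
remove("1/foo", update_db=True)
fresh = spawn("1/foo")  # "waiting"
```

In this sketch the only difference between the two runs is whether the removal reaches the DB, which is the behaviour this PR changes.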

Performing a DB write every time a task completes is going to be a performance hit since this is performed per-task not per-main-loop-cycle (i.e. there is no batching efficiency gain). I'm not sure what we can do about this without changes to the task_pool data model. Any ideas?

Check List

[x] I have read CONTRIBUTING.md and added my name as a Code Contributor.
[x] Contains logically grouped changes (else tidy your branch by rebase).
[x] Does not contain off-topic changes (use other PRs for other changes).
[x] Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
[x] Tests are included (or explain why tests are not needed).
[ ] Changelog entry included if this is a change that can affect users
[x] Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
[x] If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

hjoliver commented 1 month ago

> Performing a DB write every time a task completes is going to be a performance hit since this is performed per-task not per-main-loop-cycle

Hopefully not too bad, because task completion, even of family members, tends to be staggered rather than all-at-once.

> I'm not sure what we can do about this without changes to the task_pool data model. Any ideas?

Yeah, batching DB ops for efficiency can be problematic for "live" data (i.e., not just for the historical record), if there's any chance of certain events occurring between DB updates.

To avoid this kind of bug I guess we have to either:

- write live info per event and only batch-write the historical-record tables
- batch-write and batch-spawn in the main loop (yuck?)
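A minimal sketch of the first option, under stated assumptions: the `WorkflowDB` class, its method names, and the `live_state` and `history` tables are all hypothetical and not part of cylc-flow. Live state is committed on every event, while historical rows are queued and written once per main-loop iteration.

```python
import sqlite3


class WorkflowDB:
    """Hypothetical writer: per-event writes for live state,
    batched writes for the historical record."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE live_state (task TEXT PRIMARY KEY, status TEXT)"
        )
        self.conn.execute("CREATE TABLE history (task TEXT, status TEXT)")
        self.pending_history = []  # rows waiting for the next batch write

    def on_event(self, task, status):
        # Live data: commit immediately so later reads never see stale state.
        self.conn.execute(
            "INSERT OR REPLACE INTO live_state VALUES (?, ?)", (task, status)
        )
        self.conn.commit()
        # Historical record: nothing reads it mid-loop, so defer the write.
        self.pending_history.append((task, status))

    def flush_history(self):
        """Write all queued history rows in one transaction."""
        self.conn.executemany(
            "INSERT INTO history VALUES (?, ?)", self.pending_history
        )
        self.conn.commit()
        self.pending_history.clear()

    def live_status(self, task):
        row = self.conn.execute(
            "SELECT status FROM live_state WHERE task = ?", (task,)
        ).fetchone()
        return row[0] if row else None

    def history_count(self):
        return self.conn.execute("SELECT COUNT(*) FROM history").fetchone()[0]


wdb = WorkflowDB()
wdb.on_event("1/foo", "running")
wdb.on_event("1/foo", "succeeded")
status_before_flush = wdb.live_status("1/foo")  # live state is already current
history_before_flush = wdb.history_count()      # history not yet written
wdb.flush_history()                             # e.g. once per main-loop cycle
history_after_flush = wdb.history_count()
```

The trade-off is the one raised in this thread: the live table pays a commit per event, and only the history table keeps the batching gain.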

hjoliver commented 1 month ago

Merging this as the fix is simple and necessary ... the questions are of a wider scope.

oliver-sanders commented 1 month ago

> To avoid this kind of bug I guess we have to either:

A fancy solution would be to allow the unwritten data (i.e. the delta) to be queried, which would let us avoid hitting the DB when not necessary. However, this would not be a small job.
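One way the queryable-delta idea could look is sketched below. This is only an illustration under assumptions: `DeltaStore`, its methods, and the `task_pool` table layout are hypothetical names, not cylc-flow's data model. Pending changes are held in memory and consulted before the DB, so a read sees a removal even before it has been flushed.

```python
import sqlite3

_DELETED = object()  # sentinel marking a row that is pending removal


class DeltaStore:
    """Hypothetical store whose reads consult unwritten changes first."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE task_pool (task TEXT PRIMARY KEY, status TEXT)"
        )
        self.delta = {}  # unwritten changes: task id -> status or _DELETED

    def put(self, task, status):
        self.delta[task] = status

    def delete(self, task):
        self.delta[task] = _DELETED

    def get(self, task):
        # A pending change always wins over what is on disk.
        if task in self.delta:
            value = self.delta[task]
            return None if value is _DELETED else value
        row = self.conn.execute(
            "SELECT status FROM task_pool WHERE task = ?", (task,)
        ).fetchone()
        return row[0] if row else None

    def flush(self):
        """Apply all pending changes in one transaction."""
        for task, value in self.delta.items():
            if value is _DELETED:
                self.conn.execute(
                    "DELETE FROM task_pool WHERE task = ?", (task,)
                )
            else:
                self.conn.execute(
                    "INSERT OR REPLACE INTO task_pool VALUES (?, ?)",
                    (task, value),
                )
        self.conn.commit()
        self.delta.clear()

    def rows_on_disk(self):
        return self.conn.execute(
            "SELECT COUNT(*) FROM task_pool"
        ).fetchone()[0]


store = DeltaStore()
store.put("1/foo", "running")
store.flush()

store.delete("1/foo")                   # removal is pending, not yet written
seen_by_reader = store.get("1/foo")     # the delta hides the stale row
rows_before_flush = store.rows_on_disk()
store.flush()
rows_after_flush = store.rows_on_disk()
```

This sketch covers single-key lookups only. Supporting arbitrary queries over the delta is presumably where most of the work would lie.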

pool: update DB after removing a task #6409