Document meaning of inactivity timeout

cylc / cylc-doc

Documentation (User Guide, Cheat Sheets, etc.) for the Cylc Workflow Engine.

https://cylc.github.io/cylc-doc/

GNU General Public License v3.0

Document meaning of inactivity timeout #684

Closed: hjoliver closed this issue 7 months ago

hjoliver commented 8 months ago

The inactivity timeout is not affected by the presence of unsatisfied xtriggers (including clock triggers), which might be surprising if you abort on inactivity timeout in a clock-triggered workflow.

It seems a bit aggressive to abort on "inactivity" if there are active tasks and/or active xtriggers present.

On the other hand, if we made it more friendly, it might just be equivalent to the stall timeout.

I think this feature was originally added primarily for use in functional tests, where we might assume something has gone wrong if nothing is happening, even if the workflow is not technically stalled.

At the very least, we should document for users exactly what "inactivity" means.

oliver-sanders commented 8 months ago

> On the other hand, if we made it more friendly, it might just be equivalent to the stall timeout.

Exactly! Either inactivity means inactivity or it becomes something else.

I see this as a sys-admin feature rather than a test thing. A workflow could hit its inactivity timeout if it has pending xtriggers or even submitted/running tasks. This is useful because if a workflow is sitting there for long periods of time with pending xtriggers or active tasks, something external to Cylc is going wrong. We set an extremely long P30D timeout at our site to automatically mop up anything that's got itself into a strange state.

So, I'm happy with the status quo. We already have events to cover other situations.

Note, we are currently missing one event: https://github.com/cylc/cylc-flow/issues/4957
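
To make the behaviour described in this thread concrete, here is a toy Python sketch. It is not Cylc's scheduler code: the class `ToyWorkflow`, its fields and its method names are all hypothetical, and it assumes that "activity" means a task event such as a state change or job status message. It only illustrates that the inactivity timer keeps counting while tasks are running or xtriggers are pending.

```python
from dataclasses import dataclass


@dataclass
class ToyWorkflow:
    """Toy model only; hypothetical names, not Cylc's implementation."""

    inactivity_timeout: float    # seconds
    last_activity: float = 0.0   # time of the last task event
    running_tasks: int = 0
    pending_xtriggers: int = 0

    def task_event(self, now: float) -> None:
        # Assumed meaning of "activity": a task state change or job message.
        self.last_activity = now

    def poll_xtriggers(self, now: float) -> None:
        # Checking an unsatisfied xtrigger deliberately does NOT reset
        # the timer in this model.
        pass

    def inactive_too_long(self, now: float) -> bool:
        # Running tasks and pending xtriggers are ignored here on purpose.
        return now - self.last_activity >= self.inactivity_timeout


DAY = 24 * 60 * 60
# A 30-day timeout, like the P30D value mentioned above.
wf = ToyWorkflow(inactivity_timeout=30 * DAY, running_tasks=1, pending_xtriggers=1)
wf.task_event(now=0)

wf.poll_xtriggers(now=29 * DAY)
print(wf.inactive_too_long(now=29 * DAY))

wf.poll_xtriggers(now=30 * DAY)
print(wf.inactive_too_long(now=30 * DAY))
```

In this model the second check reports a timeout even though one task is still running and one xtrigger is still being polled, which is the case a long site-wide timeout is meant to catch.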

hjoliver commented 8 months ago

OK, I agree with that.

In that case, we just need to document exactly what inactivity means. And note what you've just described as the intended use case, with a long timeout.

I'm pretty sure it is not currently defined in the docs. And it's not obvious, for instance, that an unsatisfied xtrigger is "inactive" (it is being actively checked periodically), or that a running task should be considered "inactive" just because it hasn't returned a job status message in a while (it is actively executing).

oliver-sanders commented 8 months ago

The existing docs are here:

https://cylc.github.io/cylc-doc/stable/html/user-guide/writing-workflows/scheduler.html#workflow-events