contact: detect crashed workflows

oliver-sanders commented 2 years ago

Idea of @dpmatthews

At the moment, if a client connection (ZMQ/TCP) fails, we SSH to the scheduler server where the workflow was running and perform a process listing. If the process is found not to be running, we delete the contact file. This permits the workflow to be rerun, whereas before users would have had to hunt these files down manually.
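
For reference, the current behaviour amounts to roughly the following. This is a minimal sketch, not the actual cylc-flow code: `cleanup_if_dead` and the `is_running` callable are hypothetical names, and the SSH process listing is stood in for by whatever check the caller passes in.

```python
from pathlib import Path
from typing import Callable


def cleanup_if_dead(contact_file: Path, is_running: Callable[[], bool]) -> bool:
    """Remove the contact file if the scheduler process is gone.

    Intended to be called after a client connection has already failed.
    `is_running` is a hypothetical stand-in for the SSH + process listing.
    Returns True if the contact file was removed.
    """
    if is_running():
        # The scheduler is alive, so the connection failure has another cause.
        return False
    # The scheduler is dead: remove the stale contact file so that the
    # workflow can be rerun.
    contact_file.unlink(missing_ok=True)
    return True
```

Note that once the file is unlinked there is no record left that the workflow crashed, which is the gap this issue is about.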

Instead of just deleting these contact files, we could provide a command to list crashed workflows, e.g.:

cylc scan --state=crashed
cylc play $(cylc scan --state=crashed)

The UIS could use this information and alert users to crashes. Sysadmins could potentially scan for crashed workflows.

This needs a little thought. For example, if we don't remove the contact file, then any client connections (e.g. cylc message commands from orphaned jobs) will continue to attempt to connect to the workflow, which could cause additional load. Perhaps we would want to mv contact contact.crashed or something like that.
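
A rough sketch of the rename approach, to show how a "list crashed workflows" command could be built on top of it. The function names are hypothetical, and the layout `<root>/<workflow-id>/.service/contact` is an assumption made for the example rather than something this issue specifies.

```python
from pathlib import Path
from typing import List

# Hypothetical marker name, following the "mv contact contact.crashed" idea.
CRASHED_NAME = "contact.crashed"


def mark_crashed(contact_file: Path) -> Path:
    """Rename `contact` to `contact.crashed` instead of deleting it."""
    crashed = contact_file.with_name(CRASHED_NAME)
    contact_file.replace(crashed)
    return crashed


def list_crashed(run_dir_root: Path) -> List[str]:
    """Return IDs of workflows that have a crashed marker.

    Assumes the layout <root>/<workflow-id>/.service/contact.crashed,
    where the workflow ID is the path relative to the root.
    """
    found = []
    for marker in run_dir_root.rglob(CRASHED_NAME):
        if marker.parent.name != ".service":
            continue  # ignore unrelated files that share the name
        workflow_dir = marker.parent.parent
        found.append(workflow_dir.relative_to(run_dir_root).as_posix())
    return sorted(found)
```

Because the file is no longer called `contact`, clients that look for the contact file would fail fast rather than attempting a connection, while the evidence of the crash stays on disk.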

Probably a fairly straightforward feature to implement.

Pull requests welcome!

hjoliver commented 2 years ago

Good idea.

Instead of either removing the contact file (to prevent connection attempts and allow restart), which gets rid of the crash evidence, or leaving it as-is, which will result in useless connection attempts, maybe there's a middle ground: add a line to the contact file to indicate that the server is down? (After that, deliberate removal of the file would be required.)
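
Something along these lines, as a sketch only. It assumes the contact file is a list of KEY=VALUE lines, and `SERVER_DOWN` is a made-up key for illustration, not an existing contact-file field. The helper names are hypothetical too.

```python
from pathlib import Path
from typing import Dict

# Hypothetical marker key, not an existing contact-file field.
DOWN_KEY = "SERVER_DOWN"


def load_contact(contact_file: Path) -> Dict[str, str]:
    """Parse KEY=VALUE lines (assumed format) into a dict."""
    data = {}
    for line in contact_file.read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep:
            data[key.strip()] = value.strip()
    return data


def should_connect(contact_file: Path) -> bool:
    """Clients skip the connection attempt if the marker is present."""
    return load_contact(contact_file).get(DOWN_KEY) != "true"


def mark_down(contact_file: Path) -> None:
    """Append a marker line recording that the server was found dead."""
    if not should_connect(contact_file):
        return  # already marked, keep the file unchanged
    text = contact_file.read_text()
    # Make sure the marker starts on its own line.
    sep = "" if not text or text.endswith("\n") else "\n"
    contact_file.write_text(f"{text}{sep}{DOWN_KEY}=true\n")
```

With this, clients would read the file and give up immediately when the marker is set, and the rest of the contact information (host, port and so on) stays available as crash evidence until someone deliberately removes the file.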

oliver-sanders commented 2 years ago

cylc / cylc-flow

contact: detect crashed workflows #4858