Hierarchical watchdog

ZubairLK commented 5 years ago

Sometimes a systemd service gets stuck in a restart loop when the real fix is to restart something else to bring the application back to a working state.

For example, resin-supervisor.service keeps failing and restarting, but it needs a balena service restart to recover fully.

We need a layered, watchdog-style mechanism that can escalate device recovery attempts step by step, all the way up to potentially rebooting the device.

One caveat: triggering restarts like this might mask underlying bugs. Any action taken by the watchdog needs to be followed by a dump of the logs somewhere (network or storage).
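As a rough illustration of the log dump, here is a minimal Python sketch that writes a unit's journal for the current boot to a file. The function names and the destination directory are hypothetical, and it assumes `journalctl` is available on the device; it is not an existing meta-balena mechanism.

```python
import subprocess
from pathlib import Path
from typing import List


def build_journal_command(unit: str) -> List[str]:
    """Build a journalctl command that prints the unit's logs for the current boot."""
    return ["journalctl", "--unit", unit, "--boot", "--no-pager", "--output", "short-iso"]


def dump_unit_logs(unit: str, dest_dir: Path) -> Path:
    """Write the unit's journal to <dest_dir>/<unit>.log and return that path.

    dest_dir is an assumption: it could be persistent storage or a staging
    directory that something else uploads over the network.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{unit}.log"
    with dest.open("wb") as fh:
        # check=False: a failed dump should not block the recovery path.
        subprocess.run(build_journal_command(unit), stdout=fh, check=False)
    return dest
```

If the recovery action is a reboot, a dump like this would have to run before the reboot is issued, or the logs would need to be on persistent storage so they can be collected afterwards.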

Ideally systemd should provide a mechanism of triggering different actions on different levels of failures. But we don't have that at the moment.

A possible way forward is to have another service that healthdog can trigger (crash), and then use the NRestarts count of that service to trigger something else if needed.
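To make the NRestarts idea more concrete, here is a minimal Python sketch that reads a unit's restart counter with `systemctl show` and maps it onto an escalation ladder. The thresholds, the action names, and the ladder itself are assumptions for illustration only; nothing here is an agreed design.

```python
import subprocess
from typing import List, Optional, Tuple

# Hypothetical escalation ladder: (minimum NRestarts, action name).
# Thresholds and action names are illustrative, not taken from this issue.
ESCALATION_LADDER: List[Tuple[int, str]] = [
    (3, "restart-balena"),
    (6, "reboot"),
]


def parse_nrestarts(output: str) -> int:
    """Parse the output of `systemctl show --value`; treat anything unexpected as 0."""
    text = output.strip()
    return int(text) if text.isdigit() else 0


def read_nrestarts(unit: str) -> int:
    """Read the NRestarts property of a systemd unit."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=NRestarts", "--value"],
        capture_output=True, text=True, check=False,
    )
    return parse_nrestarts(result.stdout)


def choose_action(n_restarts: int,
                  ladder: List[Tuple[int, str]] = ESCALATION_LADDER) -> Optional[str]:
    """Return the action of the highest threshold reached, or None if none is reached."""
    chosen = None
    for threshold, action in sorted(ladder):
        if n_restarts >= threshold:
            chosen = action
    return chosen
```

A caller could run `choose_action(read_nrestarts(unit))` against the service that healthdog crashes and act on the result. Keeping the decision in a pure function makes the ladder easy to test without systemd.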

This needs further investigation.

ZubairLK commented 5 years ago

https://www.flowdock.com/app/rulemotion/r-architecture/threads/M8yBO3CLNZgFBfYTt76Fi016eVM

ZubairLK commented 5 years ago

Listen to the Arch brainstorm from Tuesday, August 27. Seek back from 58 minutes.

balena-os / meta-balena

Hierarchical watchdog #1625