balena-os / meta-balena

A collection of Yocto layers used to build balenaOS images
https://www.balena.io/os

Hierarchical watchdog #1625

Open ZubairLK opened 5 years ago

ZubairLK commented 5 years ago

Sometimes a systemd service gets stuck in a restart loop when it is actually a different service that needs to restart to bring the application back to a working state.

For example, resin-supervisor.service keeps failing and restarting, but a full recovery requires restarting the balena service.

We need a layered, watchdog-style mechanism that can escalate device recovery attempts, all the way up to potentially rebooting the device.

One caveat is that such a mechanism might hide underlying bugs by triggering restarts. Any action taken by the watchdog needs to be followed by a dump of the logs somewhere (network or storage).
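To make the log-dump requirement concrete, here is a minimal sketch that saves a unit's journal for the current boot to a file before any recovery action runs. The helper names (`journal_dump_command`, `dump_path`, `dump_logs`) and the destination directory are hypothetical and not part of meta-balena; the storage location and any network upload are left open, as in the discussion above.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def journal_dump_command(unit: str) -> list:
    """Build a journalctl command that prints this boot's logs for one unit."""
    return ["journalctl", "--unit", unit, "--boot", "--no-pager",
            "--output", "short-iso"]


def dump_path(unit: str, dest_dir: Path, now: datetime) -> Path:
    """Name the dump after the unit and a UTC timestamp (hypothetical scheme)."""
    return dest_dir / f"{unit}-{now:%Y%m%dT%H%M%SZ}.log"


def dump_logs(unit: str, dest_dir: Path) -> Path:
    """Write the unit's journal to dest_dir and return the file path."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dump_path(unit, dest_dir, datetime.now(timezone.utc))
    with dest.open("wb") as out:
        # journalctl writes straight into the file; raise if it fails.
        subprocess.run(journal_dump_command(unit), stdout=out, check=True)
    return dest
```

A watchdog would call something like `dump_logs("resin-supervisor.service", Path("/some/persistent/dir"))` before restarting anything, so the evidence survives the recovery action.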

Ideally systemd should provide a mechanism of triggering different actions on different levels of failures. But we don't have that at the moment.

A possible way forward is to have another service that healthdog can trigger (crash), and then use the NRestarts count of that service to trigger something else if needed.
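A rough sketch of how the NRestarts idea could drive escalation is shown here. It reads the unit's `NRestarts` property with `systemctl show` and maps the count onto a ladder of actions. The thresholds, the unit name `balena.service` as the restart target, and the function names are assumptions for illustration only; when systemd resets the counter would also need checking as part of the investigation.

```python
import subprocess

# Hypothetical escalation ladder: (minimum restart count, command to run).
ESCALATION = [
    (3, ["systemctl", "restart", "balena.service"]),  # assumed unit name
    (6, ["systemctl", "reboot"]),
]


def read_nrestarts(unit: str) -> int:
    """Return systemd's automatic-restart counter for the given unit."""
    result = subprocess.run(
        ["systemctl", "show", "--property=NRestarts", "--value", unit],
        capture_output=True, text=True, check=True,
    )
    # Treat an empty value as zero restarts.
    return int(result.stdout.strip() or 0)


def pick_action(n_restarts: int, ladder=ESCALATION):
    """Pick the highest rung whose threshold has been reached, or None."""
    chosen = None
    for threshold, command in sorted(ladder, key=lambda step: step[0]):
        if n_restarts >= threshold:
            chosen = command
    return chosen
```

For example, `pick_action(read_nrestarts("resin-supervisor.service"))` would return `None` for the first two restarts, the balena restart command from the third, and the reboot command from the sixth. A real implementation would also need to remember which rung it has already tried so that it does not repeat the same action on every check.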

This needs further investigation.

ZubairLK commented 5 years ago

https://www.flowdock.com/app/rulemotion/r-architecture/threads/M8yBO3CLNZgFBfYTt76Fi016eVM

ZubairLK commented 5 years ago

Listen to the Arch brainstorm from Tuesday, August 27. Seek back from 58 minutes.