Sometimes a systemd service gets stuck in a restart loop when something else actually needs to restart to bring the application back to a working state.
For example, resin-supervisor.service keeps failing and restarting, but a full recovery requires restarting the balena service.
We need a layered, watchdog-style mechanism that can escalate recovery attempts step by step, up to and including rebooting the device.
One caveat is that triggering restarts automatically might mask underlying bugs. Any action taken by the watchdog needs to be followed by a dump of the logs somewhere (network or storage).
Ideally, systemd would provide a mechanism for triggering different actions at different levels of failure, but we don't have that at the moment.
A possible way forward is to have another service that healthdog can trigger (crash), and to use the NRestarts value of that service to trigger something else if needed.
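As a rough illustration of the NRestarts idea, the sketch below reads the restart counter of a unit and maps it to an escalation step. It is only a sketch under assumptions: the unit name recovery-canary.service, the thresholds, the balena.service unit name, and the log path are all hypothetical placeholders, not an agreed design. The logs are saved before any action is taken, since a reboot would otherwise make them harder to recover.

```python
import subprocess

# Hypothetical escalation ladder: (minimum NRestarts, action name).
# Thresholds and actions are assumptions for illustration only.
ESCALATION = [
    (3, "restart-balena"),
    (6, "reboot"),
]

# Hypothetical commands for each action; balena.service is an assumed unit name.
COMMANDS = {
    "restart-balena": ["systemctl", "restart", "balena.service"],
    "reboot": ["systemctl", "reboot"],
}

# Hypothetical location for the log dump (could equally be a network target).
LOG_PATH = "/var/tmp/watchdog-dump.log"


def parse_nrestarts(output: str) -> int:
    """Parse `systemctl show` output such as "NRestarts=4" (or a bare "4")."""
    text = output.strip()
    if "=" in text:
        text = text.split("=", 1)[1]
    return int(text)


def choose_action(nrestarts: int, ladder=ESCALATION):
    """Return the action for the highest threshold reached, or None."""
    chosen = None
    for threshold, action in sorted(ladder):
        if nrestarts >= threshold:
            chosen = action
    return chosen


def read_nrestarts(unit: str) -> int:
    """Ask systemd for the unit's NRestarts property."""
    result = subprocess.run(
        ["systemctl", "show", "--property=NRestarts", unit],
        check=True, capture_output=True, text=True,
    )
    return parse_nrestarts(result.stdout)


def dump_logs(unit: str, path: str) -> None:
    """Save the unit's journal for the current boot to a file."""
    with open(path, "w") as log_file:
        subprocess.run(
            ["journalctl", "--unit", unit, "--boot", "--no-pager"],
            check=True, stdout=log_file,
        )


if __name__ == "__main__":
    unit = "recovery-canary.service"  # hypothetical unit that healthdog crashes
    action = choose_action(read_nrestarts(unit))
    if action is not None:
        dump_logs(unit, LOG_PATH)  # dump logs first, then act
        subprocess.run(COMMANDS[action], check=True)
```

Keeping the threshold logic in a pure function (choose_action) means the escalation policy can be tested without a running systemd; only read_nrestarts and dump_logs touch the system.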
This needs further investigation.