EVerest / everest-core


Watchdog in EVerest #539

Open corneliusclaussen opened 9 months ago

corneliusclaussen commented 9 months ago

Right now the manager exits EVerest whenever a child (module) dies. systemd then usually restarts EVerest to recover. If a module hangs in a command handler, the framework will also time out and exit the module process, which results in a restart of EVerest. We recently added some deadlock-detecting mutexes to EvseManager as well, to ensure restarts in case a module hangs somewhere.

There is, however, no generic watchdog functionality that can be used inside a module to check that a long-running thread (e.g. a main-loop thread or an IO thread) is still alive. As a result, some threads in some modules may still hang without causing the module process to exit.

We should ensure that such scenarios will restart EVerest.
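For reference, the restart-on-exit behaviour described above is typically wired up in the systemd service file. The following is only an illustrative sketch, not the actual unit file shipped with EVerest; paths, names, and values are assumptions:

```ini
# everest.service (illustrative sketch, not the shipped unit file)
[Unit]
Description=EVerest manager

[Service]
ExecStart=/usr/bin/manager --config /etc/everest/config.yaml
# Restart EVerest whenever the manager exits, e.g. because a module died
Restart=always
RestartSec=2
# Optional software watchdog: the manager must call sd_notify("WATCHDOG=1")
# at least every 30 s, otherwise systemd restarts the service
Type=notify
WatchdogSec=30

[Install]
WantedBy=multi-user.target
```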

corneliusclaussen commented 9 months ago

Work has started in https://github.com/EVerest/everest-core/pull/514 and corresponding PRs in framework and utils. Each module now has a WatchdogSupervisor that can be used within the module to register watchdogs for individual threads. The plan is to extend this further, so that the full chain down to the hardware watchdog is covered:

1. A module registers a watchdog with its WatchdogSupervisor.
2. The supervisor checks in its own thread whether the target thread is still alive.
3. The supervisor thread itself sends MQTT pings to the manager.
4. The manager ensures that the supervisor threads of all modules are running and that MQTT communication still works.
5. The manager (optionally) sends watchdog pings to systemd (this needs to be set up in the service file).
6. systemd itself uses the hardware watchdog device to ensure it is still alive.