Add watchdog

Smegheid commented 2 years ago

Since the water heater control system has safety implications if it doesn't regulate, it's probably a good idea to have a watchdog as a backup.

The simplest thing I can think of is a script that runs as a one-shot affair that then gets invoked periodically from cron. That way we're not dependent on the watchdog process running continually.

The watchdog should probably check that:

- The control process is still running.
- The published status has not expired.
  - Maybe just use statserv? We get expiry for free that way, but that's another degree of separation from the control loop.
  - Otherwise, look at the file. That's direct output from the control process, but we'd need to compare its timestamp against the current time. Not that big a deal, I guess.
- The claimed state of the pump matches the actual state.
  - Like status expiry, where does the claimed state come from? Statserv or the file?
  - Nice and cheap to check now that we're on GPIO.
- The pump is off outside of the usual sunshine hours.
  - Should probably offset the morning and evening times so that the watchdog isn't trigger-happy. The control process isn't strictly periodic: it doesn't execute at set times and is only expected to get round to reacting to things in a reasonable amount of time. Maybe offset each by 10 minutes, just to be really sure.
- Maybe that the tank temperature hasn't exceeded some absolute limit?
  - Might be complicated. The watchdog must not touch the A/D converter directly: the Python adc script doesn't attempt any form of mutual exclusion and assumes that only one instance is running at a time. We'd be beholden to the status published by the control script, and if the watchdog doesn't trust the control process, does it make sense to trust its status info?
  - If this gets added, the threshold needs to be a significant chunk of a degree higher than the maximum set in the config file. When the tank is around the limit and the pump is off, the panel has likely been sitting in direct sunlight for a while. When the pump is then turned on, we get a slug of really hot water (sometimes 90C) through the return that's enough to heat the tank a tenth or two above the limit.
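
The first few checks could be sketched as a pure function over a status snapshot. This is only a rough sketch, not the actual watchdog: the `Status` fields, the 30-second expiry and the function name are all assumptions for illustration. Where the status comes from (statserv or the file) and how the GPIO is read are left to the caller.

```python
from dataclasses import dataclass


@dataclass
class Status:
    """Hypothetical snapshot of what the control process last published."""
    timestamp: float  # epoch seconds when the status was written
    pump_on: bool     # pump state claimed by the control process


def find_problems(status, now, actual_pump_on, window_start, window_end,
                  max_age_s=30.0):
    """Return a list of problems; an empty list means everything looks fine.

    window_start and window_end are epoch seconds bounding the hours in
    which the pump may run, already widened by the watchdog's margin.
    max_age_s is an assumed expiry, not a value from the real config.
    """
    problems = []
    if now - status.timestamp > max_age_s:
        problems.append("status is stale")
    if status.pump_on != actual_pump_on:
        problems.append("claimed pump state does not match GPIO")
    if actual_pump_on and not (window_start <= now <= window_end):
        problems.append("pump is on outside sunshine hours")
    return problems
```

Keeping the checks free of I/O means they can be exercised with made-up values before the watchdog is let anywhere near the real pump.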

In all cases, if a problem is found the watchdog should probably kill off the control process (if running) as we no longer have confidence in it, explicitly stop the pump and then restart the controller.
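
A rough sketch of that recovery sequence follows. The three callables are hypothetical hooks for whatever actually kills the controller, drives the pump GPIO and starts the controller again. The one design point it tries to show is that the pump gets stopped even if killing the control process fails. Not restarting after a failed kill is my assumption, made because the adc script expects only one instance to be running.

```python
def recover(kill_controller, stop_pump, start_controller):
    """Kill the controller, force the pump off, then restart the controller.

    All three arguments are hypothetical callables supplied by the caller.
    """
    try:
        kill_controller()
    finally:
        # Runs whether or not the kill succeeded: the pump must end up off.
        stop_pump()
    # Only reached if the kill did not raise, so we never start a second
    # controller alongside one that refused to die.
    start_controller()
```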

Thoughts on each check are above as sub-bullets.

Smegheid commented 2 years ago

One thought before getting started: water_pump state will read back the state of the pump control. However, for the purposes of the watchdog, what are the odds of a race condition where the control process has toggled the pump but has not yet updated the status file to reflect that?

The window is probably fairly small: the control process probably turns off the relay and then updates the status file within a few tens of ms, while the watchdog only runs once a minute from cron. However, you know what they say: million-to-one chances crop up nine times out of ten.

Not sure if I'm happy about that, and not currently sure what to do about it.

Smegheid commented 2 years ago

Thinking out loud before I'm done for the day and to remind myself once I get started again later: what if the watchdog were to wait on an update to the status file?

The control loop only updates the status once it's done making decisions for that pass. It then goes to sleep for a decent length of time before going again. If the watchdog were to wait for the status file to be updated, then that would accomplish several things:

- Validation that the control process is running.
- Validation that the control process is still sane if the file contains reasonable-looking information.
- Validation that the status information is not stale.
- Assuming inotifywait returns quickly after the status file is updated, it greatly reduces the chance of the pump state check race condition occurring.

This looks like it's fairly easy to accomplish. inotifywait takes a --timeout option in seconds and exits if the file isn't changed in that window. The control process repeats every 10 sec, so if we set the timeout to a couple of times that (say 20 sec), a sane control process must update the status within the window.

Yeah, I think I like the sound of that. It solves several problems at once.
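
If the watchdog ended up in Python, the wait could be a thin wrapper around inotifywait, sketched below. The function names and the 20-second default are mine. Watching for `close_write` is an assumption about how the control process writes the status file. The exit codes are the ones given in the inotifywait man page: 0 when a watched event occurred, 2 when the timeout elapsed.

```python
import subprocess

EXIT_EVENT = 0    # a watched event occurred
EXIT_TIMEOUT = 2  # --timeout elapsed with no event


def inotifywait_cmd(path, timeout_s):
    """Build the command line; the close_write event is an assumption."""
    return ["inotifywait", "--quiet", "--timeout", str(timeout_s),
            "--event", "close_write", path]


def classify(returncode):
    """Map an inotifywait exit code to 'updated', 'timeout' or 'error'."""
    if returncode == EXIT_EVENT:
        return "updated"
    if returncode == EXIT_TIMEOUT:
        return "timeout"
    return "error"


def wait_for_status(path, timeout_s=20):
    """Block until the status file is rewritten or the timeout expires."""
    result = subprocess.run(inotifywait_cmd(path, timeout_s),
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return classify(result.returncode)
```

Anything other than "updated" would count as a failed check. One caveat: if the control process replaces the status file by renaming a temporary file over it, the watch would probably need to go on the containing directory instead.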

Other thoughts:

- It's trivial to modify the existing morning/evening cutoff times with the likes of date -d "5pm + 10 min" +%s.
- One of the afternoon pump on-then-quickly-off-again events today around 3pm managed to bump the tank to 56.0C. The return briefly showed 80C, and the pump was on for a grand total of about 50 sec before being turned off again. Even better, the pump was turned off at a tank temperature of 55.7C, but the temperature continued to rise for over a minute after that. Basically, yeah, the watchdog needs a max tank temperature threshold significantly above the maximum that the control loop tries to hit, otherwise the watchdog is going to be yelping and killing the control process prematurely.
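
Both thoughts boil down to a little arithmetic, sketched here in Python. The function names are made up, and widening the morning cutoff earlier (rather than later) is my reading of "offset each by 10 minutes". The datetime maths mirrors what date -d "5pm + 10 min" +%s does, using naive local times.

```python
from datetime import date, datetime, time, timedelta


def watchdog_window(day, morning, evening, margin=timedelta(minutes=10)):
    """Return (start, end) epoch seconds for the hours the pump may be on.

    The control loop's own cutoffs are widened by `margin` on both sides,
    so the watchdog only complains once the controller is clearly late.
    Times are naive and interpreted in the local timezone, like date -d.
    """
    start = datetime.combine(day, morning) - margin
    end = datetime.combine(day, evening) + margin
    return start.timestamp(), end.timestamp()


def tank_over_limit(tank_c, config_max_c, margin_c):
    """True only when the tank exceeds the configured maximum plus a margin."""
    return tank_c > config_max_c + margin_c
```

With a hypothetical configured maximum of 55.5C and a 1C margin, the 56.0C overshoot described above would not trip the watchdog, while a genuine runaway would.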

Smegheid / water_heater

Add watchdog #1