Open Smegheid opened 2 years ago
One thought before getting started: `water_pump state` will read back the state of the pump control. However, for the purposes of the watchdog, what are the odds of a race condition where the control process has toggled the pump but has not yet updated the status file to reflect that?
The window is probably fairly small: the control process probably turns off the relay and then updates the status file within a few tens of ms, and the watchdog will only run once a minute from cron. However, you know what they say: million-to-one chances crop up nine times out of ten.
Not sure if I'm happy about that, and not currently sure what to do about it.
Thinking out loud before I'm done for the day and to remind myself once I get started again later: what if the watchdog were to wait on an update to the status file?
The control loop only updates the status once it's done making decisions for that pass. It then goes to sleep for a decent length of time before going again. If the watchdog were to wait for the status file to be updated, then that would accomplish several things:
`inotifywait` returns quickly after the status file is updated, which greatly reduces the chance of the pump state check race condition occurring.
This looks like it's fairly easy to accomplish. `inotifywait` can take a `--timeout` option in seconds, where it'll exit if the file isn't changed in that window. The control process repeats every 10 sec, so if we set that timeout to a couple of times that, a sane control process must update the status in that window.
Yeah, I think I like the sound of that. It solves several problems at once.
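Rough sketch of what that wait might look like, just to pin the idea down. The status file path, the 20 sec timeout (two control passes) and the `close_write` event are all assumptions on my part; `close_write` only fires if the control process rewrites the file in place, so a write-to-temp-and-rename scheme would need to watch the directory instead.

```shell
#!/bin/sh
# Sketch only: STATUS_FILE and WAIT_TIMEOUT are assumed values, not taken
# from the real control script.
STATUS_FILE="${STATUS_FILE:-/run/water_heater/status}"
WAIT_TIMEOUT="${WAIT_TIMEOUT:-20}"   # two passes of the 10 sec control loop

# Block until the control process finishes writing the status file.
# Passes inotifywait's exit status straight through:
#   0 = the file was updated within the window
#   1 = inotifywait hit an error (e.g. the file does not exist)
#   2 = the timeout expired with no update
wait_for_status_update() {
    inotifywait --quiet --timeout "$WAIT_TIMEOUT" --event close_write \
        "$STATUS_FILE" >/dev/null
}
```

The watchdog would call `wait_for_status_update` first and only go on to compare `water_pump state` against the status file if it returns 0. Anything else would mean the control process isn't updating its status and shouldn't be trusted.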
Other thoughts:
`date -d "5pm + 10 min" +%s`.
Since the water heater control system has safety implications if it doesn't regulate, it's probably a good idea to have a watchdog as a backup.
The simplest thing I can think of is a script that runs as a one-shot affair that then gets invoked periodically from cron. That way we're not dependent on the watchdog process running continually.
The watchdog should probably check that:
`adc` script doesn't attempt any form of mutual exclusion and assumes that only one instance is running at a time. We'd be beholden to the status published by the control script, and if the watchdog isn't trusting that, does it make sense to trust its status info?
In all cases, if a problem is found, the watchdog should probably kill off the control process (if running), as we no longer have confidence in it, explicitly stop the pump, and then restart the controller.
Thoughts on each check are above as sub-bullets.
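For the recovery step (kill the controller, stop the pump, restart), something along these lines might do. The controller name `heater_control` and the `water_pump off` subcommand are guesses for illustration; only `water_pump state` is known to exist. Matching with `pkill -x` also assumes the controller's process name is exactly that string.

```shell
#!/bin/sh
# Sketch only: CONTROL_CMD and `water_pump off` are assumed names.
CONTROL_CMD="${CONTROL_CMD:-heater_control}"

recover() {
    # 1. Kill the controller if it is running. pkill exits non-zero when
    #    nothing matched, which is not an error here.
    pkill -x "$CONTROL_CMD" || true

    # 2. Stop the pump explicitly instead of relying on whatever the
    #    controller last did. Give up if that fails; what to do in that
    #    case is still an open question.
    water_pump off || return 1

    # 3. Restart the controller in the background so the one-shot
    #    watchdog can exit and leave it running.
    "$CONTROL_CMD" >/dev/null 2>&1 &
}
```

Stopping the pump before the restart means the new controller instance starts from a known-safe state, whatever the old one was doing when it was killed.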