OpenZWave / qt-openzwave

QT5 Wrapper for OpenZWave
GNU Lesser General Public License v3.0
105 stars 30 forks source link

Improved startup behavior when MQTT broker is unavailable (#120) #121

Closed kpine closed 4 years ago

kpine commented 4 years ago

When the MQTT broker becomes unavailable, a running ozwdaemon will exit (by design). For the standalone container, a restart policy of always or on-failure will result in ozwdaemon aggressively restarting with bad behavior. If no restart policy is set, the container will stay stopped until the user intervenes. For the allinone container, supervisord handles the role of the docker restart policy and attempts to restart ozwdaemon, but sometimes fails (for both of those, see #120).

The startup/teardown of ozwdaemon is fairly expensive compared to alternative approaches. It also currently results in several different crashes. In this change we use bash redirection to detect if the MQTT broker connection is even available before starting ozwdaemon (inspired by the HA ozw addon). The startup script will wait for a configurable amount of time until either a) the MQTT broker is accessible and ozwdaemon will be started or b) the timeout expires and the startup script will exit with an error. This won't guarantee the connection to the broker will be successful, but it handles more gracefully the case where the MQTT server is down for a significant amount of time.

It might make sense for ozwdaemon to have this kind of behavior built-in. Perhaps that would be a better long term fix as part of #20?

In my testing this avoids the problems seen in #120, although it's not really a solution for the crashes. With a restart policy the standalone container will restart forever until MQTT is started. For the allinone container, the supervisor will attempt to restart ozwdaemon until MQTT is started. It seemed to try forever. I am not sure if that's the expected behavior, or if it should cause the container to exit after some number of tries?

Docker logs from standalone:

2020-07-05T07:23:38.899724516Z Waiting until the MQTT broker is available...
2020-07-05T07:24:08.940484619Z Could not connect to the MQTT broker after 30 seconds
2020-07-05T07:24:10.617970853Z Waiting until the MQTT broker is available...
2020-07-05T07:24:40.619099450Z Could not connect to the MQTT broker after 30 seconds
2020-07-05T07:24:42.248942391Z Waiting until the MQTT broker is available...
2020-07-05T07:25:12.262416612Z Could not connect to the MQTT broker after 30 seconds
2020-07-05T07:25:14.448987769Z Waiting until the MQTT broker is available...
2020-07-05T07:25:44.514987293Z Could not connect to the MQTT broker after 30 seconds
2020-07-05T07:25:46.725733846Z Waiting until the MQTT broker is available...
2020-07-05T07:25:56.025717304Z Executing: ozwdaemon -s /dev/ttyUSB0 --config-dir /opt/ozw/config --user-dir /opt/ozw/user --mqtt-server dev-mqtt --mqtt-port 1883 --stop-on-failure --mqtt-instance 2

Docker logs from allinone. I don't really understand the reaped unknown pid logs. Am I exiting on error incorrectly?

2020-07-05 00:40:39,675 INFO Set uid to user 0 succeeded
2020-07-05 00:40:39,679 INFO supervisord started with pid 1
2020-07-05 00:40:40,689 INFO spawned: 'supervisord-watchdog' with pid 9
2020-07-05 00:40:40,692 INFO spawned: 'ozwdaemon' with pid 10
2020-07-05 00:40:40,695 INFO spawned: 'ozw-admin' with pid 11
2020-07-05 00:40:40,698 INFO spawned: 'websockify' with pid 12
2020-07-05 00:40:42,278 INFO success: supervisord-watchdog entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-07-05 00:40:42,279 INFO success: ozw-admin entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-07-05 00:40:42,279 INFO success: websockify entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-07-05 00:40:50,904 INFO success: ozwdaemon entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2020-07-05 00:41:28,316 INFO exited: ozwdaemon (exit status 255; not expected)
2020-07-05 00:41:29,327 INFO spawned: 'ozwdaemon' with pid 22
2020-07-05 00:41:39,446 INFO success: ozwdaemon entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2020-07-05 00:41:59,387 INFO exited: ozwdaemon (exit status 124; not expected)
2020-07-05 00:41:59,387 INFO reaped unknown pid 54 (terminated by SIGTERM)
2020-07-05 00:41:59,391 INFO spawned: 'ozwdaemon' with pid 55
2020-07-05 00:42:09,435 INFO success: ozwdaemon entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2020-07-05 00:42:29,421 INFO reaped unknown pid 87 (terminated by SIGTERM)
2020-07-05 00:42:29,421 INFO exited: ozwdaemon (exit status 124; not expected)
2020-07-05 00:42:29,426 INFO spawned: 'ozwdaemon' with pid 88
2020-07-05 00:42:39,504 INFO success: ozwdaemon entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2020-07-05 00:42:59,445 INFO reaped unknown pid 120 (terminated by SIGTERM)
2020-07-05 00:42:59,446 INFO exited: ozwdaemon (exit status 124; not expected)
2020-07-05 00:42:59,450 INFO spawned: 'ozwdaemon' with pid 121
2020-07-05 00:43:09,489 INFO success: ozwdaemon entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2020-07-05 00:43:29,461 INFO reaped unknown pid 153 (terminated by SIGTERM)
2020-07-05 00:43:29,462 INFO exited: ozwdaemon (exit status 124; not expected)
2020-07-05 00:43:29,467 INFO spawned: 'ozwdaemon' with pid 154
2020-07-05 00:43:39,598 INFO success: ozwdaemon entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2020-07-05 00:43:59,479 INFO exited: ozwdaemon (exit status 124; not expected)
2020-07-05 00:43:59,480 INFO reaped unknown pid 186 (terminated by SIGTERM)
2020-07-05 00:43:59,484 INFO spawned: 'ozwdaemon' with pid 187
2020-07-05 00:44:09,644 INFO success: ozwdaemon entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2020-07-05 00:44:29,497 INFO exited: ozwdaemon (exit status 124; not expected)
2020-07-05 00:44:29,497 INFO reaped unknown pid 218 (terminated by SIGTERM)
2020-07-05 00:44:29,503 INFO spawned: 'ozwdaemon' with pid 219
2020-07-05 00:44:39,700 INFO success: ozwdaemon entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2020-07-05 00:44:59,530 INFO reaped unknown pid 251 (terminated by SIGTERM)
2020-07-05 00:44:59,531 INFO exited: ozwdaemon (exit status 124; not expected)
2020-07-05 00:44:59,535 INFO spawned: 'ozwdaemon' with pid 252
2020-07-05 00:45:09,628 INFO success: ozwdaemon entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2020-07-05 00:45:29,607 INFO reaped unknown pid 284 (terminated by SIGTERM)
2020-07-05 00:45:29,608 INFO exited: ozwdaemon (exit status 124; not expected)
2020-07-05 00:45:29,613 INFO spawned: 'ozwdaemon' with pid 285
2020-07-05 00:45:39,812 INFO success: ozwdaemon entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2020-07-05 00:45:59,630 INFO exited: ozwdaemon (exit status 124; not expected)
2020-07-05 00:45:59,630 INFO reaped unknown pid 317 (terminated by SIGTERM)
2020-07-05 00:45:59,633 INFO spawned: 'ozwdaemon' with pid 318
2020-07-05 00:46:09,655 INFO success: ozwdaemon entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2020-07-05 00:46:29,646 INFO reaped unknown pid 350 (terminated by SIGTERM)
2020-07-05 00:46:29,646 INFO exited: ozwdaemon (exit status 124; not expected)
2020-07-05 00:46:29,651 INFO spawned: 'ozwdaemon' with pid 351
2020-07-05 00:46:39,690 INFO success: ozwdaemon entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2020-07-05 00:46:59,723 INFO exited: ozwdaemon (exit status 124; not expected)
2020-07-05 00:46:59,723 INFO reaped unknown pid 382 (terminated by SIGTERM)
2020-07-05 00:46:59,728 INFO spawned: 'ozwdaemon' with pid 383
2020-07-05 00:47:09,779 INFO success: ozwdaemon entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2020-07-05 00:47:29,740 INFO reaped unknown pid 414 (terminated by SIGTERM)
2020-07-05 00:47:29,740 INFO exited: ozwdaemon (exit status 124; not expected)
2020-07-05 00:47:29,746 INFO spawned: 'ozwdaemon' with pid 415
2020-07-05 00:47:39,793 INFO success: ozwdaemon entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2020-07-05 00:47:59,761 INFO reaped unknown pid 447 (terminated by SIGTERM)
2020-07-05 00:47:59,761 INFO exited: ozwdaemon (exit status 124; not expected)
2020-07-05 00:47:59,765 INFO spawned: 'ozwdaemon' with pid 448
2020-07-05 00:48:09,892 INFO success: ozwdaemon entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)

ozwdaemon.log when it was waiting for the broker:

Waiting until the MQTT broker is available...
Could not connect to the MQTT broker after 30 seconds
Waiting until the MQTT broker is available...
Could not connect to the MQTT broker after 30 seconds
Waiting until the MQTT broker is available...
Could not connect to the MQTT broker after 30 seconds
Waiting until the MQTT broker is available...
Could not connect to the MQTT broker after 30 seconds
Waiting until the MQTT broker is available...
Could not connect to the MQTT broker after 30 seconds
Waiting until the MQTT broker is available...
Could not connect to the MQTT broker after 30 seconds
Waiting until the MQTT broker is available...
Could not connect to the MQTT broker after 30 seconds
Waiting until the MQTT broker is available...
Could not connect to the MQTT broker after 30 seconds
Waiting until the MQTT broker is available...
Could not connect to the MQTT broker after 30 seconds
Waiting until the MQTT broker is available...
Could not connect to the MQTT broker after 30 seconds
Waiting until the MQTT broker is available...
Could not connect to the MQTT broker after 30 seconds
Waiting until the MQTT broker is available...
Could not connect to the MQTT broker after 30 seconds
Waiting until the MQTT broker is available...
Could not connect to the MQTT broker after 30 seconds
Waiting until the MQTT broker is available...
Executing: ozwdaemon -s /dev/ttyUSB0 --config-dir /opt/ozw/config --user-dir /opt/ozw/user --mqtt-server dev-mqtt --mqtt-port 1883 --stop-on-failure --mqtt-instance 2

Also added documentation for the new environment variable to set the connect timeout, as well as other misc related cleanup changes.

Fishwaldo commented 4 years ago

Thanks - So I want to explore the S6 supervisor that you mentioned in another issue. I see this supervisor a bit fragile with regards to the watchdog script...

I've also been exploring the build in Docker HEALTHCHECK option which seems like a candidate to improve reliability.

kpine commented 4 years ago

The hassio addon for ozw is a pretty good place to start, it's where I stole the idea of waiting on the mqtt port. Looks pretty simple.

0o0-TheDude-0o0 commented 3 years ago

So I agree with this. ozw should wait for mqtt host. This saves the network from much downtime.