OpenZWave / qt-openzwave

QT5 Wrapper for OpenZWave
GNU Lesser General Public License v3.0
105 stars 30 forks source link

Supervisor watchdog fails #94

Closed kpine closed 4 years ago

kpine commented 4 years ago

For some reason my ozwdaemon process stopped. I'm not sure what happened, all the logs prior seem to be missing (not sure if that's a Docker fault) which is weird.

Anyways, the watchdog fails for whatever reason and the container is left in limbo. This is the allinone image so ozw-admin is still running. Normall, if ozwdaemon exited the docker restart policy would restart it, but that can't happen in this case.

$ docker logs ha-zwave_ozwd_1
2020-06-02 09:48:44,173 INFO Set uid to user 0 succeeded
2020-06-02 09:48:44,175 INFO supervisord started with pid 1
2020-06-02 09:48:45,178 INFO spawned: 'supervisord-watchdog' with pid 9
2020-06-02 09:48:45,181 INFO spawned: 'ozwdaemon' with pid 10
2020-06-02 09:48:45,184 INFO spawned: 'ozw-admin' with pid 11
2020-06-02 09:48:45,187 INFO spawned: 'websockify' with pid 12
2020-06-02 09:48:45,188 INFO exited: supervisord-watchdog (exit status 127; not expected)
2020-06-02 09:48:47,072 INFO spawned: 'supervisord-watchdog' with pid 14
2020-06-02 09:48:47,072 INFO success: ozw-admin entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-06-02 09:48:47,073 INFO success: websockify entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-06-02 09:48:48,165 INFO success: supervisord-watchdog entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-06-02 09:48:48,166 INFO exited: supervisord-watchdog (exit status 127; not expected)
2020-06-02 09:48:48,837 INFO spawned: 'supervisord-watchdog' with pid 18
2020-06-02 09:48:48,875 INFO exited: supervisord-watchdog (exit status 127; not expected)
2020-06-02 09:48:49,877 INFO spawned: 'supervisord-watchdog' with pid 21
2020-06-02 09:48:49,938 INFO exited: supervisord-watchdog (exit status 127; not expected)
2020-06-02 09:48:51,958 INFO spawned: 'supervisord-watchdog' with pid 22
2020-06-02 09:48:52,759 INFO exited: supervisord-watchdog (exit status 127; not expected)
2020-06-02 09:48:55,549 INFO success: ozwdaemon entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2020-06-02 09:48:56,552 INFO spawned: 'supervisord-watchdog' with pid 23
2020-06-02 09:48:56,658 INFO exited: supervisord-watchdog (exit status 127; not expected)
2020-06-02 09:48:57,661 INFO gave up: supervisord-watchdog entered FATAL state, too many start retries too quickly
2020-06-03 06:59:02,023 INFO exited: ozwdaemon (exit status 0; expected)
kpine commented 4 years ago

Regarding the logs question, just notice the ozwdaemon logs are now stored in the config directory.

kpine commented 4 years ago

So my ozwdaemon stopped because of a broker error. I know it does that intentionally.

[20200603 0:27:48.940 PDT] [ozw.mqtt.publisher] [debug]: "Wed Jun 3 00:27:48 2020 PingResponse\n"
[20200603 0:28:48.933 PDT] [ozw.mqtt.publisher] [debug]: "Wed Jun 3 00:28:48 2020 PingResponse\n"
[20200603 6:58:57.121 PDT] [ozw.mqtt.publisher] [debug]: MQTT State Change 0
[20200603 6:58:57.121 PDT] [ozw.mqtt.publisher] [warning]: Exiting on Failure
[20200603 6:58:57.121 PDT] [ozw.mqtt.publisher] [debug]: Disconnnected
[20200603 6:58:57.121 PDT] [ozw.mqtt.publisher] [warning]: Broker Error 256

I am assuming that because the supervisord-watchdog entered FATAL state, it stopped monitoring ozwdaemon, thus there was never a restart?

Fishwaldo commented 4 years ago

Its bad when the watchdog needs a watchdog :) I've just added a Logfile to the watchdog process. If it happens again, we can see whats going on (I hope!)

kpine commented 4 years ago

Seems to be a problem when the process has an "expected" exit, such has losing connection to the MQTT broker.

2020-06-04 14:50:32,033 INFO Set uid to user 0 succeeded
2020-06-04 14:50:32,035 INFO supervisord started with pid 1
2020-06-04 14:50:33,042 INFO spawned: 'supervisord-watchdog' with pid 10
2020-06-04 14:50:33,045 INFO spawned: 'ozwdaemon' with pid 11
2020-06-04 14:50:33,049 INFO spawned: 'ozw-admin' with pid 12
2020-06-04 14:50:33,053 INFO spawned: 'websockify' with pid 13
2020-06-04 14:50:33,053 INFO exited: supervisord-watchdog (exit status 127; not expected)
2020-06-04 14:50:34,531 INFO spawned: 'supervisord-watchdog' with pid 15
2020-06-04 14:50:34,532 INFO success: ozw-admin entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-06-04 14:50:34,532 INFO success: websockify entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-06-04 14:50:34,604 INFO exited: supervisord-watchdog (exit status 127; not expected)
2020-06-04 14:50:36,611 INFO spawned: 'supervisord-watchdog' with pid 16
2020-06-04 14:50:36,621 INFO exited: supervisord-watchdog (exit status 127; not expected)
2020-06-04 14:50:39,967 INFO spawned: 'supervisord-watchdog' with pid 22
2020-06-04 14:50:40,037 INFO exited: supervisord-watchdog (exit status 127; not expected)
2020-06-04 14:50:40,039 INFO gave up: supervisord-watchdog entered FATAL state, too many start retries too quickly
2020-06-04 14:50:43,090 INFO success: ozwdaemon entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2020-06-04 14:53:28,682 INFO exited: ozwdaemon (terminated by SIGKILL; not expected)
2020-06-04 14:53:30,753 INFO spawned: 'ozwdaemon' with pid 29
2020-06-04 14:53:39,763 INFO success: ozwdaemon entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2020-06-04 14:54:22,722 INFO exited: ozwdaemon (exit status 0; expected)

Notice that killing the process with SIGKILL caused a restart, but when I stopped the mqtt server it exited "gracefully" and will never come back.

kpine commented 4 years ago

FYI, the above was with the image openzwave/ozwdaemon:allinone-build-10. It created a watchdog.log but it's empty.

Fishwaldo commented 4 years ago

supervisor-watchdog should be fixed now, and thus the container exit if ozwdaemon exits for any reason.

kpine commented 4 years ago

The docker logs look better, no more exit status 127 errors, and the watchdog.log contains READY. I'll know the result tomorrow.

kpine commented 4 years ago

I checked my container this morning and ozwdaemon exited but the container did not restart, so the fix does not appear to be working.

$ docker ps | grep ozwd
8d70cc719275        openzwave/ozwdaemon:allinone-build-18   "/opt/entrypoint.sh"     24 hours ago        Up 23 hours         1983/tcp, 5901/tcp, 0.0.0.0:7800->7800/tcp   ha-zwave_ozwd_1

$ docker logs --tail 20 8d70cc719275
2020-06-08 08:22:50,969 INFO success: websockify entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-06-08 08:23:00,105 INFO success: ozwdaemon entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2020-06-08 08:27:28,048 WARN received SIGTERM indicating exit request
2020-06-08 08:27:28,049 INFO waiting for supervisord-watchdog, ozwdaemon, ozw-admin, websockify to die
2020-06-08 08:27:29,097 INFO stopped: websockify (exit status 0)
2020-06-08 08:27:29,102 INFO stopped: ozw-admin (terminated by SIGTERM)
2020-06-08 08:27:31,260 INFO waiting for supervisord-watchdog, ozwdaemon to die
2020-06-08 08:27:32,363 INFO stopped: ozwdaemon (exit status 0)
2020-06-08 08:27:32,367 INFO stopped: supervisord-watchdog (terminated by SIGTERM)
2020-06-08 09:22:09,719 INFO Set uid to user 0 succeeded
2020-06-08 09:22:09,722 INFO supervisord started with pid 1
2020-06-08 09:22:10,727 INFO spawned: 'supervisord-watchdog' with pid 9
2020-06-08 09:22:10,729 INFO spawned: 'ozwdaemon' with pid 10
2020-06-08 09:22:10,732 INFO spawned: 'ozw-admin' with pid 11
2020-06-08 09:22:10,747 INFO spawned: 'websockify' with pid 12
2020-06-08 09:22:12,656 INFO success: supervisord-watchdog entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-06-08 09:22:12,657 INFO success: ozw-admin entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-06-08 09:22:12,657 INFO success: websockify entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-06-08 09:22:20,744 INFO success: ozwdaemon entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2020-06-09 07:16:39,694 INFO exited: ozwdaemon (exit status 0; expected)

$ cat watchdog.log
READY
READY

$ tail ozwdaemon.log
[20200609 7:16:38.772 PDT] [ozw.library] [debug]: Debug - Node: 0 Value Deleted
[20200609 7:16:38.772 PDT] [ozw.library] [debug]: Debug - Node: 0 Value Deleted
[20200609 7:16:38.772 PDT] [ozw.library] [debug]: Debug - Node: 0 Value Deleted
[20200609 7:16:38.793 PDT] [ozw.library] [info]: Info - Node: 0 mgr,     Driver for controller /dev/ttyUSB0 removed
[20200609 7:16:38.793 PDT] [ozw.manager] [debug]: Driver Removed for  "/dev/ttyUSB0"
[20200609 7:16:38.793 PDT] [ozw.manager] [debug]: OZW Serial Port Closed
[20200609 7:16:38.793 PDT] [ozw.daemon] [info]: Shutting Down  "ozwdaemon"  Version:  "0.1.18"
[20200609 7:16:38.793 PDT] [ozw.daemon] [info]: OpenZWave Version:  "1.6.1137"
[20200609 7:16:38.794 PDT] [ozw.daemon] [info]: QT-OpenZWave Version:  "1.2.0"
[20200609 7:16:38.794 PDT] [ozw.daemon] [info]: QT Version:  5.12.5

$ date
Tue 09 Jun 2020 08:41:08 AM PDT

ps -efjH shows all the container processes still running from yesterday, except for ozwdaemon.

root     11698   542 11698   542  0 Jun08 ?        00:00:02     containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/8d70cc7192753f853799aa468130eabca76ed42c74aa6a570032aa5010cd6452 -address /run/containerd/containerd.sock -conta
root     11717 11698 11717 11717  0 Jun08 ?        00:00:10       /usr/bin/python3 /usr/bin/supervisord -c /etc/supervisord.conf
root     12087 11717 12087 11717  0 Jun08 ?        00:00:00         /usr/bin/python3 /opt/supervisord-watchdog
root     12089 11717 12089 11717  0 Jun08 ?        00:00:00         /usr/local/bin/ozwadmin -platform vnc:port=5901
root     12090 11717 12090 11717  0 Jun08 ?        00:00:03         /usr/bin/python3 /usr/bin/websockify --web /opt/novnc/ 7800 127.0.0.1:5901
Fishwaldo commented 4 years ago

2020-06-09 07:16:39,694 INFO exited: ozwdaemon (exit status 0; expected)

Bah. Forgot to update some of the exit status codes. Should be fixed now.

But - While "fixing" the restart issues - I'm more interested to know why OZWdaemon exited overnight for you? Was there a issue/crash/bug etc?

kpine commented 4 years ago

Ah, good question. I currently run ozwdaemon in VirtualBox on my desktop PC. I put the PC to sleep at night, so naturally the VM is paused. When I wake it up in the morning, the VM resumes, and ozwdaemon thinks it disconnected from the MQTT broker. I don't know if the broker has closed the connection from its end, or if ozwdaemon closed it? Either way, I don't find the behavior unexpected. I am assuming (I have not checked) that when I was using the standalone ozwdaemon image, it was just exiting in this scenario and docker was restarting it.

Fishwaldo commented 4 years ago

Ahhh ok.

I don't know if the broker has closed the connection from its end, or if ozwdaemon closed it?

Probably both. The broker would timeout the connection after your machine sleeps and close, and when it resumes, ozwdaemon would try to send something, see the connection is broken and close.

The watchdog script looks for "failure" codes or crashes and will act on that. Graceful shutdowns or successful exits wouldn't trigger it. There was still a few code paths that were exiting on failure, but not emitting a error condition - hence it the watchdog didn't act.

Lets see what bugs tomorrow uncovers shall we?

kpine commented 4 years ago

I think I can confirm this is working. Updated to openzwave/ozwdaemon:allinone-build-19 and paused the VM while I was at lunch. After pausing, the OpenZWave status was "offline", and when I unpaused it quickly came back online.

Looks like the watchdog is now detecting the error exit (exit(-1)) and restarting the process.

2020-06-09 11:41:53,225 INFO spawned: 'ozwdaemon' with pid 28
2020-06-09 11:42:03,834 INFO success: ozwdaemon entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2020-06-09 12:47:27,069 INFO exited: ozwdaemon (exit status 255; not expected)
2020-06-09 12:47:28,077 INFO spawned: 'ozwdaemon' with pid 35
2020-06-09 12:47:38,999 INFO success: ozwdaemon entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
Fishwaldo commented 4 years ago

Thanks for testing!