dresden-elektronik / deconz-rest-plugin

deCONZ REST-API plugin to control ZigBee devices
BSD 3-Clause "New" or "Revised" License
1.88k stars 483 forks source link

deCONZ websocket is losing connection each Sunday night - with site effects for healthy sleep ;-) #4670

Closed djiwondee closed 3 years ago

djiwondee commented 3 years ago

name: Bug report about: Report a bug within the deCONZ REST-API title: '' labels: Bug report assignees: ''


Describe the bug

I'm running a Raspbee II on a raspberry 4B nativ together wired connected to a Homematic IP CCU3, running node-red on it. The Raspi and the CCU3 have a wired ethernet connection to a switch for separating the smart home stuff to a separate network:

image

The router (AVM fritz.box) is always on. Its VDSL reconnect happens each night around 1:30). The Raspi is also up and running all time (last reboot was 20 days ago)

Steps to reproduce the behavior

Crazy symptom: each Sunday night around 3:18, I get some lights magically switched on, which are controlled by an Aqara button, connected to deCONZ RaspBee II. There is no indication whether my network is facing a disconnect at this time.

Expected behavior

Whatever it causes, I would expect a websocket connection loss not getting lights switched on at night.

Screenshots

n/a

Environment

deCONZ Logs

I've mined already the logs of node-red and also the raspi syslog and have found some interesting entries at the relevant time:

This entry appears several times in the relevant period between 3:10 and 3:30 in the Node-red-log:

Mar 28 03:14:43 ccu3 node-red[9324]: [deconz-server:deCONZ raspi] WebSocket connection timeout, reconnecting
Mar 28 03:14:53 ccu3 node-red[9324]: [deconz-server:deCONZ raspi] WebSocket opened

raspi syslog:

This entry appears each 7 min the entire day:

Mar 28 03:17:14 raspi dhcpcd[521]: eth0: Router Advertisement from {{IPV6-addrs}}
Mar 28 03:22:25 raspi dhcpcd[521]: eth0: Router Advertisement from {{IPV6-addrs}}

But the symptom of switched lights to on happens only on Sunday night. This will probably not have an impact since the raspi with the Rasbee II hat has a sticker IP.

This more interesting one, looks more relevant to me:

Mar 28 03:10:43 raspi deCONZ-update2.sh[360]: Error: database is locked
Mar 28 03:13:09 raspi Node-RED[6230]: 28 Mar 03:13:09 - [warn] [deconz-server:deCONZ raspi] WebSocket connection timeout, reconnecting
Mar 28 03:13:19 raspi Node-RED[6230]: 28 Mar 03:13:19 - [info] [deconz-server:deCONZ raspi] WebSocket opened
Mar 28 03:13:42 raspi Node-RED[6230]: 28 Mar 03:13:42 - [warn] [deconz-server:deCONZ raspi] WebSocket connection timeout, reconnecting
Mar 28 03:13:52 raspi Node-RED[6230]: 28 Mar 03:13:52 - [info] [deconz-server:deCONZ raspi] WebSocket opened
Mar 28 03:14:46 raspi Node-RED[6230]: 28 Mar 03:14:46 - [warn] [deconz-server:deCONZ raspi] WebSocket connection timeout, reconnecting
Mar 28 03:14:56 raspi Node-RED[6230]: 28 Mar 03:14:56 - [info] [deconz-server:deCONZ raspi] WebSocket opened
Mar 28 03:15:39 raspi Node-RED[6230]: 28 Mar 03:15:39 - [warn] [deconz-server:deCONZ raspi] WebSocket connection timeout, reconnecting
Mar 28 03:15:49 raspi Node-RED[6230]: 28 Mar 03:15:49 - [info] [deconz-server:deCONZ raspi] WebSocket opened
Mar 28 03:18:03 raspi deCONZ-update2.sh[360]: Error: database is locked

The deCONZ-update2.sh[360]: Error: database is locked appears also several times across the day, but the interesting one is, the websocket loss at night is the only one, encapsulated by it between 3:10 and 3:18 at night!

Any ideas what wakes me up each Sunday at night, always the same time, triggered by that? Please note again: problem is only at each Sunday Night! Rest of the week the setup works as expected! Thanks in advance for your support?

Additional context

On the figure above you see also a Philips Hue bridge, with only two bulbs. Please note: there are 15 bulbs connected to the Hue bridge. Only two of it (the two controlled by the Aqara button using deCONZ showing that behavior. There are no bulbs connected to the RaspBee II directly.

ebaauw commented 3 years ago

What’s the value of lastboot on the lights that turn on Sunday night? Do you see any device announcement messages for those lights in the log when they turn on?

djiwondee commented 3 years ago

Hello Erik,

How do I get the 'lastboot' information? The raspi 'uptime' command said 19 days as per yesterday.

ebaauw commented 3 years ago

Oh, damn, the lights are connected to the Hue bridge, aren’t they? Do I understand correctly:

This smells like a power drop for the network switch and the two lights. Hence my question if the lights reboot. When connected to a Hue bridge, the only way to confirm that is catching them redhanded with a Zigbee sniffer.

Alternatively, it could be an issue with your automation in NodeRed, somehow seeing a buttonevent when the websocket connection is re-established. Do you see any websocket notifications in the deCONZ log that it issues a buttonevent after the connection is re-established?

djiwondee commented 3 years ago

Hello @ebaauw ,

you assumptions above are right!

This smells like a power drop for the network switch and the two lights.

Power drop for the network switch can be excluded as well. In my initial setup there were no network switch, but Raspi, CCU3 and Hue hub connect by ethernet to the router. I've included the switch to really separate the smart home stuff to a separate network, especi get to proof there is no network disconnect, caused by the router.

Do you see any websocket notifications in the deCONZ log that it issues a buttonevent after the connection is re-established?

As far as I understood, a deCONZ-log is not created at runtime. My setup is not typically running in debug mode, especially on Sunday Night. But I'm going to check the node-red-log and the syslog of my Raspi once again for it. Give me please some time for this today.

What triggers the execution of deCONZ-update2.sh at Night and for what reason?

ebaauw commented 3 years ago

No clue. There some scripts for the (old?) standalone gateway solution, which don't make sense for an installation of deCONZ on a system managed by yourself. There's several issues requesting to disable these, but they are re-enbled when you update deCONZ.

djiwondee commented 3 years ago

But I'm going to check the node-red-log

Attached you find the Raspi's syslog and the CCU's log. Node-red on CCU is not providing detailed logging when running on info -level. So I would need to enable debug next Saturday evening to get infos for the Night to Sunday.

ccu log 3 to 5 am.log raspi syslog 3 to 5 am.log

ebaauw commented 3 years ago

Looks like some weekly cleanup scripts run on the Pi around that time, including one for Pihole? Maybe that causes the lost web socket connection?

According to the log, you're also running NodeRed on the Pi?

djiwondee commented 3 years ago

According to the log, you're also running NodeRed on the Pi?

Node-red on the Pi is my playground for dev and test purposes.

including one for Pihole? Maybe that causes the lost web socket connection?

Could be and I need to verify this, but I don't think so...

ebaauw commented 3 years ago

But that one also loses the websocket connection to deCONZ?! Is it connecting through localhost or through the external IP address (which should probably be routed over localhost anyways). Should rule out any networking issues with cabling, routers, or switches. I raise my bet on pihole.

djiwondee commented 3 years ago

Found this in the pihole-ftl.log (very less entries, just 9 for the whole day)

[2021-03-28 03:15:47.803 10892M] Reloading DNS cache
[2021-03-28 03:15:47.806 10892M] Blocking status is enabled

But timestamp is after the lights have switched on. On the other hand I think if it really automatically loads the and rebuild the blocklists, it could fit.

djiwondee commented 3 years ago

deCONZ?! Is it connecting through localhost or through the external IP address (which should probably be routed over localhost anyways).

My deconz Server node on Node-red is configured using the fix IP address in its config:

image

No address resolution needed.

djiwondee commented 3 years ago

@ebaauw I wonder, there is no entry in the cron tab of the Raspi, but pi-hole gets update of block list in the relevant time period. See this info for reference.

...
-rw-r--r--   1 root   root      8156345 Mar 28 03:13 list.51.www.github.developerdan.com.domains
-rw-r--r--   1 root   root        33459 Mar 28 03:13 list.52.raw.githubusercontent.com.domains
-rw-r--r--   1 root   root        33405 Mar 28 03:13 list.53.raw.githubusercontent.com.domains
...

At the end of the block list update the renewed blocklist data base gets swapped and then the DNS entries get flushed from the cache. Could be indeed a side effect...

ebaauw commented 3 years ago

Try and run the script manually, and see what happens.

djiwondee commented 3 years ago

@ebaauw

At least I have identified the trigger process of the network loss. It is provably repeatable on my setup and the winner is: surprise - pihole!

It happens while gravity update (pihole updateGravity), which is executed based on a cron job scheduled by the daemon:

# Pi-hole: Update the ad sources once a week on Sunday at a random time in the
#          early morning. Download any updates from the adlists
#          Squash output to log, then splat the log to stdout on error to allow for
#          standard crontab job error handling.
12 3   * * 7   root    PATH="$PATH:/usr/sbin:/usr/local/bin/" pihole updateGravity >/var/log/pihole_updateGravity.log || cat /var/log/pihole_updateGravity.log

I have issued the command on the console so many time before manually and have never recognized this.

As you may know, the update gravity script of pihole has several steps to perform. One of the steps is a task called [✓] Building tree. The network loss happens exactly within this step. I found this out while running the pihole updateGravity command in one terminal window, while watching the output of the node-red-log (tail -f ...) on a second one.

But - what should I say - the lights do not switch on when testing it manually. Any further ideas?

ebaauw commented 3 years ago

As mentioned above: check the deCONZ and NodeRed logs, preferably in debug mode, around the time the lights turn on. In theory, it could also be a schedule or rule on the Hue bridge.

djiwondee commented 3 years ago

@ebaauw

As mentioned above: check the deCONZ and NodeRed logs

Here is the debug log just created while executing the gravityUpdate manually. I don't see more different related to deCONZ than just again the

Mar 30 11:06:43 ccu3-wineglassbay daemon.warn node-red[6252]: [deconz-server:deCONZ raspi-smarthome] WebSocket connection timeout, reconnecting
Mar 30 11:06:54 ccu3-wineglassbay daemon.info node-red[6252]: [deconz-server:deCONZ raspi-smarthome] WebSocket opened
Mar 30 11:08:47 ccu3-wineglassbay daemon.warn node-red[6252]: [deconz-server:deCONZ raspi-smarthome] WebSocket connection timeout, reconnecting
Mar 30 11:08:58 ccu3-wineglassbay daemon.info node-red[6252]: [deconz-server:deCONZ raspi-smarthome] WebSocket opened
Mar 30 11:09:21 ccu3-wineglassbay daemon.warn node-red[6252]: [deconz-server:deCONZ raspi-smarthome] WebSocket connection timeout, reconnecting
Mar 30 11:09:31 ccu3-wineglassbay daemon.info node-red[6252]: [deconz-server:deCONZ raspi-smarthome] WebSocket opened
Mar 30 11:12:09 ccu3-wineglassbay daemon.info node-red[6252]: [deconz-server:deCONZ raspi-smarthome] WebSocket connection closed
Mar 30 11:12:20 ccu3-wineglassbay daemon.debug node-red[32716]: Module: node-red-contrib-deconz 1.3.2
Mar 30 11:12:20 ccu3-wineglassbay daemon.debug node-red[32716]:         /usr/local/addons/redmatic/var/node_modules/node-red-contrib-deconz
Mar 30 11:12:33 ccu3-wineglassbay daemon.info node-red[32716]: [deconz-server:deCONZ raspi-smarthome] WebSocket opened
Mar 30 11:17:39 ccu3-wineglassbay daemon.warn node-red[32716]: [deconz-server:deCONZ raspi-smarthome] WebSocket connection timeout, reconnecting
Mar 30 11:17:49 ccu3-wineglassbay daemon.info node-red[32716]: [deconz-server:deCONZ raspi-smarthome] WebSocket opened
Mar 30 11:19:48 ccu3-wineglassbay daemon.warn node-red[32716]: [deconz-server:deCONZ raspi-smarthome] WebSocket connection timeout, reconnecting
Mar 30 11:19:58 ccu3-wineglassbay daemon.info node-red[32716]: [deconz-server:deCONZ raspi-smarthome] WebSocket opened

CCU-node-red-.2021-03-30T09_20_31.661Z.log

ebaauw commented 3 years ago

Sorry, that doesn’t mean anything to me. You want a log of the websocket notifications sent by deCONZ, the websocket notifications received by NodeRed, and the Hue bridge API calls made by NodeRed.

djiwondee commented 3 years ago

Hello @Mimiix @ebaauw

interesting finding regarding ping and timeout, I would guess.

Would you like to share an assessment of it? Thanks!

Mimiix commented 3 years ago

I have no clue

@manup can you help out?

djiwondee commented 3 years ago

What if I just raise the pingTimeout to, hm lets say to 5000? What other impact may this cause?

pingTimeout = 3000 https://github.com/deconz-community/node-red-contrib-deconz/blob/273dac0c10db1195c4068e20ae10727d5528552c/lib/deconz-socket.js#L11

    ping() {
        if (this.isReady) {
            this.socket.ping('Hi?');
            this.awaitPong = setTimeout(() => {
                this.emit('pong-timeout');
                this.socket.terminate();
            }, this.pingTimeout);
        }
    }

https://github.com/deconz-community/node-red-contrib-deconz/blob/273dac0c10db1195c4068e20ae10727d5528552c/lib/deconz-socket.js#L83-L91

        onSocketPongTimeout() {
            var that = this;
            that.warn('WebSocket connection timeout, reconnecting');
            that.emit('onSocketPongTimeout');
        }

https://github.com/deconz-community/node-red-contrib-deconz/blob/ce60345a02ff072b7b78cfd7961f465c18edc7c0/nodes/server.js#L187-L191

manup commented 3 years ago

As @ebaauw suggested it would be helpful to look at the deCONZ logs at the time where the issue happens.

https://github.com/dresden-elektronik/deconz-rest-plugin/wiki/deCONZ-debug-flags and here --dbg-info=1 and -dbg-aps=1 are relevant.

The Websocket loss shouldn't trigger anything to turn on lights. I'd turn off or disable Node Red for the Sunday night test to rule out any command coming from there. And perhaps as next test even close deCONZ to rule out the switch sending some funky stuff on his own :)

Zehir commented 3 years ago

What if I just raise the pingTimeout to, hm lets say to 5000? What other impact may this cause?

pingTimeout = 3000 https://github.com/deconz-community/node-red-contrib-deconz/blob/273dac0c10db1195c4068e20ae10727d5528552c/lib/deconz-socket.js#L11

    ping() {
        if (this.isReady) {
            this.socket.ping('Hi?');
            this.awaitPong = setTimeout(() => {
                this.emit('pong-timeout');
                this.socket.terminate();
            }, this.pingTimeout);
        }
    }

https://github.com/deconz-community/node-red-contrib-deconz/blob/273dac0c10db1195c4068e20ae10727d5528552c/lib/deconz-socket.js#L83-L91

        onSocketPongTimeout() {
            var that = this;
            that.warn('WebSocket connection timeout, reconnecting');
            that.emit('onSocketPongTimeout');
        }

https://github.com/deconz-community/node-red-contrib-deconz/blob/ce60345a02ff072b7b78cfd7961f465c18edc7c0/nodes/server.js#L187-L191

I don't think there is other impact on changing the timeout. I did not check this part yet

Zehir commented 3 years ago

The Websocket loss shouldn't trigger anything to turn on lights. I'd turn off or disable Node Red for the Sunday night test to rule out any command coming from there. And perhaps as next test even close deCONZ to rule out the switch sending some funky stuff on his own :)

First you should dont have a websocket that lost connexion. Can you test to connect on the websocket from your computer and see if you got disconnected too ?

The websocket loss could trigger the light on in some cases and that a thing I want to fix. If you have a button in deconz-node-red on input node. When the socket reconnect he re-send the last event so if the last action of a button was to turn on a light the same event will be sent again on socket reconnect. So it could turn on a light depending of your node-red graph. To avoid that you should uncheck the box "send on start" on the in node. This will stop this behavior for the first output of the in node. If you are using the homekit output you are screwed and need to wait for this PR.

djiwondee commented 3 years ago

@Zehir

This will stop this behavior for the first output of the in node. If you are using the homekit output you are screwed and need to wait for this PR.

Thanks for your reply on that. I'm using indeed the 2nd output connected to a node-red-contribution-homekit-bridged node.

  1. My workaround so far is changing the weekly cron job of the pi-hole gravity update, what is triggering this behavior with its resource consuming sqllite-job, to a different (more friendly) time.
  2. I think about to setup pi-hole in a container or setup a further raspi to separate pi-hole at all. No decision on that right now.

Do you have a glue how long it will take for PR#120?

Zehir commented 3 years ago

@Zehir

This will stop this behavior for the first output of the in node. If you are using the homekit output you are screwed and need to wait for this PR.

Thanks for your reply on that. I'm using indeed the 2nd output connected to a node-red-contribution-homekit-bridged node.

  1. My workaround so far is changing the weekly cron job of the pi-hole gravity update, what is triggering this behavior with its resource consuming sqllite-job, to a different (more friendly) time.
  2. I think about to setup pi-hole in a container or setup a further raspi to separate pi-hole at all. No decision on that right now.

Do you have a glue how long it will take for PR#120?

Not really, I did most part but I spent some times on Factorio, hope soon

Mimiix commented 3 years ago

@Zehir Is this a Node-red issue?

Zehir commented 3 years ago

@Zehir Is this a Node-red issue?

Only part of it. The main problem is the socket connexion lost and result of bad events sent by node red on reconnect. I already fixed that in 120 but not released yed. The workaround for now is to add a node after to filter the messages by checking if the last event if within 10 seconds

djiwondee commented 3 years ago

The workaround for now is to add a node after to filter the messages by checking if the last event if within 10 seconds

image

...and using output 1 with "Send state on start" checkbox disabled. The state of the bulbs is persisted in the flow since I need it for an applied bulb dimming feature. The deCONZ connected button for controlling the lights represented by the upstream HomeKit node is implemented as a "stateless programmable switch". I would not expect a problem without getting state on start from the deCONZ connected button. So my lights will hopefully not switch on next Sunday night.

Zehir commented 3 years ago

The workaround for now is to add a node after to filter the messages by checking if the last event if within 10 seconds

image

...and using output 1 with "Send state on start" checkbox disabled. The state of the bulbs is persisted in the flow since I need it for an applied bulb dimming feature. The deCONZ connected button for controlling the lights represented by the upstream HomeKit node is a "stateless programmable switch". I would not expect a problem without getting state on start from the deCONZ connected button. So my lights will hopefully not switch on next Sunday night.

Yes but that state should not have a button event because it's an old event

github-actions[bot] commented 3 years ago

As there has not been any response in 21 days, this issue has been automatically marked as stale. At OP: Please either close this issue or keep it active It will be closed in 7 days if no further activity occurs.

djiwondee commented 3 years ago

@Zehir

I already fixed that in 120 but not released yet.

Since I'm using the changed flow setup, to problem is bypassed. But the issue may still exist. Are there any new plans for getting the fix?

Zehir commented 3 years ago

Yes I have a plan for 2.0 to add a setting for that

github-actions[bot] commented 3 years ago

As there has not been any response in 21 days, this issue has been automatically marked as stale. At OP: Please either close this issue or keep it active It will be closed in 7 days if no further activity occurs.

github-actions[bot] commented 3 years ago

As there has not been any response in 28 days, this issue will be closed. @ OP: If this issue is solved post what fixed it for you. If it is not solved, request to get this opened again.