Koenkk / zigbee-herdsman

A Node.js Zigbee library
MIT License

ConBee II network becoming unresponsive #275

Closed ndfred closed 3 years ago

ndfred commented 3 years ago

As part of investigating #273 I managed to make my remotes (pressing a button wouldn't be reported in the Z2M logs) and network (MQTT requests to switch lights on and off wouldn't work) unresponsive. All my remotes that use binding (meaning they don't go through the coordinator), including the ones controlling lights I tried to switch off through MQTT, are still working in that situation. The Zigbee2MQTT logs in non-debug mode are just silent in that case.

I will try to reproduce the issue over the next few days, once I have backup remotes bound to all of my rooms and do not lose control of my lights late in the evening (which has happened many times over the past months).

What I have tried so far (see my comments in #72):

The commands I am sending to my group when the remote is pressed:

{
  "state": "ON",
  "brightness": 254,
  "color_temp": 366,
  "color": {
    "rgb": "255,205,120"
  },
  "transition": 0.4
}

Even though I am using a group, this will send 4 commands, as each of these needs to be sent separately, which may be why I am overwhelming my network. I am also running with delays set to 0 as described in #273, which may be a factor too.
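As a rough illustration of why one payload turns into several commands, here is a minimal sketch that splits the JSON above into one command per attribute. The `splitPayload` helper and the `LightCommand` type are hypothetical and are not Zigbee2MQTT's converter code. The sketch assumes that `transition` is a parameter attached to each command rather than a command of its own.

```typescript
// Hypothetical types for illustration only; not the Zigbee2MQTT API.
type Payload = Record<string, unknown>;

interface LightCommand {
  attribute: string;
  value: unknown;
  transition?: number;
}

// Split a combined payload into one command per attribute.
// Assumption: "transition" modifies every command instead of being sent separately.
function splitPayload(payload: Payload): LightCommand[] {
  const { transition, ...attributes } = payload;
  return Object.entries(attributes).map(([attribute, value]) => ({
    attribute,
    value,
    ...(typeof transition === "number" ? { transition } : {}),
  }));
}

const commands = splitPayload({
  state: "ON",
  brightness: 254,
  color_temp: 366,
  color: { rgb: "255,205,120" },
  transition: 0.4,
});

console.log(commands.length); // 4 commands for a single button press
```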

As I mentioned, I am opening up this issue to describe the symptoms and investigate the issue myself, if anyone else sees this too please share your experience.

I use a Conbee II with firmware 0x26680700.

ndfred commented 3 years ago

Not exactly related, but I am seeing a ton of messages like this when trying to dim lights with an IKEA remote that is bound to a group:

Dec 22 12:46:42 bagend npm[11773]: (node:11785) UnhandledPromiseRejectionWarning: Error: Read 0x0017880108da33aa/11 genLevelCtrl(["currentLevel"], {"timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":true,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null}) failed (no response received)
Dec 22 12:46:42 bagend npm[11773]:     at DeconzAdapter.<anonymous> (/srv/zigbee2mqtt/node_modules/zigbee-herdsman/dist/adapter/deconz/adapter/deconzAdapter.js:556:23)
Dec 22 12:46:42 bagend npm[11773]:     at Generator.throw (<anonymous>)
Dec 22 12:46:42 bagend npm[11773]:     at rejected (/srv/zigbee2mqtt/node_modules/zigbee-herdsman/dist/adapter/deconz/adapter/deconzAdapter.js:25:65)
Dec 22 12:46:42 bagend npm[11773]:     at runMicrotasks (<anonymous>)
Dec 22 12:46:42 bagend npm[11773]:     at runNextTicks (internal/process/task_queues.js:62:5)
Dec 22 12:46:42 bagend npm[11773]:     at processTimers (internal/timers.js:494:9)
Dec 22 12:46:42 bagend npm[11773]: (node:11785) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 28)

Looks like the coordinator is asking the remote for its status but the remote won't answer within 10s. Which is fine, but causing an exception like this rather than catching the timeout is odd, and I would be curious why we've been asking for status in the first place. Maybe this is the kind of traffic that causes the whole network to get stuck, adding that to my list of things to investigate.
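To make the "catch the timeout" point concrete, here is a minimal sketch of a read that times out and is handled instead of surfacing as an unhandled rejection. `withTimeout`, `readCurrentLevel` and `safeRead` are hypothetical names, the device is simulated by a promise that never settles, and none of this is the actual deconzAdapter code.

```typescript
// Reject with a descriptive error if `promise` does not settle within `ms`.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} failed (no response received)`)),
      ms,
    );
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Hypothetical stand-in for a device that never answers.
function readCurrentLevel(): Promise<number> {
  return new Promise<number>(() => {});
}

// Catch the timeout and report it instead of leaving the rejection unhandled.
async function safeRead(timeoutMs: number): Promise<number | null> {
  try {
    return await withTimeout(readCurrentLevel(), timeoutMs, "Read genLevelCtrl currentLevel");
  } catch (error) {
    console.warn(`Ignoring read failure: ${(error as Error).message}`);
    return null;
  }
}

// A short timeout keeps the demo quick; the log above shows 10000 ms.
safeRead(50).then((level) => console.log(level)); // logs null after the timeout
```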

MattWestb commented 3 years ago

0x0017880108da33aa is not an IKEA device but a Philips Hue one, judging by the 0x00178801XXX prefix. It is also not a status request: it is a device rejoining the network (I have one Hue motion sensor that does this very often and another that is not doing it at the moment).

ndfred commented 3 years ago

You're right, that's actually the new bulb I've paired previously that I can't get an OTA update on. Any idea why some devices will send messages like that @MattWestb?

MattWestb commented 3 years ago

I only have one Philips Hue bulb. It follows the Zigbee Light Link standard and works OK with my other Zigbee 3 routers. It cannot use the advanced Zigbee 3 functions as a router, but that should not cause any problems.

Your Hue bulb is leaving the network, or being kicked out of it, and trying to rejoin. That can cause commands to be forwarded incorrectly or delayed, because routes stop working and new routes to devices have to be found.

First try rejoining it without resetting it. If that does not work well, delete it in Z2M, reset it and then join it "clean". Once it is in, try updating it if OTA files are available for it.

ndfred commented 3 years ago

Latest update from migrating my whole setup to scenes yesterday:

Next step now that my network is seemingly stable would be to try and reproduce it shutting down by sending a storm of commands and expecting my remotes and MQTT commands to become unresponsive. I will also try and get reporting working. I will post here once that is done.

MattWestb commented 3 years ago

There are 2 problems with storming the mesh. First, the network layer limits broadcasts to around 1.2 per second (this is not Zigbee itself but the network under it, which implements the limit so that a broadcast storm cannot kill the network). Second, Silabs-based devices (IKEA, Tuya, Sonoff, the new Philips bulbs with BT and so on) have a bug that can kill the network stack in a router device when it receives a parent announcement (a broadcast that is sent when a router device comes back into the network). One ZHA dev killed his network by switching the power off and on for half of it, and some router devices ended up in a locked state. Another one was flooding his network (after implementing a flooding function in ZHA) and was able to trigger the bug after some minutes. The bug was fixed some weeks ago in the SDK, but the companies have not implemented the fix yet. So be a little careful when "storming" your network!

Good work done !!!

ndfred commented 3 years ago

Thanks for all the context @MattWestb! Do you have links to all these discussions so I can understand the problem a little better?

And in our case, we wouldn't be using broadcast messages but rather unicast group commands (I don't know if there is a difference in Zigbee, just trying to map that to my network background). What exactly is limited to 1.2 per second? Really interested in the bug present in the new bulbs as well, and any flaw inherent to Zigbee rather than the coordinator / Conbee II that you could point out, just to make sure we are fair when evaluating issues coming from the stick vs general Zigbee limitations.

My scenario would be to send commands to a group of 5 bulbs, like the one I mentioned in my initial comment in this thread. This is realistic, and could help us dial the Conbee II message queue delays just right to avoid a network meltdown, as the coordinator really is causing these in this case. Something like: have the first message go out right away, but then do what we can to avoid sending broadcast messages faster than once every 1.2s.
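The pacing idea can be sketched roughly as follows. `BroadcastPacer` is a hypothetical helper, not part of zigbee-herdsman, and the 1200 ms interval is only the figure floated in this thread, not a verified limit. The clock is passed in as a number so the behaviour is easy to check.

```typescript
// Computes when each broadcast may leave, keeping at least `minIntervalMs`
// between consecutive sends. The first message is never delayed.
class BroadcastPacer {
  private lastSendAt = -Infinity;
  private readonly minIntervalMs: number;

  constructor(minIntervalMs: number) {
    this.minIntervalMs = minIntervalMs;
  }

  // Returns the time (ms) at which a message requested at `now` may be sent.
  schedule(now: number): number {
    const sendAt = Math.max(now, this.lastSendAt + this.minIntervalMs);
    this.lastSendAt = sendAt;
    return sendAt;
  }
}

// Four group commands requested back to back at t = 0 ms.
const pacer = new BroadcastPacer(1200);
const sendTimes = [0, 0, 0, 0].map((t) => pacer.schedule(t));
console.log(sendTimes); // [0, 1200, 2400, 3600]
```

In a real driver the returned time would feed a `setTimeout`, and only broadcast or group traffic would go through the pacer.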

Edit: interesting Silabs doc AN1138: Zigbee Mesh Network Performance

MattWestb commented 3 years ago

The start is https://github.com/dresden-elektronik/deconz-rest-plugin/issues/1261#issuecomment-739313960 with the findings made by ZHA devs in the deCONZ git :-))

The broadcast limiting is in the underlying IEEE 802.15.4 layer. I cannot find the exact wording at the moment, but it is there and well known in Zigbee and other IEEE 802.15.4 networks.

Sending commands to groups is technically done with broadcasts in the IEEE 802.15.4 layer, which lets the commands get around broken routes that the mesh has not yet repaired. Unicast means directly addressed commands to one device. They must be routed through the mesh, and if there are routing problems they fail or are delayed while the mesh finds a new open route for delivering the command.

"Unicast group commands" is wrong, since group commands are always broadcast in Zigbee.

Using groups for lights and sending group commands from bound controller devices is the core part of Zigbee lighting, and it also keeps the controls working when the coordinator is offline. If you try to use unicast for all commands, it is doomed as soon as the first route fails in the network (which can be caused by temporary radio interference or failing router devices), but it can work well if you have a very stable mesh with good routers (no old OSRAM and so on).

MattWestb commented 3 years ago

One known good UK reference: https://github.com/zsmartsystems/com.zsmartsystems.zigbee/issues/1096 It is possible to dig deeper, but it is not necessary, since the fact is that this is a limit in the underlying layer of the network protocols.

ndfred commented 3 years ago

Thanks a lot for all this context again, I'll read through it all! For the purpose of Conbee II network stability then, we would allow any directed packets (commands to a specific bulb, status requests...) to go unthrottled but broadcast packets (group commands basically) to only be allowed to be sent once a second? As the zsmartsystems issue you mentioned notes, we would need to know which broadcast packets are currently circulating in the network to do a good job there too.

The latest Conbee II firmwares now default to source routing and use a Zigbee 3 stack, which differs from the default CC2530 Z2M configuration and could be a factor. In this conversation people are seeing the automatic source routing algorithm sometimes fail to configure the network properly, and they've implemented a manual source routing configuration option in deCONZ.

The Conbee serial protocol specs are available here, and might be a good source of information to diagnose instability in the network.

MattWestb commented 3 years ago

deCONZ did not have source routing enabled / implemented before the change made in September to the firmware and the core program. Source routing is part of the Zigbee standard and helps routing in the mesh when it has trouble getting around bad routes (or when, as with the CC-253x, the coordinator does not have enough resources to keep the device routes in its routing table). It can be good, but it also disables some of the self-healing mechanisms of the mesh network when routes go bad.

The broadcast limiting is done in each device's stack and underlying layer, so it applies per device (the ConBee can only do 8 broadcasts in 8 seconds because its stack limits it). Other devices can still broadcast in the mesh (for example a remote sending toggle to a group), and that works well because that device is not exceeding the broadcast limit in its own stack.
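A per-device limit of "8 broadcasts in 8 seconds" behaves like a sliding-window limiter. The sketch below is only an illustration of that idea: `SlidingWindowLimiter` is a hypothetical class, and whether the real firmware drops or queues the excess messages is not known from this thread (here they are simply refused).

```typescript
// Sliding-window limiter: at most `maxCount` sends within any `windowMs` span.
class SlidingWindowLimiter {
  private readonly sentAt: number[] = [];
  private readonly maxCount: number;
  private readonly windowMs: number;

  constructor(maxCount: number, windowMs: number) {
    this.maxCount = maxCount;
    this.windowMs = windowMs;
  }

  // Returns true and records the send if a broadcast is allowed at time `now`.
  tryAcquire(now: number): boolean {
    // Forget sends that have left the window.
    while (this.sentAt.length > 0 && now - this.sentAt[0] >= this.windowMs) {
      this.sentAt.shift();
    }
    if (this.sentAt.length >= this.maxCount) return false;
    this.sentAt.push(now);
    return true;
  }
}

const limiter = new SlidingWindowLimiter(8, 8000);
// Ten broadcasts attempted 100 ms apart: the first 8 pass, the last 2 are refused.
const results = Array.from({ length: 10 }, (_, i) => limiter.tryAcquire(i * 100));
console.log(results.filter(Boolean).length); // 8
console.log(limiter.tryAcquire(8000)); // true: the send at t = 0 has left the window
```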

All of this happens below the Zigbee layer and cannot be changed. Only when it is abused do we see the result in our systems.

All the timing and queuing is done in the firmware and in the radio part of the host application, which tries to strike a good compromise between rapid response and not losing or delaying packets in the system.

Don't forget that IEEE 802.15.4 / Zigbee is still low-bandwidth radio communication, not gigabit fiber Ethernet without security overhead.

MattWestb commented 3 years ago

By the way, if you get a Zigbee 3 network working with routers that can do their job as Zigbee 3 routers, it is very solid. I have moved my HOMA dimmers from ZHA because they are "Chinese Zigbee 3", meaning they do not route traffic as Zigbee 3 devices, behave like old ZHA routers and cause problems with routing traffic in the mesh. My old Philips Hue works as a ZLL router and is a little problematic with other routers and end devices. My old IKEA bulbs are also ZLL but work very well as routers and also as parents for Xiaomi sensors, which normally have problems and leave the network. New IKEA (ZB3) and Tuya routers look very solid in a real Zigbee 3 network and keep many routes open for rerouting traffic if routes in the mesh go bad (which the HOMA and old Hue do not do).

ndfred commented 3 years ago

Kind of reassuring that the Conbee firmware would enforce the "no more than 8 broadcasts in 8 seconds" rule, though what happens beyond that? Does it drop messages or just queue them? And given this is the case, do you think there is any point in implementing a second queuing mechanism / artificial delays in the Herdsman driver for the Conbee? Feels to me like you would want to 1. send commands as quickly as possible to the Conbee, having it deal with throttling 2. have robust logging back from the stick to tell you when throttling is happening and if it detects any errors / high traffic etc...

ndfred commented 3 years ago

Odd, if I run dmesg -H -w I will see messages like this:

[Dec23 15:33] usb 1-1.4: USB disconnect, device number 21
[  +0.000671] cdc_acm 1-1.4:1.0: failed to set dtr/rts
[  +0.296626] usb 1-1.4: new full-speed USB device number 22 using xhci_hcd
[  +0.139160] usb 1-1.4: New USB device found, idVendor=1cf1, idProduct=0030, bcdDevice= 1.00
[  +0.000021] usb 1-1.4: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[  +0.000017] usb 1-1.4: Product: ConBee II
[  +0.000017] usb 1-1.4: Manufacturer: dresden elektronik ingenieurtechnik GmbH
[  +0.000015] usb 1-1.4: SerialNumber: DE2215203
[  +0.005261] cdc_acm 1-1.4:1.0: ttyACM0: USB ACM device
[  +3.398793] usb 1-1.4: USB disconnect, device number 22
[  +0.346782] usb 1-1.4: new full-speed USB device number 23 using xhci_hcd
[  +0.139045] usb 1-1.4: New USB device found, idVendor=1cf1, idProduct=0030, bcdDevice= 1.00
[  +0.000020] usb 1-1.4: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[  +0.000017] usb 1-1.4: Product: ConBee II
[  +0.000017] usb 1-1.4: Manufacturer: dresden elektronik ingenieurtechnik GmbH
[  +0.000016] usb 1-1.4: SerialNumber: DE2215203
[  +0.004825] cdc_acm 1-1.4:1.0: ttyACM0: USB ACM device

That would explain instabilities / Z2M being unable to send commands to the bulbs or read anything from the network. I thought this might be due to me restarting Z2M a few times but after restarting it again it doesn't appear to be the case. I'll keep an eye out for these.

Apparently the stick reboots if it doesn't receive traffic on the serial connection for a bit, maybe that's what has been happening (I did shut down Z2M while editing some files and updating the dev branch).

ndfred commented 3 years ago

Another occurrence of my network shutting down now, really strange this time around:

I do not believe the issue is due to heavy traffic coming from Z2M / the Conbee stick but have no way to really be sure. The fact that even bound remotes will stop working, but only in a specific subset of the network, has me really puzzled. I would love to be able to pull a Wireshark trace, but I'll need to order all the flashing gear before I can do that.

Update: it didn't fix itself even after a few hours, and the logs show no activity over the night

Update 2: the network recovered after 24 hours, again the only tells that the network was misbehaving were timeouts when issuing commands from the Conbee, no evidence of the Conbee stick rebooting

MattWestb commented 3 years ago

All bound remotes should normally keep working if the coordinator is offline. But if some of them are connected through the coordinator and it goes offline, they also go offline and cannot send commands until the parent is back online or they change parent to a working router. The second thing is that if your network is not very dense and one router fails, you can end up splitting the network into two parts with no connection between them, and then no unicasts or broadcasts can pass between the two parts until the mesh has found a new way around (if that is possible). My experience is that RaspBee/ConBee does not treat directly connected end devices very well (some other coordinators do the same), and it is better to have them connected through routers rather than directly to the coordinator if possible. I think you had a router device that fell asleep or was just blocked, and you lost parts of your network (it could also be the ConBee, since it is a router with extra functionality as coordinator). It is not easy to tell, but something strange happened: normally, when a router fails, the mesh should heal in seconds, not days, if healing is possible at all (if the distance is too long, it may not be possible to get around a failing route).

If you could not communicate with any of your devices (sending commands / receiving status changes), there is only one common point that can have failed: the ConBee. If the failure was only partial, it can be other routers that have misbehaved, alone or in combination with the ConBee.

ndfred commented 3 years ago

Interesting, I hadn't considered the Conbee stick not routing properly, I assumed it would be pretty solid but one or more routers in my network might not be. Running a networkmap command confirms all of the routers (Philips and IKEA lamps really) are unavailable from Z2M's perspective (with an error message Failed to execute LQI after a 60 seconds timeout). In the graph the command produces, 7 lamps have no route to the coordinator and 8 do, but there are no routes that are through routers other than the Conbee stick.

This leads me to believe there is a very serious issue with the Conbee stick, or its interaction with Z2M. This is also after having unplugged / replugged the stick and all of my bulbs that failed to communicate with Z2M initially. I will try and power off every router / bulb along with the Conbee stick and re-plug everything back in, but realistically my best option will probably be to re-pair the whole network.

It might be interesting to set up regular network surveys to make sure all the routers that are expected to be reachable are indeed there, and that their link quality figures look reasonable.
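Such a survey could be sketched along these lines. The `RouterReport` shape, the `findProblems` helper, the addresses and the LQI threshold are all hypothetical. In practice the reports would have to be built from whatever the networkmap command returns.

```typescript
// Hypothetical shape of one survey result; not the Zigbee2MQTT networkmap format.
interface RouterReport {
  ieeeAddr: string;
  reachable: boolean;
  lqi: number; // link quality indicator, 0-255
}

// Return a human-readable problem list for the routers we expect to see.
function findProblems(expected: string[], reports: RouterReport[], minLqi: number): string[] {
  const byAddr = new Map<string, RouterReport>();
  for (const report of reports) byAddr.set(report.ieeeAddr, report);

  const problems: string[] = [];
  for (const addr of expected) {
    const report = byAddr.get(addr);
    if (!report || !report.reachable) {
      problems.push(`${addr}: unreachable`);
    } else if (report.lqi < minLqi) {
      problems.push(`${addr}: low LQI ${report.lqi}`);
    }
  }
  return problems;
}

// Made-up addresses and values, for illustration only.
const problems = findProblems(
  ["0x01", "0x02", "0x03"],
  [
    { ieeeAddr: "0x01", reachable: true, lqi: 180 },
    { ieeeAddr: "0x02", reachable: true, lqi: 20 },
  ],
  50,
);
console.log(problems); // ["0x02: low LQI 20", "0x03: unreachable"]
```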

MattWestb commented 3 years ago

RaspBee/ConBee works as a Zigbee 3 coordinator and sets up the network in Zigbee 3 "mode", so it can be a little more picky about how routers behave, but as far as I know you do not have any routers known for strange behavior that makes the mesh a PITA. If a device gets out of sync with the security frame counter (a security function for blocking replay attacks), it drops all packets from its parent and the route is blocked until it is reset (which can be done by power-cycling the router). The coordinator does / should save the last valid counter across restarts and power cycles, and if it does not do that properly, it is a major problem for the network.
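The frame counter failure mode can be shown with a simplified replay check. This is a sketch of the general idea only: `FrameCounterCheck` is a hypothetical class, the counter values are made up, and the real Zigbee security processing has more to it than this.

```typescript
// Simplified replay protection: a frame is accepted only if its counter is
// higher than the last counter accepted from the same sender.
class FrameCounterCheck {
  private readonly lastSeen = new Map<string, number>();

  accept(sender: string, counter: number): boolean {
    const last = this.lastSeen.get(sender);
    if (last !== undefined && counter <= last) return false; // treated as a replay
    this.lastSeen.set(sender, counter);
    return true;
  }
}

const check = new FrameCounterCheck();
console.log(check.accept("parent", 1000)); // true
console.log(check.accept("parent", 1001)); // true

// The parent restarts without having saved its counter and starts low again:
console.log(check.accept("parent", 5)); // false
console.log(check.accept("parent", 6)); // false, and so on until the counter passes 1001
```

Every frame from the restarted parent is refused until its counter climbs past the last accepted value or the child is reset, which matches the blocked route described above.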

If / when you start building the network, you should add the routers before adding the end devices, so the mesh can build its routes and get redundancy.

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days