Koenkk / zigbee2mqtt

Zigbee 🐝 to MQTT bridge πŸŒ‰, get rid of your proprietary Zigbee bridges πŸ”¨
https://www.zigbee2mqtt.io
GNU General Public License v3.0
11.78k stars 1.64k forks source link

Question: network stability and t-shooting #1566

Closed std10k closed 4 years ago

std10k commented 5 years ago

I've been running z2m since roughly July 2018 without many issues until now. The old cc2531 firmware was fine (v 0.1 from memory or something like that - it wasn't reporting its version), the old version of z2m was fine too.
I run it all on a Pi, raspbian, no docker. Hass .92.1 at the moment. I had about a dozen devices: Xiaomi buttons and body sensors, and a ziaomi switch. Then about a month ago i started growing the network on z2m and things started to turn pretty ugly. When Ikea finally brought Tradfri to my region i bought a bunch of light globes to automate what i couldn't automate before. I had to upgrade 2531 firmware to the Feb19 version, and z2m to the latest available as they wouldn't pair otherwise.

So now I have upwards of 40 devices, a bunch of new Ikea lights (and a couple of controllers and dimmers which are paired but not quite working yet), Xiaomi switches (round and square ones, about a dozen), sensors (also about a dozen). The home is a 2 story townhouse, not too big. cc2531 could reach much farther than xiaomi hub from my observations, so range shouldn't really be an issue. As all light globes are routers, the network map is looking a bit Necronomicon.

Stability is lost for good. Some stuff is working fine, some other stuff either not working or working intermittently, or with large delays (10-20 sec). I primarily use body sensors to turn lights on/off, switches to toggle scenes (1-2-3-4-5 and long press for off). Sensors would not trigger when they clearly should, buttons work after 10-20 presses. Not fun at all.

I suspect there is some mess going on in the network. Routers, as i understand, must join directly to the coordinator? if so, would they push each other out if there are more than 15 devices connected to the coordinator? Would they conflict with xiaomi end devices (which i believe don't like switching hosts) for a 'slot' with the coordinator?

I'm looking for t-shooting suggestions as i am pretty much on the verge of throwing all this stuff into a bin.

I wouldn't want to have multiple native hubs (xiamoi proved somewhat unreliable + regional restriction on software are very annoying) as i suspect they'd conflict with each other on the radio.

I'm thinking to move most lights out of z2m to the native Ikea hub (just got my hands on it today -IKEA was out of stock), but ideally, i would like to understand the nature of the issues. I doubt it is HASS, seems like z2m or ZigBee network itself is going crazy. Ah, tried adding 2530 router - it seems to make things even worse.

there are some random errors in the logs, but nothing too conclusive. 5/24/2019, 5:18:25 AM - error: Zigbee publish to device '0xd0cf5efffe2c7525', genLevelCtrl - moveToLevelWithOnOff - {"level":255,"transtime":0} - {"manufSpec":0,"disDefaultRsp":0} - null failed with error Error: AF data request fails, status code: 233. MAC no ack. 5/24/2019, 5:18:25 AM - info: MQTT publish: topic 'zigbee2mqtt/bridge/log', payload '{"type":"zigbee_publish_error","message":"Error: AF data request fails, status code: 233. MAC no ack.","meta":{"entity":{"ID":"0xd0cf5efffe2c7525","type":"device","friendlyName":"0xd0cf5efffe2c7525"},"message":"{\"state\": \"ON\", \"brightness\": 255}"}}'

Koenkk commented 5 years ago

@rickkdotnet are you on the dev branch or zigbee2mqtt 1.4?

rickkdotnet commented 5 years ago

Dev, because of https://github.com/Koenkk/zigbee2mqtt/issues/1566#issuecomment-500145937 .

Oh, I just noticed I cannot switch to log_level 'trace' via MQTT ( PM Could not set log level to 'trace'. Allowed level: 'error,warn,info,debug' )

Koenkk commented 5 years ago

Ok great, i will fix setting log level via MQTT, for now set it via configuration.yaml

rickkdotnet commented 5 years ago

Ok great, i will fix setting log level via MQTT, for now set it via configuration.yaml

Yes, that would be nice. I'm logging to RAM and trace fills up the disk fast. Would be nice if I could switch it on demand, when an issue arises.

As for the firmware... I've lost a gledopto controller overnight, availability timeouts and error 205. I could still control it from the remote bound to the group and still received status changes at z2m.

https://pastebin.com/WkyyiCJL

z2m restart didn't fix it. powercycle didn't fix it. Deleted it from the database and tried to reset the controller (didn't work) and it came back somewhere in between.

Will keep an eye on it.

rickkdotnet commented 5 years ago

Ok, as for the https://github.com/Koenkk/zigbee2mqtt/issues/1566#issuecomment-500145937, I'm monitoring last_seen and noticed I lost a motion sensor. Incidentically the one right next to coordinator, which never gave any problems until now.

Per your request I (short) pressed the button and got this in the logs:

6/11/2019,` 7:21:55 PM - debug: Received zigbee message of type 'endDeviceAnnce' with data '"0x00158d000313e6e5"' of device 'lumi.sensor_motion.aq2' (0x00158d000313e6e5) of endpoint 1
6/11/2019, 7:21:55 PM - trace: Checking if 0x00158d000313e6e5 is pingable
6/11/2019, 7:21:55 PM - trace: 0x00158d000313e6e5 is pingable (false) not router or battery powered

And nothing happens. If I press the button again the log message is not repeated.

If I long press the button (i.e: initiate pairing) the message is repeated, but nothing happens after that.

After that I zigbee2mqtt/bridge/config/remove 'd the sensor, long pressed the button, the sensor joined again.

The last messages of the sensor were:

6/11/2019, 11:09:05 AM - debug: Received zigbee message of type 'attReport' with data '{"cid":"genBasic","data":{"65281":{"1":3005,"3":29,"4":17320,"5":17,"6":[0,8],"10":53509,"11":38,"100":0}}}' of device 'lumi.sensor_motion.aq2' (0x00158d000313e6e5) of endpoint 1
6/11/2019, 11:09:05 AM - trace: Checking if 0x00158d000313e6e5 is pingable
6/11/2019, 11:09:05 AM - trace: 0x00158d000313e6e5 is pingable (false) not router or battery powered
6/11/2019, 11:09:05 AM - debug: Received zigbee message of type 'devChange' with data '{"cid":"genBasic","data":{"65281":[null,3005,null,29,17320,17,[0,8],null,null,null,53509,38,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,0]}}' of device 'lumi.sensor_motion.aq2' (0x00158d000313e6e5) of endpoint 1
6/11/2019, 11:09:06 AM - trace: Checking if 0x00158d000313e6e5 is pingable
6/11/2019, 11:09:06 AM - trace: 0x00158d000313e6e5 is pingable (false) not router or battery powered
6/11/2019, 11:09:06 AM - info: MQTT publish: topic 'zigbee2mqtt/gang-ms', payload '{"illuminance":38,"linkquality":84,"last_seen":"2019-06-11T11:09:05+00:00","occupancy":true,"battery":100,"voltage":3005}'
6/11/2019, 11:09:33 AM - info: MQTT publish: topic 'zigbee2mqtt/gang-ms', payload '{"illuminance":38,"linkquality":84,"last_seen":"2019-06-11T11:09:32+00:00","occupancy":false,"battery":100,"voltage":3005}'

Hope this gives some clue as to what's happening.

Koenkk commented 5 years ago

endDeviceAnnce should mean that the device is still in the network but somehow is not sending any data anymore, can you do zigbee sniffing of this device while triggering occupancy?

rickkdotnet commented 5 years ago

I'll try. I haven't done any sniffing yet, but I just found a GBAN CC2530+RFX2401 stick in the mail so if that works I should be able to set something up I guess.

rickkdotnet commented 5 years ago

@Koenkk well, the new stick works but I can only find the CC2530_CC2591_20190515 firmware for it, which isn't stable (same behavior as the CC2531 and the stick has also already crashed once). I'll see if I can flash it as a sniffer, but if you build new firmware it would be nice if you can build one for this stick too, so I have some options. (Or perhaps I should try this myself... )

Koenkk commented 5 years ago

@rickkdotnet https://github.com/Koenkk/Z-Stack-firmware/blob/dev/coordinator/Z-Stack_Home_1.2/bin/source_routing/CC2530_CC2591_SOURCE_ROUTING_20190611.zip added it.

rickkdotnet commented 5 years ago

The CC2530_CC2591 locked up today on SOURCE_ROUTING_20190611. Didn't have time to analyze it yet. I've also received a second CC2531 which I'm trying to get going as a sniffer, but haven't had much luck yet on MacOS. Can't get whsniff to compile, tried a couple of other options but nothing work so far. Will try again soon.

paulianttila commented 5 years ago

@rickkdotnet, I have been using https://github.com/andrewdodd/ccsniffpiper with wireshark succesfully on mac.

Koenkk commented 5 years ago

I've released a new firmware which is basically the same as 20190315_MAX_STABILITY, hopes this fixes the issue: https://github.com/Koenkk/Z-Stack-firmware/blob/dev/coordinator/Z-Stack_Home_1.2/bin/source_routing/CC2531_SOURCE_ROUTING_20190619.zip

rickkdotnet commented 5 years ago

I was on a short holiday for the last couple of days. During that time 20190610 ran without a hitch (but without much activity as no one was home. Flashed 20190619 on new C2231 a few hours ago and accidentically triggered my 'alarm' which turns on/off all lights via home assistant groups.

Now, it seems my mesh has started a rebellion and it's killing coordinators. :)

At first, the coordinator was still running, led still on and log messages coming in, and most devices still responding to z2m commands. However, my tradfri on/off remotes only worked intermittently and the tradfri bulbs and transformers sometimes turn off briefly. So I tried restarting z2m and it failed with:

zigbee2mqtt | zigbee2mqtt:error 6/26/2019, 7:48:56 PM Failed to start zigbee2mqtt | {"message":"request timeout","stack":"Error: request timeout\n at CcZnp.<anonymous> (/app/node_modules/zigbee-herdsman/dist/cc-znp/index.js:209:22)\n at Object.onceWrapper (events.js:273:13)\n at CcZnp.emit (events.js:182:13)\n at Timeout.<anonymous> (/app/node_modules/zigbee-herdsman/dist/cc-znp/index.js:193:18)\n at ontimeout (timers.js:436:11)\n at tryOnTimeout (timers.js:300:5)\n at listOnTimeout (timers.js:263:5)\n at Timer.processTimers (timers.js:223:10)"}

After I reset the stick it aborted with: zigbee2mqtt | Error: ccznp has not been initialized yet zigbee2mqtt | at CcZnp.request (/app/node_modules/zigbee-herdsman/dist/cc-znp/index.js:134:15) zigbee2mqtt | at CcZnp.(anonymous function) [as zdoRequest] (/app/node_modules/zigbee-herdsman/dist/cc-znp/index.js:177:21) zigbee2mqtt | at Zdo._sendZdoRequestViaZnp (/app/node_modules/zigbee-herdsman/dist/lib/components/zdo.js:29:12) zigbee2mqtt | at Zdo._genericRequest (/app/node_modules/zigbee-herdsman/dist/lib/components/zdo.js:47:10) zigbee2mqtt | at Zdo._specialRequest (/app/node_modules/zigbee-herdsman/dist/lib/components/zdo.js:63:25) zigbee2mqtt | at Zdo.request (/app/node_modules/zigbee-herdsman/dist/lib/components/zdo.js:20:21) zigbee2mqtt | at Controller.request (/app/node_modules/zigbee-herdsman/dist/lib/components/controller.js:300:19) zigbee2mqtt | at req (/app/node_modules/zigbee-herdsman/dist/lib/components/query.js:119:47) zigbee2mqtt | at performReq (/app/node_modules/zigbee-herdsman/dist/lib/components/query.js:127:9) zigbee2mqtt | at /app/node_modules/zigbee-herdsman/dist/lib/components/query.js:133:17 zigbee2mqtt | npm ERR! code ELIFECYCLE

Put the other stick with 20190610 back in and everything seemed normal, but after a few minutes I get timeouts. On restarting the z2m, I get the same sequence of events as on the other stick..

I'll try reflashing the sticks and powercycling all the devices.

rickkdotnet commented 5 years ago

I reflashed the coordinator with CC2531_SOURCE_ROUTING_20190619, and powercycled most devices using the breaker and now at least the stick stays stable. However, I have at least two tradfri bulbs and one transformer which have issues. If they receive a command, they start turning off briefly every few seconds and z2m logs this: https://pastebin.com/LKtXSZB0 After a minute or so they stop for a while until they receive another command. They still do it when z2m is stopped after receiving a command from a remote in their group, so I guess the bulb firmware has an issue.

I'll try resetting them.

rickkdotnet commented 5 years ago

Resetting them and powercycling everything didn't really make a difference. What does seem to make a difference is deleting the troublesome tradfri devices from the database.db. If I delete them from the db, and restart z2m they rejoin quickly (interestingly enough, without resetting or re-pairing them) and after a few more flashes they seem to stop.

Also, I'm getting the 'ccznp has not been initialized yet' error on restart every once in a while. Next time I restart it's ok again.

I'm too tired now to do more troubleshooting, I'll leave it overnight and see what happens in the morning.

Any ideas as to what could've happened @Koenkk ?

rickkdotnet commented 5 years ago

Some of the tradfri stuff is still acting weird and I keep getting 'ccznp has not been initialized yet' every other restart. Also lost a lot of sensors and a gledopto is bouncing on/offline. Will try factory resetting the tradfri stuff.

I'm a bit shocked and puzzled by how seriously everything has broken by just sending a bunch of on/off commands.

Koenkk commented 5 years ago

@rickkdotnet

rickkdotnet commented 5 years ago

Around 45 devices now I guess. Yes, I use the reporting feature, my full config is at https://pastebin.com/s5vuMZXq

I've reset and re-paired all tradfri bulbs and transformers and finally things appear to stabilize.

Koenkk commented 5 years ago

Features like availability and reporting put a lot of extra stress on the CC2531, in combination with that amount of devices more powerful hardware might be needed (e.g. CC2652R, but this is WIP).

If you are still facing stability issues I would recommend to disable these features (for reporting you need to powercycle the tradfri bulbs once after disabling it). An alternative would be to run 2 zigbee2mqtt instances (but this has the down side of having 2 zigbee networks).

rickkdotnet commented 5 years ago

OK thanks. I'll try that, although I'm more inclined to move the Tradfri stuff to an Ikea gateway if the problems persist, these seem to be the most troublesome.

What could cause the 'ccznp has not been initialized yet' error every other restart? This has never happened to me in the past. Seems similar to issue https://github.com/Koenkk/zigbee2mqtt/issues/1492

I'll check if this still appears on to the old CC2531 on the old firmware.

Koenkk commented 5 years ago

Also similar to #1681, let's continue there.

rickkdotnet commented 5 years ago

Ok, I haven't seen the ccznp error yet in the past few days after reverting to my old stick (first on max_stability for a day, and on CC2531_SOURCE_ROUTING_20190619 for 2 days), so I had the feeling there was something with the hardware of my new stick, but seeing #1681 it just might be coincidence. I'm on the dev of Jun 27, 2019. I'll swap sticks tonight and see what happens. I'll use #1681 for any updates on the ccznp issue.

As for the rest of the issues.. : 1) Ikea devices going crazy: In hindsight it seemed like they were stuck in some kind of re-pairing or rejoining loop. I ended up resetting and re-pairing all of the Tradfri bulbs and even some remotes to get everything back working reliably. After that I needed to re-pair all the Xaiomi devices which were sitting behind those devices. The whole thing kept me busy for a couple of hours so I'm a bit hesitant to stress the network again.

2) Dissappearing Xaiomi sensors: I've now deliberately paired the Xaiomi's I lost in the Tradri meltdown to two CC2530 routers and so far I've lost none of them.

rickkdotnet commented 5 years ago

So far everything is completely stable, also on the new stick with the CC2531_SOURCE_ROUTING_20190619 firmware, so nothing to report yet.

I did disable availability monitoring in addition to adding the CC2530s.

Smiggel commented 5 years ago

To make sure: CC2531_SOURCE_ROUTING_20190619 is a router firmware, right? Little confused because it’s stored under the coordinator folder.

Koenkk commented 5 years ago

No its a coordinator firmware

djchen commented 5 years ago

I upgraded from CC2531 firmware 20190223 to 20190619 today and heres my experience: The default 20190619 firmware gave me a ton of issues. It will run ok for 10 minutes and then devices fail to respond. I could repair it by unplug/replug the stick and restart z2m but the same failure happens again shortly after. In addition to restarting it a few times and letting things settle, I also re-flashed the stick a 2nd time just for good measure. Nothing I tried resolved the problem.

Then I tried source routing 20190619 firmware and that one has been working without issue for several hours. I have 3 Sengled bulbs (end devices) in a group and I noticed with this firmware, they don't all turn on in the same exact moment anymore; there is like a 1-2 second mismatch in timing between them.

Notes: Running z2m 1.5.1 with an unmodified CC2531. I have 22 devices in my network and all but 6 or 7 devices are routers.

djchen commented 5 years ago

@Koenkk I spoke too soon. The network is now failing to respond. Seeing error messages like:

Error: AF data request fails, status code: 183. APS no ack.

Error: AF data request fails, status code: 205. No network route. Please confirm that the device has (re)joined the network.

Koenkk commented 5 years ago

@djchen It seems that the coordinator doesn't crash but there are some range/connection problems. I would recommend reading: https://www.zigbee2mqtt.io/how_tos/how_to_improve_network_range.html

rickkdotnet commented 5 years ago

Just wanted to chime in that since I've added the routers almost all strange behavior seems to have dissappeared (on CC2531_SOURCE_ROUTING_20190619).

The only thing left is the occasional occurence of: https://github.com/Koenkk/zigbee2mqtt/issues/1656 , but I haven't got round to debugging that yet.

mabi commented 5 years ago

Just an update about https://github.com/Koenkk/zigbee2mqtt/issues/1566#issuecomment-495653528 As I was also getting duplicate messages I suspected some kind of routing problem, I replaced the CC2530+CC2591 router with an OSRAM smart plug and now the network is stable. Maybe the CC2530 router or the network in general is not behaving well when the signal is not very good.

arjanvroege commented 5 years ago

As posted in issue #1656 I had the same issues. Switching to the Source Routing firmware and disabling reporting and network scan improved the stability of my network. Will monitor this in the coming days. But today was more stable than the past weeks together.. :-)

As in #1656, let me know if you need something from me.. (logging, etc)

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.