Closed hfellner closed 5 years ago
Please try with the dev branch (http://www.zigbee2mqtt.io/how_tos/how-to-switch-to-dev-branch.html) and latest dev firmware (https://github.com/Koenkk/Z-Stack-firmware/tree/dev/coordinator/CC2531/bin)
Thanks for the reply. I've tried your suggestion, but the behavior remains the same, although with this version the time until everything breaks down varies from minutes to up to 8 hours. When it stops working, the CC2531 dongle also has to be reset by briefly unplugging it from its USB socket, otherwise restarting zigbee2mqtt fails with this error:
zigbee2mqtt:info 3/4/2019, 8:04:59 PM Starting zigbee2mqtt version 1.1.1 (commit #2a3670d)
zigbee2mqtt:info 3/4/2019, 8:04:59 PM Starting zigbee-shepherd
zigbee2mqtt:info 3/4/2019, 8:05:06 PM Error while starting zigbee-shepherd, attempting to fix... (takes 60 seconds)
zigbee2mqtt:info 3/4/2019, 8:06:06 PM Starting zigbee-shepherd
zigbee2mqtt:error 3/4/2019, 8:06:12 PM Error while starting zigbee-shepherd!
zigbee2mqtt:error 3/4/2019, 8:06:12 PM Press the reset button on the stick (the one closest to the USB) and start again
zigbee2mqtt:error 3/4/2019, 8:06:12 PM Failed to start
{"message":"request timeout","stack":"Error: request timeout\n at CcZnp.
The dev branch is more verbose, but the error messages and behavior remain the same: random lights do not turn on or off, with either of the following error messages:
zigbee2mqtt:error 3/4/2019, 7:56:25 PM Zigbee publish to device '0x0017880100b29746', genOnOff - off - {} - {"manufSpec":0,"disDefaultRsp":0} - null failed with error Error: AF data request fails, status code: 205. No network route. Please confirm that the device has (re)joined the network.
OR
zigbee2mqtt:error 3/4/2019, 8:03:54 PM Zigbee publish to device '0x0017880100cccc27', genLevelCtrl - read - [{"attrId":0}] - {"manufSpec":0,"disDefaultRsp":0} - null failed with error Error: request timeout
After a while only the latter message shows up, whenever zigbee2mqtt tries to relay a message to a device, and messages from the zigbee network no longer show up in the log, and the dongle needs a physical reset, along with a restart of zigbee2mqtt.
All the lamps are powered on, and no lamp (router) is further than a few meters from the next one, so I am having a hard time believing that there would actually not be any route to these nodes.
I also noticed that in the dev branch the Hue dimmers no longer work, nor can they be re-paired, which I commented on in issue #1182 by JumpmanJunior.
Can you provide the log at the moment it crashes?
I'm not sure what you mean. It doesn't crash; it appears as if the CC2531 becomes busy in some loop, or at least unresponsive to zigbee2mqtt. There is no specific message when this happens. Just some lights no longer react while others in the same group still do, and the following error message starts showing up in the log, as previously quoted:
zigbee2mqtt:error 3/7/2019, 7:54:59 PM Zigbee publish to device '0x0017880100e881cd', genOnOff - off - {} - {"manufSpec":0,"disDefaultRsp":0} - null failed with error Error: Timed out after 30000 ms
followed by
zigbee2mqtt:info 3/7/2019, 7:54:59 PM MQTT publish: topic 'zigbee2mqtt/bridge/log', payload '{"type":"zigbee_publish_error","message":"Error: Timed out after 30000 ms","meta":{"entity":{"ID":"0x0017880100e881cd","type":"device","friendlyName":"0x0017880100e881cd"},"message":"{\"state\":\"OFF\"}"}}'
Then this error message becomes more and more frequent, and communication seems to break down: every single message zigbee2mqtt receives from the MQTT server and tries to send through the CC2531 stick produces this error message, but the service itself keeps running. Also, no more traffic from ZigBee sensors (like switches or motion/temperature sensors) shows in the log.
Sometimes even the LED of the CC2531 turns off when this happens. In all cases the stick then has to be hard reset, otherwise zigbee2mqtt cannot restart.
I've also noticed this seems to happen much more frequently when there is more traffic, such as all lamps in a big room being turned off or on at the same time, than when just individual lights are turned on or off.
Overall, to me this appears more like a firmware problem than a zigbee2mqtt problem. (In my own opinion, whatever happens, with proper firmware a situation where the stick just becomes unrecoverably stuck should be impossible. Otherwise every device is at risk of eventually getting stuck this way.)
Does this e.g. happen after sending a lot of MQTT commands at the same time? It's quite hard to debug this for me.
I am aware this must be a nightmare to locate, and I really appreciate your taking the time.
It appears as though the time when this happens is more or less random. It has happened overnight, with all devices being idle, but on most occasions it happens right at or during a request to switch on or off more than one light, though only a few. There do not have to be any other actions going on.
I've captured the log of when it happened again today, from 15 minutes before it happened until that time. I've omitted the part afterwards (as there is no more log output, ZigBee traffic no longer reaches zigbee2mqtt). You can view it here (pastebin).
Something I also noticed is that right after zigbee2mqtt restarts, the network map as reported by zigbee2mqtt starts out flat, with all nodes directly connected to the coordinator, but several of them are marked as offline (probably because of the limit on the number of devices). Then, about half an hour after that, the network map basically explodes, and it then shows multiple layers, with every device being connected to almost every other device. Graphviz took almost 10 seconds turning this into a PDF file and it looked mind-boggling. A short time after that it appears zigbee2mqtt could no longer provide the network map, as the request produced just an empty graph, still in Graphviz syntax, but without any nodes. Then, shortly after that, the problem happened again. (Alas, I do not have the PDF file from in between anymore, as it got overwritten by the blank PDF later, which I deleted because it was blank.)
As I am rather at a loss what's causing this, I am unsure about what I could possibly do to mitigate the problem. I've been toying with the idea of either replacing the CC2531 with a CC2530+CC2591, or with splitting up the network in multiple smaller ones with a coordinator each, tied together with a common MQTT server, but I am unsure which is better, if any.
Could you try increasing this https://github.com/Koenkk/zigbee2mqtt/blob/master/lib/util/zigbeeQueue.js#L2 to e.g. const delay = 500; This should reduce the traffic to the CC2531.
I did, but the problem still happened within a few minutes after starting it up.
I also tried lowering the value to 100 instead, but it did not make the problem occur faster. It did get stuck in a situation where every single MQTT message led to an immediate error 17, even if it was the only message published to the zigbee network in 30 seconds.
I've changed the value to 1000 now and will observe. This is unbearably slow, though: it takes almost 20 seconds for 9 lights in a room to turn on and set their color.
It took longer to occur with the 1000ms delay setting, but it still happened. I am now quite certain that the problem has nothing to do with messages being sent to the CC2531, as there was only a single light about to be turned on when it happened, with no other messages being sent at the time. I am also sure it has everything to do with incoming traffic from the ZigBee network. It also fits into the picture that the CC2531 is quite able to lock itself up and end up in need of a hardware reset all by itself, without zigbee2mqtt running.
I am pretty sure there is some buffering or handshake problem somewhere in the firmware code, where it can get stuck in a state it can never again get out of, or keeps cycling with the same error code. I don't have time at the moment but I am planning on taking a closer look at the CC2531 firmware code (and Z-Stack) to try and find the problem, as soon as I can find some spare time.
Please try with the following firmware: https://drive.google.com/open?id=1-xzI6b8umZFpki-pfaKdLgcPrUUlswe5
Relative to the 20190223 firmware, it has an increased memory heap at the cost of fewer devices directly connected to the coordinator (15 -> 5).
First off, I'd like to thank you for the hard work and efforts you put into this project!
I flashed the CC2531 with the suggested firmware image, and it's been running a few days now, still on the first run, and so far the CC2531 has not locked up. It did kick one of my Hue dimmers permanently off the network, though (even though it should support 5 devices, which should be enough: I only have 5 end devices, the 24 other nodes are routers), and the dimmer has since not been able to reconnect and no longer works. Also, out of every 5 lamps switched on or off, one still does not turn on/off with the others. So it's still not an actually usable configuration, but the situation has improved a lot over what it was before.
I did reduce the delay value to 50ms (to see if I can get bearable response times) but changed the maximum number of concurrent transactions to 2 (from 5), since whenever the CC2531 stick locked up before, there always seemed to be multiple messages about timed out transactions shortly before it happened. I am planning on doing another test with the original value of 5 concurrent transactions, but I want to see how long this will run before it gets stuck again, first.
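To make the interaction between these two settings concrete, here is a minimal simulation sketch. It is not zigbee2mqtt's actual queue code: the function name simulateQueue, its parameters, and the fixed per-transaction duration are all assumptions made for illustration. It computes when each queued message would start, given a minimum delay between starts and a cap on transactions in flight.

```javascript
// Hypothetical model only, not taken from zigbee2mqtt's zigbeeQueue.js.
// messages:      number of queued messages
// delayMs:       minimum time between two transaction starts
// maxConcurrent: maximum number of transactions in flight at once
// durationMs:    assumed time each transaction takes (constant, for simplicity)
function simulateQueue({ messages, delayMs, maxConcurrent, durationMs }) {
  const inFlight = []; // finish times of running transactions, oldest first
  const starts = [];   // computed start time of each message
  let nextSlot = 0;    // earliest start the delay setting allows

  for (let i = 0; i < messages; i++) {
    let start = nextSlot;
    // Forget transactions that have already finished by then.
    while (inFlight.length > 0 && inFlight[0] <= start) inFlight.shift();
    // If the concurrency cap is reached, wait for the oldest one to finish.
    if (inFlight.length >= maxConcurrent) {
      start = inFlight[0];
      while (inFlight.length > 0 && inFlight[0] <= start) inFlight.shift();
    }
    inFlight.push(start + durationMs);
    starts.push(start);
    nextSlot = start + delayMs;
  }

  const totalMs = starts.length > 0 ? starts[starts.length - 1] + durationMs : 0;
  return { starts, totalMs };
}

// Assumption: 9 lights, two messages each (on + color), fast 100 ms transactions.
console.log(simulateQueue({ messages: 18, delayMs: 1000, maxConcurrent: 5, durationMs: 100 }).totalMs);
// → 17100

// Assumption: every transaction hangs until a 30000 ms timeout.
console.log(simulateQueue({ messages: 4, delayMs: 50, maxConcurrent: 2, durationMs: 30000 }).starts);
// → [ 0, 50, 30000, 30050 ]
```

Under these assumptions, a 1000 ms delay alone accounts for roughly 17 seconds across 18 messages. With a limit of 2, two hanging transactions are enough to hold every later message back until the timeout expires.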
I do think this proves that the problem seems to be connected with buffered incoming messages. In my opinion heap overflows should be appropriately handled by the firmware, dropping packets instead of crashing, until heap space is available again. Otherwise no matter what size the heap is, eventually it will get stuck/crash in a moment of particularly dense zigbee network traffic.
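The drop-instead-of-crash policy described above can be sketched as a bounded buffer. This is only an illustration of the idea in JavaScript, not Z-Stack firmware code; the class name BoundedPacketBuffer and its methods are hypothetical.

```javascript
// Hypothetical sketch of a fixed-capacity buffer that rejects new packets
// when full instead of growing without bound.
class BoundedPacketBuffer {
  constructor(capacity) {
    this.capacity = capacity;
    this.items = [];
    this.dropped = 0; // count of packets rejected because the buffer was full
  }

  // Try to store a packet; returns false (and counts a drop) when full.
  offer(packet) {
    if (this.items.length >= this.capacity) {
      this.dropped += 1;
      return false;
    }
    this.items.push(packet);
    return true;
  }

  // Remove and return the oldest packet, or undefined when empty.
  poll() {
    return this.items.shift();
  }
}

const buffer = new BoundedPacketBuffer(2);
buffer.offer('a');
buffer.offer('b');
console.log(buffer.offer('c')); // → false, the buffer is full, so 'c' is dropped
buffer.poll();                  // frees one slot
console.log(buffer.offer('d')); // → true
```

A dropped packet would still need a retry at a higher layer, but the buffer itself recovers as soon as space is freed.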
Would it be possible to compile a firmware image for the CC2530+CC2591 with the same version as the one you suggested to use on the CC2531? I would like to see whether this improves the issue with the lights not turning on/off and response messages not being received.
Since it also supports TI's Z-Stack, do you think TI's CC2652R (ARM Cortex M4F) could possibly be used as a coordinator for zigbee2mqtt? It is 50% faster, and has 10 times more memory than the CC2530/2531, and as such should be able to handle many more devices while still having enough memory to buffer all incoming packets.
@hfellner I indeed have the same thought about this issue.
The CC2530+CC2591 max stability firmware is available here: https://github.com/Koenkk/Z-Stack-firmware/tree/dev/coordinator
About the CC2652R, I think the CC2538 is a better alternative as that one is available in USB variants on Aliexpress. I'm also planning to do some experiments with the CC2538.
@Koenkk I agree the CC2538 is probably easier to adopt for this purpose. It also can be programmed with CC controller, whereas the CC2652R needs a different programming adapter.
Regarding the issue, I have run various tests over the weekend, and found that the main reason for the improved stability I had been seeing during the week was reducing the number of concurrent transactions. When I set that value back to its original value of 5, things quickly started to go downhill again, with lots of rsp error 17 messages and timed out transactions, until the CC2531 locked up. So I did a test with the default firmware version and the same reduced number of concurrent transactions, and it, too, worked without problems.
I also did a test with a CC2530+CC2591, but was disappointed. I had expected to see better link quality values, but while with the CC2531 I get values between 30 and 90 depending on the location of the node, with the CC2530+CC2591 I was seeing values between 5 and 30.
@hfellner so if I understand correctly, with the concurrent transaction set to 2 you don't have any problems? (even with the default firmware?)
@Koenkk I would not go as far as saying that. With the concurrent transaction limit set to 2 it is just the only configuration that does not completely lock up after some (usually brief) time.
There are several problems remaining, however:
1) With the normal firmware random nodes keep disappearing from the network and "no route to device" errors are generated for messages to them. Minutes (or hours) later those nodes then work again but others, which worked before, don't. It is next to impossible to switch on or off all lights in a room because of this.
2) Additionally, with the max stability firmware some end devices are permanently kicked off the network and have no chance of reconnecting.
3) With the concurrent transaction limit set to 2, every second or third time switching lights on or off gets stuck in between for 30 seconds and then may or may not resume, or leaves the affected light in its current state.
With the exception of 1) these are problems created by workarounds to mitigate this issue, and none of them are problems I can accept in a long term solution.
@hfellner thanks for this analysis. I think most of the problems are due to the limitation of the CC253x. Yesterday TI contacted me to check if they can offer me a CC2652R, let's see what comes out of that. Will keep you up-to-date.
@Koenkk I arrived at the same conclusion. To work around the problem meanwhile I split the ZigBee network in two zigbee2mqtt instances on different channels and different network IDs. It appears these problems only appear when the network is larger than the coordinator can directly handle. I guess Z-Stack isn't able to properly handle larger networks. I don't like this workaround, as it complicates handling devices, but this way the problems above do not occur. Looking forward to zigbee2mqtt supporting coordinators able to handle larger networks!
@hfellner I've now got the CC2652R, and I assume that this will solve all large network issues. I will try to release this soon, please keep an eye on #211
@Koenkk I've been running this setup for a bit more than a month now, and so far the problems discussed above have never happened again, so it really seems to be related to the CC253x not being able to handle larger networks.
I've read through #211 and then #1429 but I do not seem to be able to gather from those the answer to these two questions:
I've recently got a CC2652R board from TI and would be eager to give it a try. If the CC2652R is a stable, suitable replacement, I guess this issue could be closed, as it's unlikely to be resolved on the CC2531.
Hi,
I seem to be having the same problem but with a single-device network. Consisting of a
And a cc2531 stick.
It works fine for a few hours, but then overnight the log shows no new updates from the sensor.
I bought an identical sensor and increased the network size to 2 sensors. The new sensor is placed much closer to the stick but has the same problem. Both sensor readings stopped fairly close in time to each other.
I am using the standard firmware and code as per the zigbee2mqtt website from about 2 weeks ago.
Help would be appreciated!
I have tried the latest firmware and it works for ~12 hours, then stops. Power cycling the USB stick, then restarting zigbee2mqtt makes it work again.
Feels a bit like a memory leak or something like that on the firmware side.
Any way I can help? It fails quite repeatedly so I was thinking about setting it running with a debugger, then see what's happening with the heap.
Which of the 2 versions are you running? In case of problems I would recommend the source routing one: https://github.com/Koenkk/Z-Stack-firmware/tree/master/coordinator/Z-Stack_Home_1.2/bin/source_routing
As mentioned earlier, CC2652R should fix this; we are currently making good progress regarding that: https://github.com/Koenkk/zigbee2mqtt/issues/1429#issuecomment-531919779
Could you try increasing this https://github.com/Koenkk/zigbee2mqtt/blob/master/lib/util/zigbeeQueue.js#L2 to e.g. const delay = 500; This should reduce the traffic to the CC2531.
Hi, @Koenkk and @hfellner.
I'm unable to find a CC2652R or a CC2531 in my country. I tried to reduce the traffic to the CC2531 but the link is broken. I have 32 Xiaomi Aqara Motion sensors, with 7 routers, and the coordinator goes down after the first boot.
Could you help me with this? I tried to plug and unplug, add more routers, use the edge version and the source routing firmware.
I'm getting desperate...
Thank you both in advance...
I installed zigbee2mqtt version 1.1.1 (commit #92d88b6), mosquitto 1.4.10-3+deb9u3 (from debian package repository), node js v10.15.1 and npm 6.8.0 on a Raspberry Pi 3, using firmware version 20190109 for the CC2531 stick. The install went without problems, and I was able to pair all my lights and switches, following the instructions in the zigbee2mqtt documentation, and set up a simple flow in Node-RED to control my lights from the switches.
However, after a few (less than 10) times switching lights on and off, first some of the lights don't turn on or off anymore (they do again on the next time), then the entire system stops responding. If I restart zigbee2mqtt, things will work again, but just for a few minutes.
I have 24 Philips Hue lights (5 x LCT001, 13 x LCT003, 4 x LLC011, 2 x LCT012), 2 Xiaomi Aqara battery 2-way wall switches, 2 Hue Dimmers (RWL021) and a Hue motion sensor (SML001).
With the exception of the Xiaomi switches all of these devices previously were connected to a Hue Bridge, along with 5 Hue Tap switches, and worked fine.
zigbee2mqtt prints about 15 messages such as the following to the log: zigbee2mqtt:error 3/3/2019, 1:30:54 PM Zigbee publish to device '(...)', genLevelCtrl - moveToLevelWithOnOff - {"level":254,"transtime":0} - {"manufSpec":0,"disDefaultRsp":0} - null failed with error Error: request timeout
It then stops accepting ZigBee messages entirely, until it is restarted. The MQTT broker keeps running fine, with all clients connected to it, but zigbee2mqtt no longer relays ZigBee messages to the MQTT broker. It still reads messages from the broker, and attempts to publish them, but fails with the same message as above.
Looking through the issues, it seems similar issues had been opened in the past, all of which appear to have been "solved" by people abandoning the approach.
Is there something I am doing fundamentally wrong? I would really like to somehow get zigbee2mqtt working in a stable, reliable manner.