Closed henriz closed 5 years ago
It's open-source, meaning that everybody can join.
However, the above-mentioned problems mostly occur due to bad network stability.
About the bulb on/off not working, which bulb is this? This has not been reported before.
Dear Koen,
Thanks. I'm not the kind to only complain and do nothing. I want to help, so I will.
Networking is not the issue here. There are two networks at play: TCP/IP and Zigbee. As far as TCP/IP goes, the MQTT broker is running on the same device as zigbee2mqtt, and the messages get to the MQTT broker just fine. So that excludes TCP/IP networking issues.
I monitor the Zigbee network with a secondary device, and the commands are simply not sent. Also, before switching to a Pi with zigbee2mqtt and so forth, I used a Hue hub, which had been running this network for about 2 years without ever having any hiccup. So that also excludes Zigbee networking issues.
Also, note that eventually all commands ARE executed and that once this happens, it processes ALL the commands at once, in sequence, for all lights. So it's hung up on something. There's a device or piece of code in there, which halts execution for like 20-40 seconds. This is also why I think this app needs threading. Is it waiting for an ack somewhere?
About the on/off issue: this happens on startup. Every time I restart zigbee2mqtt, there are several devices to which I have to send a 'set level' command first, before the 'on/off' command works.
I'm pondering ditching domoticz and starting my own development for it, as I also want to use the new group functionality (is the scenes cluster also supported in zigbee-shepherd?) and this would entail quite some rework in domoticz.
As far as joining in goes, maybe I asked it wrong, but I'd like to be a more structural member of the development team and discuss with you what to do next, which areas to work on and so forth, instead of being a separate island developing and then doing pull requests. Basically you could say I want to be your employee :P
I really think you have a winner here, but it needs to get to v1.0 soon. Don't take this as criticism; I totally realize how these things work and have a huge appreciation for the work you have put in already. Without your work, I wouldn't have this working now :)
Is this the issue I'm having?
cat configuration.yaml | grep retain | wc
34 68 612
So I have 34 devices. And all hue bulbs are routers.
My setup suffers from exactly the same issue. It works fine until I send a lot of subsequent commands (for example dimming down and up again). It then doesn't respond for about a minute, and then executes all the commands at once.
I might have found the culprit, but haven't confirmed this yet by looking at and/or changing the code.
As I was on the track of network size, I first started looking at the firmware on the cc2531. The seller (marktplaats) told me he flashed the latest firmware version, so I didn't do anything in that respect. But this issue came along, so I flashed the firmware, the 'normal' firmware from Koen's repository.
Now I get log messages which I have never had before:
zigbee2mqtt:error 1/12/2019, 7:24:52 PM Cannot get the Node Descriptor of the Device: 0x001788010211444a (Error: Timed out after 10000 ms)
The device in question is a Philips Hue Motion sensor, and I get these messages every now and then for all of them (I have 4 of 'm in my network).
Might it be that that timeout is what is clogging the system?
The motion sensors ARE working fine other than that. Since I reflashed I had to re-bind all devices, so I haven't tested it at length yet.
Mark, do you have Hue motion sensors?
My zigbee setup contains only Philips devices: a mixture of different bulbs (all of the recent types and with the latest firmware), Philips dimmer switches and Philips motion sensors.
Mark, do you have Hue motion sensors?
No motion sensors, just Hue Dimmers and Hue lights.
Well then there is a similarity in our setup, as in we both use hue dimmers and lights.
What do you use to control them? Domoticz or ie. hass.io?
I use Home Assistant.
Do you also have 'undefined' device id messages for the controller in the log?
Hi,
I just found this issue after searching for the "PM Cannot get the Node Descriptor of the Device: 0x0......." error.
I have the same issue here with a Xiaomi Aqara double key wireless wall switch (WXKG02LM).
I am using only Xiaomi devices (Xiaomi MiJia wireless switch, Xiaomi Aqara wireless wall switch, Xiaomi MiJia door & window contact sensor, Xiaomi Aqara door & window contact sensor). Running zigbee2mqtt on a RPi zero w. The MQTT server is on a linux machine that is also running HomeAssistant.
If there is any way I can help, let me know (but sadly I am not a good programmer)
EDIT: I am also getting a bunch of these messages when pairing devices, is that normal?
PM Message without device!
Those 'PM Message without device!' messages are normal; they are a side effect of devices pairing.
@Koenkk I have now confirmed this is an issue in zigbee2mqtt and not a networking issue.
If I run a subscription to all mosquitto channels alongside the zigbee2mqtt output, I can see mosquitto is getting the messages, and that zigbee2mqtt processes them. That is, until it freezes: mosquitto then still shows the messages immediately, but zigbee2mqtt does NOT.
As mosquitto, zigbee2mqtt and the 'all channels subscription' are all running on the same server, there can be only one conclusion. Something is clogging zigbee2mqtt.
Here you can see this: https://youtu.be/tMWPtOGWnqE
The queue hangup can easily be solved by changing this line https://github.com/Koenkk/zigbee2mqtt/blob/master/lib/extension/devicePublish.js#L24 to e.g. this.queue.concurrency = 100;
I implemented this because when sending commands to multiple devices (e.g. a group of bulbs) at exactly the same time, some bulbs failed to respond. This can of course now be solved with group support.
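To make the effect of a concurrency setting concrete, here is a minimal sketch of a job queue with a concurrency limit. It is not the queue implementation zigbee2mqtt uses; `TinyQueue`, `sleep` and `runDemo` are hypothetical names introduced only for illustration. With a limit of 1, jobs run strictly one after another, so a single slow job holds up everything behind it; with a high limit, all queued jobs start at once.

```javascript
// Illustrative only: a tiny promise-based queue with a concurrency limit.
class TinyQueue {
    constructor(concurrency) {
        this.concurrency = concurrency;
        this.pending = [];
        this.running = 0;
        this.maxRunning = 0; // highest number of jobs seen in flight at once
    }

    // `job` must be a function that returns a promise.
    push(job) {
        return new Promise((resolve, reject) => {
            this.pending.push({job, resolve, reject});
            this.next();
        });
    }

    // Start as many pending jobs as the concurrency limit allows.
    next() {
        while (this.running < this.concurrency && this.pending.length > 0) {
            const {job, resolve, reject} = this.pending.shift();
            this.running++;
            this.maxRunning = Math.max(this.maxRunning, this.running);
            job().then(resolve, reject).finally(() => {
                this.running--;
                this.next();
            });
        }
    }
}

const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

// Queue four short jobs and report how many ran at the same time.
async function runDemo(concurrency) {
    const queue = new TinyQueue(concurrency);
    const jobs = [1, 2, 3, 4].map((n) => queue.push(() => sleep(10).then(() => n)));
    const results = await Promise.all(jobs);
    return {results, maxRunning: queue.maxRunning};
}

runDemo(1).then((r) => console.log('concurrency 1:', r));
runDemo(100).then((r) => console.log('concurrency 100:', r));
```

In this sketch the results are the same in both runs; only the number of jobs in flight differs. Whether a higher limit helps in practice depends on how many simultaneous commands the coordinator and the devices can handle.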
@henriz can you make the above mentioned change and check if you have any improvement?
Well.. diving into the code, I directly see an issue. In controller.js, the onMQTTMessage function:
const results = this.extensions
    .filter((e) => e.onMQTTMessage)
    .map((e) => e.onMQTTMessage(topic, message));
Basically what this does: if an MQTT message is received, it is forwarded to ALL device objects, and the result of the call is mapped.
So in my case, I have 34 devices, this means 34 objects are getting this message, processing it, and returning the result.
The JavaScript .filter function is incredibly slow. Also, suppose the 2nd extension in the extensions array is the one the message was intended for; it will still keep going until the 34th.
This can easily be optimized. I'll see if I can improve on it.
While I'm typing this, I see your answer, Koenkk. I'll try that right now!
@Koenkk The suggested change does not change the behaviour in any way, unfortunately.
In fact, it makes it a lot worse :(
Can you try removing the whole queueing mechanism from devicePublish?
const results = this.extensions
    .filter((e) => e.onMQTTMessage)
    .map((e) => e.onMQTTMessage(topic, message));
This will not call devices; it will just call extensions, which can decide whether to handle the message.
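A small sketch of what the snippet above does, under the assumption that extensions are plain objects and that only some of them define an `onMQTTMessage` handler. The extension names and the topic strings are made up for the example; they are not taken from the zigbee2mqtt source.

```javascript
// Hypothetical extensions; only two of the three implement onMQTTMessage.
const extensions = [
    {name: 'publishExample', onMQTTMessage: (topic, message) => topic.endsWith('/set')},
    {name: 'receiveOnlyExample'}, // no handler, so filter() drops it
    {name: 'bridgeExample', onMQTTMessage: (topic, message) => topic.includes('/bridge/')},
];

function dispatch(topic, message) {
    // Same shape as the controller snippet: keep the extensions that define
    // the handler, call each one once, and collect the return values.
    return extensions
        .filter((e) => e.onMQTTMessage)
        .map((e) => e.onMQTTMessage(topic, message));
}

// One result per extension with a handler, regardless of how many devices exist.
console.log(dispatch('zigbee2mqtt/kitchen_bulb/set', '{"state":"ON"}'));
```

The number of calls scales with the number of extensions, not with the number of paired devices, so a 34-device network does not mean 34 handler calls per message.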
I'll try. In the meantime I'm starting to think this is more of a zigbee-shepherd issue.
If I remove this part from zigbee.js:
if (cmdType === 'functional' && entity.functional) {
    entity.functional(cid, cmd, zclData, cfg, callback_);
} else if (cmdType === 'foundation' && entity.foundation) {
    entity.foundation(cid, cmd, zclData, cfg, callback_);
} else {
    logger.error(`Unknown zigbee publish cmdType ${cmdType}`);
}
It functions fine and keeps on functioning. Well, it doesn't function, as nothing actually gets sent to the lights, but you get what I mean.
Also, this new firmware has worsened the situation a lot. So right now I'm battling to keep it working at all, as opposed to testing this issue.
I get a lot of:
zigbee2mqtt:error 1/13/2019, 8:01:11 PM Cannot get the Node Descriptor of the Device: 0x0017880102030d92 (Error: Timed out after 10000 ms)
Especially on motion sensors. I flashed this firmware: https://github.com/Koenkk/Z-Stack-firmware/tree/master/coordinator/CC2531/bin
Using an Arduino Due and the alternate flashing method. If all goes well my 'real' flasher will arrive this week.
What firmware do you recommend?
This is my network, with a few devices missing:
Philips Hue White Single bulb B22 (Router)
Philips Hue white and color ambiance E26/E27/E14 (Router)
Philips Hue white and color ambiance E26/E27/E14 (Router)
Philips Hue white and color ambiance E26/E27/E14 (Router)
Philips Hue white ambiance E14 (Router)
Philips Hue white ambiance E14 (Router)
Philips Hue white A60 bulb E27 (Router)
Philips Hue white ambiance E14 (Router)
Philips Hue White Single bulb B22 (Router)
Philips Hue White Single bulb B22 (Router)
Philips Hue white ambiance E14 (Router)
Philips Hue White Single bulb B22 (Router)
Philips Hue white ambiance E14 (Router)
Philips Hue White Single bulb B22 (Router)
Philips Hue white ambiance E26/E27 (Router)
Philips Hue white ambiance E14 (Router)
Philips Hue white ambiance E14 (Router)
Philips Hue white ambiance E26/E27 (Router)
Philips Hue white ambiance E14 (Router)
Philips Hue White Single bulb B22 (Router)
Philips Hue dimmer switch (EndDevice)
Philips Hue dimmer switch (EndDevice)
Philips Hue dimmer switch (EndDevice)
Philips Hue dimmer switch (EndDevice)
Philips Hue motion sensor (EndDevice)
Philips Hue motion sensor (EndDevice)
Philips Hue motion sensor (EndDevice)
Philips Hue motion sensor (EndDevice)
Philips Hue White Single bulb B22 (Router)
Philips Hue White Single bulb B22 (Router)
Philips Hue dimmer switch (EndDevice)
Philips Hue dimmer switch (EndDevice)
Now that I look at it myself, it's the 'end devices' which make the real issues.
@henriz I'm currently running this firmware: https://drive.google.com/open?id=1Iu-vVJw_d0bKeQKQhcZbkpQCfGB3z-Bi
It's almost the same as the current z-stack-firmware/dev but with increased XDATA.
Thanks, I'm going to flash that now. Already cringing at having to rebind everything again. They should find a solution for that.
There are too many variables now to properly debug the original complaint. In general, I see issues with the devices which are 'end devices', so the dimmerswitches and the motion sensors. The dimmerswitches often produce an 'unable to reconfigure', the motion sensors 'unable to get device descriptors'.
However, those are separate issues. The clogging still happens, but the firmware I have now, is borked somehow. In fact, even after a fresh reboot or power down/up it won't start half the time, and the led on the device switches off.
So I'll flash 'your' firmware and then see if it has improved, if so, I'll test what you asked me to do earlier.
I also wish there was an easier solution to prevent rebinding when upgrading the firmware. I'm shielding the stick and stopping every router to make sure it doesn't detect the existing panid/chanid. But it's far from convenient :)
All the issues you're talking about, have they appeared with the latest version? I'm still using dev from the end of October, and I'm a bit hesitant to upgrade.
About the firmware, I'm also a bit lost. I'm running an "optimised for larger network" cc2530+cc2591 firmware. Does the latest dev one with group support also have the optimisations? (increased xdata, I guess)
So far, all newer versions have been a total disaster for me. Aside from the queue appearing to be clogging, I had it working fine. Since starting to upgrade, I haven't had it working fine again.
Often NPM won't start, bound devices aren't reached, and so forth.
I get a lot of this at startup:
zigbee2mqtt:info 1/14/2019, 3:25:40 PM Error while starting zigbee-shepherd, attemping to fix... (takes 60 seconds)
/opt/zigbee2mqtt/node_modules/q/q.js:155
throw e;
^
TypeError: Cannot read property 'close' of undefined
at shepherd.start (/opt/zigbee2mqtt/lib/zigbee.js:45:47)
at /opt/zigbee2mqtt/node_modules/q/q.js:2059:17
at runSingle (/opt/zigbee2mqtt/node_modules/q/q.js:137:13)
at flush (/opt/zigbee2mqtt/node_modules/q/q.js:125:13)
at process._tickCallback (internal/process/next_tick.js:61:11)
@henriz this can be fixed by unplugging and replugging the stick, pressing the button on the cc2531 (close to the USB connector) and then starting zigbee2mqtt.
I've just upgraded (silly me) and I also had the "TypeError: Cannot read property 'close' of undefined" error. Having to unplug/plug/press button to resolve that kind of issue is awful. With my cc2530 I just had to restart zigbee2mqtt.
@henriz what revisions were running ok-ish before?
Pressing that button doesn't do anything for me. The only way to get past it is to reboot and hope it works then. After booting I have to wait until the light on the stick switches off before starting zigbee2mqtt.
I can't get it to start at all anymore. I'm starting to feel the end devices (dimmerswitches/motion sensors) are wreaking havoc on the firmware in the device.
@lolorc: I have already beaten myself up for not noting the firmware version which was on the stick when I got it, as that was working reasonably well. Upgrading seems to have opened a whole can of worms for me. With no devices bound, it worked fine, I was able to bind every device, but as soon as I added the dimmerswitches and motion sensors, it simply stopped working at all.
Even if I now run the zigbee-shepherd simple sample script, it doesn't work.
So even this script:
var ZShepherd = require('zigbee-shepherd');
var shepherd = new ZShepherd('/dev/ttyACMZigbee'); // create a ZigBee server
shepherd.on('ready', function () {
    console.log('Server is ready.');
    // allow devices to join the network within 60 secs
    shepherd.permitJoin(60, function (err) {
        if (err)
            console.log(err);
    });
});
shepherd.start(function (err) { // start the server
    if (err)
        console.log(err);
});
Won't start. What I think happens is that the device, when powered on, takes its role and starts communicating. Somehow the motion sensors or dimmerswitches are causing it to time out.
This might also be the underlying issue that started this whole thread in the first place.
But it's all guesses.
@henriz I had the exact same issue as you. Sensors, switches and bulbs on/off functions worked as expected, but dimming lights was a pain. The 'network clogs', and from 20sec to 1 minute later, all the previous commands act in sequence. I ended up changing all the bulbs (including Ikea ones) to a Hue bridge, but still hoping to see them working in zigbee2mqtt one day
Well I'm a persistent bugger and a developer, so I won't stop until this works.. it has become a personal matter now! >:)
I do have a Hue bridge, but won't cave in. It's them or me :)
Well.. I finally got my zigbee network working again. The only way to accomplish that was to re-flash the stick. With the same firmware.
To be on the safe side, I also re-checked out the dev branch of this project.
I removed the batteries from the dimmer switches and motion sensors, to prevent them from rejoining.
I have a suspicion the issue is with endpoints. The switches and motion sensors are endpoints, not routers. So now I'm first going to test without, I can simulate the button presses to the dimmer switches.
As a sidenote. I first tried flashing the firmware of the zigbee-shepherd project. This always fails at 46% of the first big part. Then I reflashed the firmware koenkk linked here. I could flash this without issue, however, I noticed that on startup of zigbee2mqtt there was a red led flashing on the stick. I have never seen that led before, it has never done anything.
This also makes it possible that it's an issue with the alternative flashing method. My real flasher is still on its way from China, so I use the alternate flashing method. As I understand it, there are some values set in the flashing software which might not be part of the flash itself. The zigbee-shepherd firmware has like 10 bytes it writes before big chunks, maybe those contain the proper values?
In any case, I got it running again, so I can now continue testing.
Well, it's not the endpoints, but I have confirmed it's clogging.
When running debug output, in a series of commands, eventually you get this:
serialport:unixRead Starting read +2m
serialport:unixRead Finished read 10 bytes +1ms
serialport:main binding.read finished +2m
cc-znp { sof: 254,
cc-znp len: 5,
cc-znp type: 'AREQ',
cc-znp subsys: 'ZDO',
cc-znp cmd: 'srcRtgInd',
cc-znp payload: { dstaddr: 47929, relaycount: 1, relaylist: <Buffer a3 3b> },
cc-znp fcs: 159,
cc-znp csum: 159 } +2m
serialport:main _read reading +6ms
serialport:bindings read +7ms
serialport:unixRead Starting read +6ms
cc-znp:AREQ <-- ZDO:srcRtgInd, { dstaddr: 47929, relaycount: 1, relaylist: <Buffer a3 3b> } +2ms
zigbee-shepherd:msgHdlr IND <-- ZDO:srcRtgInd +2m
So in some instances, it halts reading from the serial port, in this case even for 2 minutes. This is unrelated to the actual command which is being sent. The actual reading of the port takes 1ms, but finishing off that read takes 2 minutes.
@henriz nice finding, perhaps it's indeed a zigbee-shepherd issue.
Nice find, but I assumed zigbee-shepherd isn't updated anymore? So it's not easy to fix at all..?
@trekker25 that means we have to update it :)
I'm not convinced that the issue is with zigbee-shepherd yet. However, I have confirmed there IS an issue. If I understand it correctly, those timing messages (+1ms, +2s, etc) are the time between the previous command and the current, so it might still be a bug in zigbee2mqtt.
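To illustrate how such `+1ms` / `+2m` suffixes can arise, here is a minimal imitation of a namespaced debug logger that appends the time elapsed since the previous message in the same namespace. This is a sketch of the general technique; that the real debug output in this thread is computed exactly this way is an assumption. `createLogger` and the injected clock are hypothetical names used only for the example.

```javascript
// Returns a logger that appends the milliseconds since the previous message
// logged through the same logger. `now` is injectable so the demo can use a
// fake clock instead of real time.
function createLogger(namespace, now = Date.now) {
    let previous = null;
    return (message) => {
        const current = now();
        const delta = previous === null ? 0 : current - previous;
        previous = current;
        const line = `${namespace} ${message} +${delta}ms`;
        console.log(line);
        return line;
    };
}

// Fake clock: the third message arrives two minutes after the second.
const fakeTimes = [0, 1, 120001];
let tick = 0;
const log = createLogger('serialport:unixRead', () => fakeTimes[tick++]);

log('Starting read');          // +0ms
log('Finished read 10 bytes'); // +1ms
log('Starting read');          // +120000ms
```

Under this model, a large suffix on a line says that nothing was logged in that namespace for that long. It does not by itself say which component was busy or blocked during the gap.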
It's quite annoying to debug, as it's node.js and zigbee-shepherd uses lots of promises. Personally I wouldn't have touched node for this and would have written it in C/C++, but that's personal preference.
Whether or not it's updated anymore isn't that relevant, as it's open source. We can update it ourselves. Provided we find the bug. Which I haven't, yet.
@henriz
My zigbee network decided to give up with this same error overnight, and now I can't get my network to reboot. Hopefully re-flashing the stick will fix it for me.
Best of luck trying to trace this down! You're not the only one experiencing this!
So just to add some more information to the puzzle. I re-flashed the CC2530, and of course lost all my pairings. However, I did try to wrap the stick in foil, as suggested to try and keep everything paired.
Now I've noticed that if I restart zigbee2mqtt, I will get the "Cannot read property 'close' of undefined" error unless I have the stick wrapped in foil on boot. Once booted I can remove the foil, and then start re-pairing my devices.
Very strange
When reading your comments I ask myself if the CC2530 is the right device for zigbee2mqtt in the long run or if it wouldn't be possible to use the ConBee stick in the future.
It's not the device which is causing all of this. It's a mixture of firmware and software.
@philhawthorne how do you flash? with the programmer or using the alternate flashing method?
I used the programmer to flash it
Hi, my test setup is very small; I've only got a "Philips Hue white A60 Bulb E27" and a CC2531 attached to a Raspi and I'm also experiencing the described queueing problem.
I've got my "production system" Hue Bridge running in parallel for my other bulbs. Could the Hue be interfering with the CC2531 even when no lights are being switched? Does zigbee have "keep alive" packages or something like that?
Can you maybe point me to another project using the CC2531, so I could try whether the problem exists there as well? Something also based on zigbee-shepherd? Or something that's not based on it? This could help in identifying the problematic component.
It's a tough issue to troubleshoot. I've built in log messages for about every function, but it somehow just halts at times. Even after having been working on this for a few days, I still haven't found WHERE or WHAT it's actually halting on.
From the looks of it, it's the device itself which is getting backed up, so it appears to be a firmware issue, not an issue with zigbee2mqtt or zigbee-shepherd.
Jofagi: does your 'production' setup include dimmer switches and/or motion sensors? Those do strange stuff. For example, the motion sensors can all of a sudden be bound, without any binding messages and without resetting them. Dimmerswitches often fail to get configured.
I think the bottom line is that this device (the USB dongle) simply isn't fit to be a controller in a larger setup. It lacks memory and speed of communication. It's a hack.
@henriz I've only got actors such as bulbs and stripes, but no sensors or switches in my setup.
I'm not convinced that the dongle itself is just not fit. After all, there appear to be quite a lot of people using it. I came to this project from an article in the German c't magazine. I assume they had good experience with it; otherwise they would have at least mentioned the problems in the article.
And since I'm already having problems with a single bulb, something really weird must be going on...
Another wild guess: Could it be a power issue? I've got the stick attached to a raspi, which also runs some other services and has a HDD attached. I assumed if there was a problem with the power supply, the raspi would simply reboot. But since it doesn't do that; I didn't worry. However, if the low-power behavior is different, for example by cutting the dongle off in order to save the rest of the system, could this result in such behavior?
Do you also run it on a raspi that's possibly under-powered?
@Jofagi
I have mine on a USB port of a QNAP. I'm running Hassio on an Ubuntu VM and can confirm that the issue remains (coming from the raspi), so it's probably not a power issue with the CC. Since I moved all the bulbs from the zigbee2mqtt network to the Philips Hue, I have no complaints. The temp sensors are not a good way to check, but all my switches and motion sensors act immediately and I don't have any more clogs in the zigbee2mqtt network that I'm aware of. I only had clogs when using the dimming function on bulbs (both Philips and Ikea).
I expect that at least some of the above mentioned clogging issues should be fixed in the dev branch.
That would be very cool. I just haven't had the time to really dive into it yet. Gonna test it immediately.
This has indeed, to some degree, fixed the clogging issue. The application is now not hanging on a timeout, as it was before, so it keeps on processing commands, where before this change it did not.
However, in practice stuff still clogs up.
What you have fixed now is that callback being synchronous and halting the process, however, the asynchronous callback still fails so if you purely look at the lights and what they're doing, it still clogs.
In some respects, this fix has made it worse, as it now no longer finishes your set of commands once it resumes.
What seems to happen; after a series of commands to the same device, the device starts timing out. At that point ALL the devices timeout. In the old situation, the code would pause on the first timeout, and after 10 or 30 seconds, continue. In that time period all devices have had the chance to catch up, and it will resume your 'programmed' commands (the queue).
What happens now is that the timeout on the first device doesn't hold back the process, so you get timeouts on all the commands. By the time the devices are reachable again, the entire queue has been sent and timed out.
So this has fixed a symptom but not taken away the root cause, which is devices starting to time out if they get a bunch of commands sequentially.
It appears now after whatever is clogging has happened, the serial link is disconnected, which is causing it not to function anymore. I waited about 15 minutes but none of the devices would respond. Then when I tried stopping zigbee2mqtt I got this, on an endless loop. The only way to get past it, was to log in on another console and kill the node process.
serialport:main close attempted, but port is not open +0ms
serialport:main close attempted, but port is not open +0ms
serialport:main close attempted, but port is not open +0ms
serialport:main close attempted, but port is not open +0ms
serialport:main close attempted, but port is not open +1ms
[..] and so on
Would a queue per device solve this problem where subsequent commands to the same device are delayed for say 100msec to avoid overloading the device?
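A minimal sketch of what such a per-device queue could look like, assuming each command is a function that returns a promise. `PerDeviceQueue`, `enqueue` and the device names are hypothetical; this is not existing zigbee2mqtt code, and the 100 ms spacing is just the figure from the question above.

```javascript
const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

// Commands for the same device run one at a time with a pause in between;
// commands for different devices do not wait for each other.
class PerDeviceQueue {
    constructor(delayMs) {
        this.delayMs = delayMs;
        this.tails = new Map(); // device ID -> promise that settles when the next command may start
    }

    // `send` is a function returning a promise (stand-in for the real transport call).
    enqueue(deviceId, send) {
        const tail = this.tails.get(deviceId) || Promise.resolve();
        const result = tail.then(() => send());
        // The next command for this device waits for this one (success or
        // failure) plus the spacing delay.
        this.tails.set(deviceId, result.catch(() => {}).then(() => sleep(this.delayMs)));
        return result;
    }
}

// Demo: two commands for bulb_a are spaced 100 ms apart; bulb_b is not held up.
const sent = [];
const record = (label) => () => {
    sent.push(label);
    return Promise.resolve(label);
};
const deviceQueue = new PerDeviceQueue(100);
Promise.all([
    deviceQueue.enqueue('bulb_a', record('a1')),
    deviceQueue.enqueue('bulb_a', record('a2')),
    deviceQueue.enqueue('bulb_b', record('b1')),
]).then(() => console.log(sent)); // a1 and b1 are sent first, a2 after the delay
```

The sketch never removes entries from the map and does not drop stale commands, so a real implementation would need more care. It only shows the spacing idea.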
@henriz it seems that in your case the CC2531 starts hanging, where the clogging issues I've seen were due to ACK errors.
It's the same thing. If there's no ACK, it will timeout.
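The relationship between a missing ACK and a timeout can be sketched as follows. This is a generic promise-timeout pattern, not code from zigbee-shepherd; `withTimeout` and `sendCommand` are hypothetical helpers, and the error text merely mirrors the "Timed out after 10000 ms" wording seen in the logs earlier in this thread.

```javascript
// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms) {
    let timer;
    const timeout = new Promise((_, reject) => {
        timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
    });
    // Whichever settles first wins; always clear the timer afterwards.
    return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical stand-in for a device: resolves with 'ACK', or never answers.
function sendCommand(deviceResponds) {
    return new Promise((resolve) => {
        if (deviceResponds) setTimeout(() => resolve('ACK'), 5);
    });
}

withTimeout(sendCommand(true), 100).then((reply) => console.log(reply));
withTimeout(sendCommand(false), 100).catch((err) => console.log(err.message));
```

If the caller waits for each command to settle before sending the next, one silent device costs the full timeout for everything queued behind it. If the caller does not wait, later commands are unaffected by that one device.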
@henriz that doesn't seem to be true in all situations, see the example where I successfully control kitchen_kettle_plug while I pulled the power plug of the living_room_standing_lamp: https://hastebin.com/ecuxefudoc.sql
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hello,
I'd like to get in to the development of this plugin, to help improve it.
In its current state, in my opinion, it's unreliable. It works for about half a day, and then starts acting up. And when it works, there's something wrong in the queueing mechanism. In situations with multiple commands, it somehow seems to get clogged with commands.
This results in it doing nothing for a certain amount of time, and then all of a sudden processing the queue of commands. I understand there has been a change recently for situations with multiple commands, I think something went wrong there?
I'm using hue dimmer switches which are controlling multiple devices. I haven't used the new grouping functionality for that yet, so one button press results in multiple commands. These commands are being sent, so on the domoticz/nodered side of things, it's working fine. I can see the commands being sent.
Set level always works fine. So the dimmer functionality always works fine, but as soon as I want to change scene, and do that multiple times in a row, it clogs up, and this can easily take like 40 seconds to a minute. I have children who then keep on pressing the buttons, which results in, when it picks up again, a discoshow for like 2 minutes, when it is 'replaying' the list of commands.
I haven't really looked into the code yet, but it feels to me like this app needs threading.
Also, what I've noticed: when started the first time, on/off commands for lights do not work. First you have to set a level, which does work, before on/off commands work. In general on/off does not work very well; set level works way better.
Are you open to have other devvers join?