Closed Equidamoid closed 4 years ago
Perhaps a connection issue? (Do you see this only on bulbs, not on sensors?) Do you have the GUI to check the connection? (It just gives an idea; don't take the results literally.) Have you tried a USB extension cable?
Do you have a standalone bulb (not in a group, not in a scenario) that has the same problem too?
@Smanar,
Do you have the GUI to check the connection?
I have the same problem with sensors (or, in general, "sleepy" devices, so also buttons). I decided not to mention them in the original description to avoid complexity. Also, if power plugs and bulbs disappear, it's no surprise that end devices that depend on the nearest bulb also stop working from time to time.
Do you have the GUI to check the connection?
I can try to check the gui if you tell me what to look at there. Although it will be an invasive check since I'll have to restart deCONZ to see the gui.
Have you tried a USB extension cable?
I don't understand the question. Do you mean replacing the wire powering the RPi?
Do you have a standalone bulb (not in a group, not in a scenario) that has the same problem too?
I have all the bulbs/sockets/whatever in the same network. I do have some "deconz groups", but all the automations use per-device calls (/lights/X/...).
Something like this: https://www.phoscon.de/conbee2/img/usb-cable.svg
It moves the ConBee away from interference from the Raspberry Pi (Bluetooth, Wi-Fi, magnetism), which can improve the connection.
As I mentioned, I have a raspbee device, the one that plugs directly to Pi's headers. So can't do :(
Also, I don't use wifi/bt on that raspberry. For an experiment, I'll even rfkill pi's radios and move it a bit further away (~0.7m) from wifi router (and also closer to the nearby zigbee bulbs) as long as the wires allow.
As a random idea, is it possible to detect such interference on deCONZ side?
For the first hours after moving the Pi everything worked fine, but most likely due to reboot. Now at least two bulbs (one Tradfri and one Hue) are not responding again.
You may try the new firmware 0x26350500 for RaspBee I; it should improve on keeping the routes alive.
https://github.com/dresden-elektronik/deconz-rest-plugin/issues/1261#issuecomment-596206539
@manup Flashing it right now! Thanks! Can I maybe help with testing? Like collect extra logs or stress the network somehow?
Can I maybe help with testing? Like collect extra logs or stress the network somehow?
Thanks, for now it would help to know if stability of the network improves with the new version.
@manup, I'm afraid I don't have any good news. While I don't have any statistics to judge whether the situation has improved, the problem I have is definitely not fixed with that firmware.
This morning one of the lamps just stopped responding again. No reaction on API calls, "changed" events are coming via websocket as if everything is fine.
The lamp in question is surrounded by 7 other lamps all within ~1.5m and the rest are working fine. My Pi with raspbee is ~4 m away behind a couple of walls.
@Equidamoid , you may need to power cycle the lights after updating the deconz firmware.
Also, you might be affected by the issue which is solved by #2551 . You could build that one yourself.
However, looking at your first post, it being random devices, you may be having a completely unrelated issue. Speaking from my own experience, your best bet for debugging this is to get yourself a sniffer (CC2531) and look at what's happening on the network around the time such a device goes awol.
@djwlindenaar Oh, rebooting the whole apartment, that will take some effort... The change you mention should be in deconz-dev-2.05.75.deb already, right? I'll try that one.
"Random" is too vague a term. There are some patterns. There are devices I have never seen fail. They seem to mostly be in direct sight of the raspbee. One osram plug in a room nearby works in 100% of the cases, while the other, 0.5m away, only complies on ~30% of the days.
Is it possible to collect the data using another raspbee? I've got one some time ago to try to make some "raw zigbee to rpc" interface, but failed miserably due to unstable network. Blamed it on the second raspbee in the same network (configured as "router") and abandoned the project, but now I'm not so sure anymore.
I think that's what zshark is for, right?
Btw. Rebooting the whole apartment is easy. Just flip the main breaker switch :smile:
Looks like it. Although only conbee is mentioned in like 90% of the cases, which is a bit confusing. I'll give it a try in the coming days, although I start getting horrifying flashbacks about adding a second raspbee to the network, guessing keys, "nothing works without any visible errors", etc. %)
There is no way of getting the "credentials" for joining the network without restarting deCONZ in GUI mode, right?
Yeah, but there is an "except the PC" part to the whole apartment that I omitted :D Anyway, I installed the .75 deb and powercycled, I believe, all the devices one by one. Let's see how it goes in a day or two...
UPD: zshark works out of the box! great job guys! Now I see something like 10-15 messages/sec pretty much correlated with my API calls. What next?
It shouldn't be a problem to add a second RaspBee or ConBee to an existing network. Just configure it as a router, with an empty network key, and pair it to the network (i.e. open the network from the coordinator and then join the network on the router). It should receive the network key on pairing.
Note that running two gateways on the same network is asking for trouble, as both try to configure devices to report to them. Better not to create the REST API resources on the deCONZ instance connected to the router (or disable its REST API plugin) and only use the GUI.
There is no way of getting the "credentials" for joining the network without restarting deCONZ in GUI mode, right?
Sniff the traffic while pairing a device. Make sure to configure the ZHA link key (5a:69:67:42:65:65:41:6c:6c:69:61:6e:63:65:30:39, ZigBeeAlliance09) in Wireshark (under Preferences|Protocols|ZigBee), since that's used to encrypt the network key. Find the message where the network key is exchanged and note the key down (also as a backup for when you need to restore the network configuration on the coordinator). Wireshark will apply it automatically to decrypt messages in the current session, but you'll want to configure it for future sessions.
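As a quick sanity check on the link key quoted above: the hex bytes are simply the ASCII encoding of the passphrase (note the capital B that the hex value 42 gives). A minimal Python sketch; the function names are only illustrative and not part of any tool:

```python
# The ZHA link key as quoted above, written as colon-separated hex bytes.
LINK_KEY_HEX = "5a:69:67:42:65:65:41:6c:6c:69:61:6e:63:65:30:39"


def decode_link_key(hex_key: str) -> str:
    """Convert a colon-separated hex key into its ASCII text."""
    return bytes.fromhex(hex_key.replace(":", "")).decode("ascii")


def encode_link_key(text: str) -> str:
    """Convert ASCII text back into colon-separated hex bytes."""
    return ":".join(f"{byte:02x}" for byte in text.encode("ascii"))


print(decode_link_key(LINK_KEY_HEX))  # prints: ZigBeeAlliance09
```

The key is 16 bytes (128 bits), so a typo in either form is easy to spot by round-tripping it through both functions.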
@ebaauw, thank you for the details! I have the decoded data in wireshark now. The problem does not happen right now (as usual, bugs hide once you get a debug tool), but I will keep capturing logs until something happens again. Should I keep an eye on some specific "routing error" messages?
As a side note, @manup could you please change the help in zshark to suggest the capture filter udp port 17754 instead of a display filter? This should insanely reduce the size of the capture files.
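For context, a capture filter is applied before packets are written to the capture file, whereas a display filter only hides packets that were already captured. A minimal sketch of building such a capture command for long unattended runs; the interface name and output file name below are assumptions, while -i, -f and -w are the standard tshark options for interface, capture filter and output file:

```python
import shlex

ZSHARK_UDP_PORT = 17754  # the port mentioned above


def tshark_capture_command(interface, out_file, port=ZSHARK_UDP_PORT):
    """Build a tshark argument list that only captures the sniffer's UDP traffic."""
    capture_filter = f"udp port {port}"  # capture-filter (BPF) syntax
    return ["tshark", "-i", interface, "-f", capture_filter, "-w", out_file]


# "eth0" and "zigbee.pcapng" are placeholders; use whatever interface
# actually receives the zshark traffic on your machine.
cmd = tshark_capture_command("eth0", "zigbee.pcapng")
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting problems with the space-containing filter string.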
I think I managed to capture one occurrence. Not sure if it is a coincidence or an improvement due to fixes, but the lamp started working within a minute after I saw it not responding.
Now looking at the log, a couple of questions:
No. Time Source Destination Protocol Length Info
2307732 169497.144946475 0x0000 0xc32f ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 244
2307744 169497.301191062 0x0000 0xd670 ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 246
2307841 169498.579085764 0x0000 0xc75e ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 247
2307893 169499.235340482 0x0000 0xc32f ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 244
2307936 169499.774261761 0x0000 0x48ea ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 249
2307944 169499.892365648 0x0000 0xf38c ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 251
2307955 169500.034585081 0x0000 0xc32f ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 254
2307965 169500.150744276 0x0000 0xd670 ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 0
2307967 169500.185211679 0x0000 0xc75e ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 1
2307968 169500.201178256 0x0000 0xc75e ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 1
2308064 169501.359800525 0x0000 0x48ea ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 249
2308065 169501.376179179 0x0000 0x48ea ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 249
2308066 169501.392249790 0x0000 0x48ea ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 249
2308082 169501.601420210 0x0000 0xc75e ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 1
zbee_aps.zdp_cluster == 0x8000
Okay, I have around 47 hours of log. I see the specific "Move to Level with OnOff" call after which there is no ACK. There is a physically reasonable "Route Record" being sent about 7 min before I noticed the problem. For the lamp nearby, the same command results in an ACK.
I don't see either of the "Level with OnOff" messages being passed around by other nodes.
And of course, with my nearly nonexistent understanding of how it is all supposed to work, all the stuff above may be wrong. How do we proceed now?
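One way to narrow down a long capture is to look for the same command being sent to the same destination more than once, which may (this is an assumption, not a confirmed diagnosis) point at retransmissions. A rough sketch that parses the packet list rows pasted above and counts repeated (destination, Seq) pairs; the regular expression assumes exactly that column layout:

```python
import re
from collections import Counter

# Rows copied from the packet list above.
SAMPLE = """\
2307732 169497.144946475 0x0000 0xc32f ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 244
2307744 169497.301191062 0x0000 0xd670 ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 246
2307841 169498.579085764 0x0000 0xc75e ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 247
2307893 169499.235340482 0x0000 0xc32f ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 244
2307936 169499.774261761 0x0000 0x48ea ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 249
2307944 169499.892365648 0x0000 0xf38c ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 251
2307955 169500.034585081 0x0000 0xc32f ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 254
2307965 169500.150744276 0x0000 0xd670 ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 0
2307967 169500.185211679 0x0000 0xc75e ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 1
2307968 169500.201178256 0x0000 0xc75e ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 1
2308064 169501.359800525 0x0000 0x48ea ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 249
2308065 169501.376179179 0x0000 0x48ea ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 249
2308066 169501.392249790 0x0000 0x48ea ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 249
2308082 169501.601420210 0x0000 0xc75e ZigBee HA 125 ZCL Level Control: Move to Level with OnOff, Seq: 1
"""

# Columns: packet no., time, source, destination, ..., "Seq: N" at the end.
ROW = re.compile(
    r"^(\d+)\s+([\d.]+)\s+(0x[0-9a-f]{4})\s+(0x[0-9a-f]{4})\s+.*Seq:\s*(\d+)\s*$",
    re.IGNORECASE,
)


def repeated_commands(text):
    """Return {(destination, seq): count} for pairs seen more than once."""
    counts = Counter()
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            counts[(match.group(4), int(match.group(5)))] += 1
    return {key: n for key, n in counts.items() if n > 1}


for (dest, seq), n in sorted(repeated_commands(SAMPLE).items()):
    print(f"{dest} Seq {seq}: seen {n} times")
```

On the rows above this flags 0x48ea (Seq 249, four times), 0xc75e (Seq 1, three times) and 0xc32f (Seq 244, twice), which could be a starting point for checking which of those destinations ever sent an ACK.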
Tested the updated system for almost a month. Looks like the problem is gone now.
Sometimes a light or two still ignores the command, but repeating it usually gets things done, so it's a completely different and much less severe problem.
I consider this bug fixed.
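For anyone wanting to automate the "repeat the command" workaround, a minimal sketch follows. It assumes the per-device call is a PUT to the light's state resource of the deCONZ REST API; the gateway URL, API key and light id are placeholders, and the helper names are hypothetical. Since the gateway reports success even when the lamp ignores the command, the sketch simply resends a fixed number of times instead of trying to verify the result:

```python
import json
import time
import urllib.request


def build_state_request(base_url, api_key, light_id, state):
    """Build a PUT request for one light's state (URL layout is an assumption)."""
    url = f"{base_url}/api/{api_key}/lights/{light_id}/state"
    body = json.dumps(state).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="PUT",
        headers={"Content-Type": "application/json"},
    )


def send_repeatedly(request, repeats=3, delay_s=1.0,
                    opener=urllib.request.urlopen, sleep=time.sleep):
    """Send the same request several times, pausing between attempts."""
    responses = []
    for attempt in range(repeats):
        with opener(request, timeout=5) as response:
            responses.append(response.read())
        if attempt < repeats - 1:
            sleep(delay_s)
    return responses


# Example (placeholders, not executed here):
# req = build_state_request("http://gateway.local", "APIKEY", 7, {"on": True})
# send_repeatedly(req, repeats=3, delay_s=2.0)
```

The opener and sleep parameters exist only so the function can be exercised without a real gateway.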
I have random devices not responding to API commands and PWA controls from time to time. The problem is not limited to a particular brand. I had Hue, Tradfri and Osram devices ignoring the calls in the last week. The "all off" PWA button always works. Recovery happens spontaneously after a couple of hours (sometimes days) or after powercycling the device or the RPi running deCONZ.
Some particular devices tend to fail more often than others. It feels like the devices more distant from my raspbee fail more frequently ("relayed" messages getting lost?), though I can't rule out observer bias.
I see several similar open issues, but they all mention a specific brand of failing device, so I am creating this bug for a "vendor-agnostic" problem.