dresden-elektronik / deconz-rest-plugin

deCONZ REST-API plugin to control ZigBee devices
BSD 3-Clause "New" or "Revised" License

Unreliable communication with devices #2535

Closed Equidamoid closed 4 years ago

Equidamoid commented 4 years ago

I have random devices not responding to API commands and PWA controls from time to time. The problem is not limited to a particular brand: Hue, Tradfri and Osram devices have all ignored calls in the last week. The "all off" PWA button always works. Recovery happens spontaneously after a couple of hours (sometimes days), or after power-cycling the device or the RPi running deCONZ.

Some particular devices tend to fail more often than others. It feels like the devices more distant from my RaspBee fail more frequently ("relayed" messages getting lost?), though I can't rule out observer bias.

I see several similar open issues, but they all mention a specific brand of failing device, so I am creating this bug for a "vendor-agnostic" problem.

Smanar commented 4 years ago

Perhaps a connection issue? (Do you have this only on bulbs, not on sensors?) Do you have the GUI to check the connections? (It just gives an idea; don't take the results at face value.) Have you tried a USB extension cable?

Do you have a standalone bulb (not in a group, not in a scene) that has the same problem too?

Equidamoid commented 4 years ago

@Smanar,

Do you have the GUI to check the connections?

I have the same problem with sensors (or, in general, "sleepy" devices, so also buttons). I decided not to mention them in the original description to avoid complexity. Also, if power plugs and bulbs disappear, it's no surprise that end devices that depend on the nearest bulb also stop working from time to time.

Do you have the GUI to check the connections?

I can check the GUI if you tell me what to look at there, although it will be an invasive check since I'll have to restart deCONZ to see the GUI.

Have you tried a USB extension cable?

I don't understand the question. Do you mean replacing the wire powering the RPi?

Do you have a standalone bulb (not in a group, not in a scene) that has the same problem too?

I have all the bulbs/sockets/whatever in the same network. I do have some "deCONZ groups", but all the automations use per-device calls (/lights/X/...).
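For readers unfamiliar with the per-device calls mentioned above, here is a minimal sketch of what one looks like. It assumes the usual deCONZ REST API shape, a PUT to `/api/<apikey>/lights/<id>/state` with a JSON body; the host, API key and light id are hypothetical placeholders, and the request is only built, not sent.

```python
import json
import urllib.request


def build_light_state_request(host, api_key, light_id, state):
    """Build (but do not send) a PUT request that sets one light's state."""
    # URL layout is an assumption based on the /lights/X/... calls in the thread.
    url = f"http://{host}/api/{api_key}/lights/{light_id}/state"
    body = json.dumps(state).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Hypothetical gateway address, API key and light id.
    req = build_light_state_request(
        "192.168.1.10", "ABCDEF1234", 7, {"on": True, "bri": 128}
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(req, timeout=5)
```

Addressing a single light this way results in a message to that one device, which may explain why it can fail while the "all off" button keeps working.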

Smanar commented 4 years ago

Something like this: https://www.phoscon.de/conbee2/img/usb-cable.svg

It moves the ConBee away from interference from the Raspberry Pi (Bluetooth, Wi-Fi, magnetic fields), which can improve the connection.

Equidamoid commented 4 years ago

As I mentioned, I have a RaspBee, the one that plugs directly into the Pi's headers. So can't do :(

Equidamoid commented 4 years ago

Also, I don't use Wi-Fi/BT on that Raspberry Pi. As an experiment, I'll even rfkill the Pi's radios and move it a bit further away (~0.7 m) from the Wi-Fi router (and also closer to the nearby Zigbee bulbs), as far as the wires allow.

As a random idea, is it possible to detect such interference on deCONZ side?

Equidamoid commented 4 years ago

For the first hours after moving the Pi everything worked fine, but most likely due to reboot. Now at least two bulbs (one Tradfri and one Hue) are not responding again.

manup commented 4 years ago

You may try the new firmware 0x26350500 for RaspBee I; it should improve keeping the routes alive.

https://github.com/dresden-elektronik/deconz-rest-plugin/issues/1261#issuecomment-596206539

Equidamoid commented 4 years ago

@manup Flashing it right now! Thanks! Can I maybe help with testing? Like collect extra logs or stress the network somehow?

manup commented 4 years ago

Can I maybe help with testing? Like collect extra logs or stress the network somehow?

Thanks, for now it would help to know if stability of the network improves with the new version.

Equidamoid commented 4 years ago

@manup, I'm afraid, I don't have any good news. And while I don't have any statistics to judge if the situation is improved, the problem I have is definitely not fixed with that firmware.

This morning one of the lamps just stopped responding again. No reaction on API calls, "changed" events are coming via websocket as if everything is fine.

The lamp in question is surrounded by 7 other lamps all within ~1.5m and the rest are working fine. My Pi with raspbee is ~4 m away behind a couple of walls.

djwlindenaar commented 4 years ago

@Equidamoid , you may need to power cycle the lights after updating the deconz firmware.

Also, you might be affected by the issue which is solved by #2551 . You could build that one yourself.

However, looking at your first post, it being random devices, you may be having a completely unrelated issue. Speaking from my own experience, your best bet for debugging this is to get yourself a sniffer (CC2531) and look at what's happening on the network around the time such a device goes awol.

Equidamoid commented 4 years ago

@djwlindenaar Oh, rebooting the whole apartment, that will take some effort... The change you mention should be in deconz-dev-2.05.75.deb already, right? I'll try that one.

"Random" is too vague a term. There are some patterns. There are some devices I haven't seen fail; they seem to mostly be in direct sight of the RaspBee. One Osram plug in a nearby room works in 100% of the cases, while the other, 0.5 m away from it, only complies on ~30% of days.

Is it possible to collect the data using another RaspBee? I got one some time ago to try to make some "raw zigbee to rpc" interface, but failed miserably due to an unstable network. I blamed it on the second RaspBee in the same network (configured as "router") and abandoned the project, but now I'm not so sure anymore.

djwlindenaar commented 4 years ago

I think that's what zshark is for, right?

Btw. Rebooting the whole apartment is easy. Just flip the main breaker switch :smile:

Equidamoid commented 4 years ago

Looks like it. Although only conbee is mentioned in like 90% of the cases, which is a bit confusing. I'll give it a try in the coming days, although I start getting horrifying flashbacks about adding a second raspbee to the network, guessing keys, "nothing works without any visible errors", etc. %)

There is no way of getting the "credentials" for joining the network without restarting deCONZ in GUI mode, right?

Yeah, but there is an "except the PC" part to the whole apartment that I omitted :D Anyway, I installed the .75 deb and powercycled I believe all the devices one by one. Let's see how it goes in a day or two...

UPD: zshark works out of the box! great job guys! Now I see something like 10-15 messages/sec pretty much correlated with my API calls. What next?

ebaauw commented 4 years ago

It shouldn't be a problem to add a second RaspBee or ConBee to an existing network. Just configure it as a router, with an empty network key, and pair it to the network (i.e. open the network from the coordinator and then join the network on the router). It should receive the network key on pairing.

Note that running two gateways on the same network is asking for trouble, as both try to configure devices to report to them. Better not to create the REST API resources on the deCONZ instance connected to the router (or disable its REST API plugin) and only use the GUI.

There is no way of getting the "credentials" for joining the network without restarting deCONZ in GUI mode, right?

Sniff the traffic while pairing a device. Make sure to configure the ZHA link key (5a:69:67:42:65:65:41:6c:6c:69:61:6e:63:65:30:39, ZigBeeAlliance09) in Wireshark (under Preferences|Protocols|ZigBee), since that's used to encrypt the network key. Find the message where the network key is exchanged and note the key down (also as a backup for when you need to restore the network configuration on the coordinator). Wireshark will apply it automatically to decrypt messages in the current session, but you will want to configure it for future sessions.
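As a quick sanity check before pasting the link key into Wireshark, the colon-separated hex string can be decoded to confirm that it is 16 bytes long and spells out the well-known ASCII phrase. This is only an illustrative sketch; the helper name is made up for this example.

```python
# Link key exactly as quoted in the comment above.
ZHA_LINK_KEY_HEX = "5a:69:67:42:65:65:41:6c:6c:69:61:6e:63:65:30:39"


def link_key_bytes(colon_hex):
    """Convert a colon-separated hex string into raw bytes."""
    return bytes.fromhex(colon_hex.replace(":", ""))


key = link_key_bytes(ZHA_LINK_KEY_HEX)
print(len(key))             # 16 bytes, i.e. a 128-bit key
print(key.decode("ascii"))  # ZigBeeAlliance09
```

Note that byte 0x42 decodes to a capital "B", so the ASCII form of the key is case-sensitive.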

Equidamoid commented 4 years ago

@ebaauw, thank you for the details! I have the decoded data in wireshark now. The problem does not happen right now (as usual, bugs hide once you get a debug tool), but I will keep capturing logs until something happens again. Should I keep an eye on some specific "routing error" messages?

As a side note, @manup, could you please change the help in zshark to suggest the capture filter udp port 17754 instead of a display filter? This should massively reduce the size of the capture files.
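To illustrate the difference: a capture filter is applied before packets are written to disk, whereas a display filter only hides packets that have already been captured. The sketch below builds a `tshark` command line using `-f` (capture filter) rather than `-Y` (display filter). The interface name and output file are hypothetical, and the command is only printed, not executed.

```python
import shlex

ZSHARK_UDP_PORT = 17754  # port quoted in the comment above


def tshark_capture_command(interface, output_file, port=ZSHARK_UDP_PORT):
    """Return a tshark argv that stores only the UDP traffic on `port`."""
    capture_filter = f"udp port {port}"  # BPF syntax, applied at capture time
    return ["tshark", "-i", interface, "-f", capture_filter, "-w", output_file]


if __name__ == "__main__":
    # "lo" and the file name are placeholders; use whichever interface
    # actually carries the sniffer's UDP stream.
    cmd = tshark_capture_command("lo", "zigbee.pcapng")
    print(shlex.join(cmd))
    # To run it for real: subprocess.run(cmd, check=True)
```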

Equidamoid commented 4 years ago

I think I managed to capture one occurrence. Not sure if it is a coincidence or an improvement due to fixes, but the lamp started working within a minute after I saw it not responding.

Now looking at the log, couple of questions:

Equidamoid commented 4 years ago

Okay, I have around 47 hours of log. I see the specific "Move to level with on/off" call after which there is no ACK. There is a physically reasonable "Route Record" being sent ~7 min before I noticed the problem. For the lamp nearby, the same command results in an ACK.

I don't see either of the "level with on/off" messages being passed around by other nodes.

And of course, with my nearly nonexistent understanding of how it is all supposed to work, all of the above may be wrong. How do we proceed now?
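For anyone trying to spot the same command in a capture, here is a rough sketch of the ZCL frame behind "Move to level with on/off". It is not taken from the thread: the constants reflect my reading of the Zigbee Cluster Library (Level Control cluster 0x0008, command 0x04, payload of a one-byte level followed by a two-byte little-endian transition time in tenths of a second). The function name is made up, and the lower-layer APS/NWK headers and encryption are left out entirely.

```python
import struct

LEVEL_CONTROL_CLUSTER_ID = 0x0008       # cluster the command belongs to
CMD_MOVE_TO_LEVEL_WITH_ONOFF = 0x04     # command id within that cluster
FRAME_CONTROL_CLUSTER_SPECIFIC = 0x01   # cluster-specific, client to server


def move_to_level_with_onoff_frame(seq, level, transition_tenths):
    """Pack the ZCL header and payload (no APS/NWK layers, no encryption)."""
    return struct.pack(
        "<BBBBH",
        FRAME_CONTROL_CLUSTER_SPECIFIC,
        seq & 0xFF,                  # transaction sequence number
        CMD_MOVE_TO_LEVEL_WITH_ONOFF,
        level,                       # target brightness, 0..255
        transition_tenths,           # transition time, little-endian
    )


if __name__ == "__main__":
    frame = move_to_level_with_onoff_frame(seq=0x2A, level=254, transition_tenths=4)
    print(frame.hex())  # 012a04fe0400
```

The sequence number is what lets a request be matched with its response when scrolling through a long capture.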

Equidamoid commented 4 years ago

Tested the updated system for almost a month. Looks like the problem is gone now.

Sometimes a light or two still ignores the command, but repeating it usually gets things done, so it's a completely different and much less severe problem.

I consider this bug fixed.
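The "repeat the command" workaround for the occasional ignored command can be sketched as a small retry helper. Everything here is hypothetical: `send` stands for any callable that issues the command and returns whether the light actually obeyed, and how that is detected is left to the caller.

```python
import time


def send_with_retry(send, attempts=3, delay_s=1.0, sleep=time.sleep):
    """Call send() until it returns True; return attempts used, or 0 on failure."""
    for attempt in range(1, attempts + 1):
        if send():
            return attempt
        if attempt < attempts:
            sleep(delay_s)  # brief pause before repeating the command
    return 0


if __name__ == "__main__":
    # Simulated light: ignores the first command, obeys the second.
    outcomes = iter([False, True])
    used = send_with_retry(lambda: next(outcomes), sleep=lambda s: None)
    print(used)  # 2
```

Injecting `sleep` keeps the helper easy to test without real delays.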