Koenkk / zigbee-herdsman

A Node.js Zigbee library
MIT License
456 stars 277 forks source link

Possible routing issue(?): Timeout - 57587 - 1 - 246 - 513 - 4 after 10000ms #136

Closed ginkel closed 4 years ago

ginkel commented 4 years ago

Hi there,

I've got a Zigbee network with the following components:

Today, one SPZB0001 (0xE0F3) became unreachable. It still periodically sends data requests to what it considers its parent (a Tradfri bulb 0x63B), but never gets anything back. Requests addressed to the SPZB0001 end with a timeout such as:

zigbee2mqtt:error 2020-02-10 18:14:32: Publish 'set' 'current_heating_setpoint' to 'climate.thermostat_<redacted>' failed: 'Error: Write 0x00158d0001ffb027/1 hvacThermostat({"16387":{"value":1850,"type":41}}, {"timeout":10000,"defaultResponseTimeout":15000,"manufacturerCode":4151,"disableDefaultResponse":true}) failed (Error: Timeout - 57587 - 1 - 246 - 513 - 4 after 10000ms)'
zigbee2mqtt:debug 2020-02-10 18:14:32: Error: Write 0x00158d0001ffb027/1 hvacThermostat({"16387":{"value":1850,"type":41}}, {"timeout":10000,"defaultResponseTimeout":15000,"manufacturerCode":4151,"disableDefaultResponse":true}) failed (Error: Timeout - 57587 - 1 - 246 - 513 - 4 after 10000ms)
    at Endpoint.<anonymous> (/app/node_modules/zigbee-herdsman/dist/controller/model/endpoint.js:150:23)
    at Generator.throw (<anonymous>)
    at rejected (/app/node_modules/zigbee-herdsman/dist/controller/model/endpoint.js:6:65)

Using a Zigbee sniffer I found out that the reasons seems to be that the coordinator sends the requests to 0xad22, the Tradfri repeater, which previously used to be the parent of the SPZB0001:

image

So, the request ends up on 0xad22, but the SPZB0001 0xE0F3 tries to retrieve it from 0x63B:

image

Any ideas?

Edit: Apologies if I picked the wrong project - not sure if this is zigbee-herdsman, zigbee2mqtt or firmware

Edit^2: In the network map I can see that 0x63B is considered the parent of the SPZB0001. 0x63B does not have a direct link to the coordinator, so attempting to reach it via a router would make sense. I am not that familiar with Zigbee yet, so I am uncertain whether the route should manifest in the Wireshark dump or whether I should observer any communication between the Tradfri repeater and the bulb regarding the SPZB0001.

Koenkk commented 4 years ago

This is indeed a bug in the zStack firmware when source routing is enabled, see https://github.com/Koenkk/zigbee2mqtt/issues/1408 (unfortunately we cannot fix this).

As a workaround you can use the default firmware (which doesn't has source routing).

ginkel commented 4 years ago

The interesting thing is that I am not using the source routing firmware, but the default one (at least I wasn't when the problem happened, I experimentally switched afterwards).

Apart from that: If I may ask, why do you think this cannot be fixed?

Koenkk commented 4 years ago

It's interesting if this also happens with the default one, can you share the wireshark log when trying that? Do you see a Source route failure message in that case?

ginkel commented 4 years ago

I do not even see a route request for 0xE0F3, so the coordinator was probably completely unaware that something may be wrong with the route (except for the timeouts). I can share the pcap, but would prefer to do this privately (along with the required key).

Edit: To avoid confusion when later reading the pcap: There are plenty of route requests for 0xB101, which is a bulb that was physically turned off at that time. That should, however, be completely unrelated.

Koenkk commented 4 years ago

You can contact me on telegram (@koenkk)

ginkel commented 4 years ago

You can contact me on telegram (@Koenkk)

Done.

Koenkk commented 4 years ago

Attempt to fix this is available in the latest dev branch.

After a timeout it will now do a route request and retry the command.

dzungpv commented 4 years ago

@Koenkk i try the dev but problem still the same. In my network there is all end device and only 1 router, it is xiaomi lock (it running on battery and no device will connect through it as my test) and treat as end device here https://github.com/Koenkk/zigbee2mqtt/pull/1764, but to still show 205 error frequently. There is no way to solve the problem unless reset the end devices

Koenkk commented 4 years ago

@dzungpv can you share a bit more information of your step? Firmware version, adapter hardware, list of devices in the network.

dzungpv commented 4 years ago

@Koenkk it is CC1352P_2_20191106 firmware, end device are 10 DNCKATSW001 base on TI sample light Zigbee HA 1.2, 1 Aqara lock ZNMS12LM (show as router in the log and map). I also try enable child aging but it is the same. So i guest with new zigbee-herdsman code here https://github.com/Koenkk/zigbee2mqtt/pull/1764 not affect anymore?

Koenkk commented 4 years ago

@dzungpv could you sniff the network traffic of this? https://www.zigbee2mqtt.io/how_tos/how_to_sniff_zigbee_traffic.html

dzungpv commented 4 years ago

@Koenkk I have capture it but not save, i will retry when it occur again. When it happen coordinator send Route Discovery request.

dzungpv commented 4 years ago

@Koenkk i have found that it cause by my end device, that end device could send status to Coordinator but it not send data request frequently (it stop sending, for unknown reason), so Coordinator could not control it. I will send packet and log to your telegram for debug

ginkel commented 4 years ago

Running dev with the fix included since last weekend. So far the issue hasn't surfaced again.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.