Koenkk / zigbee-herdsman

A Node.js Zigbee library
MIT License
456 stars 277 forks source link

Zigbee-Herdsman error handling and error reporting #376

Closed asgothian closed 2 years ago

asgothian commented 2 years ago

The GitHub user @Hedda opened an issue (Issue #1110 at iobroker.zigbee) to verify if/how the iobroker.zigbee adapter can use the current efforts towards ZNP Adapter Backup for interoperability between zigbee2mqtt and iobroker.zigbee and a possible chance for a near zero configuration approach for iobroker.zigbee.

I have run a check on how the iobroker.zigbee adapter could interface with these systems, and have come up with one detail which makes the data exchange of configuration data and issues difficult. As it stands now, the Zigbee Herdsman will raise issues with messages when something does not go as planned. This includes (but is not limited to) errors during setup and configuration mismatch. I could not find a method to query the nvbackup information without attempting to use these to start zigbee network.

Additionally, reacting correctly to the errors raised by the herdsman is made increasingly complex due to the need to parse the error messages raised by the herdsman.

I would propose to introduce a set of numeric error codes into the errors raised by the herdsman in order to allow for calling system to clearly identify the error in detail. As we currently are not able to identify the cause of the issue there are currently just two possible reactions to the raised errors:

I am working on tracking down and eliminating these issues. This work would be made a lot easier by having unique error ID's paired with a table defining the state of the herdsman when this error is raised and how to proceed to correctly reset and possibly reinitialize and restart the zigbee-herdsman.

Is this something which can be integrated in the herdsman ? Is this of interest for the zigbee-herdsman project ? I can assist with some of the work, but my toolchain does not currently allow for developing / testing typescript code.

A.

Hedda commented 2 years ago

Ping @castorw who made https://github.com/Koenkk/zigbee-herdsman/pull/303 and @puddly for the indirectly related https://github.com/zigpy/zigpy/issues/557 format discussion about interoperability.

FYI, both of those also reference "Open ZigBee Coordinator Backup Format" -> https://github.com/zigpy/open-coordinator-backup

Open ZigBee Coordinator Backup Format

An ongoing specification for a JSON Zigee coordinator backup format as a hardware-independent non-proprietary open standard.

Rationale

The primary use case for this specification is to provide a simple standard for exporting ZigBee network information to disk. Standardizing this format will allowing users to backup, restore, and migrate their networks between coordinator hardware and network management software without having to needlessly rejoin all of their devices to a new network.

Implementation adoption

This Open ZigBee Coordinator Backup Format has so far been adopted by several popular open source home automation projects, including zigpy (Zigbee library used by Home Assistant and Jeedom) and zigbee-herdsman (Zigbee library used by Zigbee2MQTT and IoBroker).

Provisional Specification

JSON is both human-readable supported well by common programming languages, making it a good fit for files that will be interpreted by both.

Koenkk commented 2 years ago

@asgothian what specific errors are you running into? Some of the things mentioned in https://github.com/ioBroker/ioBroker.zigbee/issues/1110 shouldn't prevent herdsman from starting. e.g.

Reversed ExtPanID in some cases

In this case it should start fine since this is allowed by the code

ExtPanID dddddddddddddddd changed to Coordinator MACID -> Zigbee network must be reprovisioned and all devices must be paired anew.

When the ext pan id is set to 0xdddddddddddddddd herdsman will allow every extended pan id on the adapter.

asgothian commented 2 years ago

Hmm.. I am getting to tired - it seems GitHub ate the comment when i tried to edit it. I will repost tomorrow, sorry about that.

asgothian commented 2 years ago

As of yesterday, we have a lot of issues with people updating from an older adapter version (which includes an older herdsman version) to the current stable version which links to the zigbee-herdsman 0.13.110. For Test purposes, i have a branch with the herdsman version 0.13.0107, but this mostly generates the same errors. Usually, things can be fixed by going back to iobroker.zigbee version 1.4.1 (i cannot access the herdsman version of that, but it should be about 6 months old) without configuration changes - with all devices staying in the zigbee network. I am still working on pinpointing the change which is cause for this, but as i do not have the error myself (yet) it is very slow going.

Note that these issues are just the trigger for the request, not the reason. We currently have a large number of people wary of updating the zigbee adapter as they are unsure if the update (which also updates the herdsman) will have them loose their zigbee networks, which raises the pressure. My intention with having unique error Identifiers for errors raised in the Herdsman is to offer better control over what to report to the user and how to report it based on the actual error ID. I have already started to parse certain errors which are not passed to the iobroker log as errors but as warnings instead (e.g. no response from device, no route to network and similar messages). Being able to isolate 'critical' from 'noncritical' errors would be a big step up and would go a long way towards inspiring confidence in the entire system.

Here are some of the typical log entries we see:

 2021-06-01 18:19:19.813 - info: host.raspberrypi "system.adapter.zigbee.0" enabled
 2021-06-01 18:19:19.910 - info: host.raspberrypi instance system.adapter.zigbee.0 started with pid 28157
 2021-06-01 18:19:22.834 - info: zigbee.0 (28157) starting. Version 1.5.6 in /opt/iobroker/node_modules/iobroker.zigbee, node: v12.22.1, js-controller: 3.2.16
 2021-06-01 18:19:22.932 - info: zigbee.0 (28157) Starting Zigbee npm ...
 2021-06-01 18:19:23.245 - info: zigbee.0 (28157) Installed Version: iobroker.zigbee@1.5.6
 2021-06-01 18:19:24.890 - error: zigbee.0 (28157) Configuration is not consistent with adapter state/backup!
 2021-06-01 18:19:24.891 - error: zigbee.0 (28157) - PAN ID: configured=6754, adapter=6754
 2021-06-01 18:19:24.892 - error: zigbee.0 (28157) - Extended PAN ID: configured=aadddddddddddddd, adapter=00124b0009d6d8b0
 2021-06-01 18:19:24.893 - error: zigbee.0 (28157) - Network Key: configured=01030507090b0d0f00020406080a0c0d, adapter=01030507090b0d0f00020406080a0c0d
 2021-06-01 18:19:24.893 - error: zigbee.0 (28157) - Channel List: configured=24, adapter=24
 2021-06-01 18:19:24.894 - error: zigbee.0 (28157) Please update configuration to prevent further issues.
 2021-06-01 18:19:24.894 - error: zigbee.0 (28157) If you wish to re-commission your network, please remove coordinator backup at /opt/iobroker/iobroker-data/zigbee_0/nvbackup.json.
 2021-06-01 18:19:24.895 - error: zigbee.0 (28157) Re-commissioning your network will require re-pairing of all devices!
 2021-06-01 18:19:24.896 - error: zigbee.0 (28157) Starting zigbee-herdsman problem : "startup failed - configuration-adapter mismatch - see logs above for more information"
 2021-06-01 18:19:24.896 - error: zigbee.0 (28157) Failed to start Zigbee
 2021-06-01 18:19:24.897 - error: zigbee.0 (28157) Error herdsman start
 2021-06-01 18:19:34.898 - info: zigbee.0 (28157) Try to reconnect. 1 attempts left
 2021-06-01 18:19:34.900 - info: zigbee.0 (28157) Starting Zigbee npm ...
 2021-06-01 18:19:34.934 - info: zigbee.0 (28157) Installed Version: iobroker.zigbee@1.5.6
 2021-06-01 18:19:35.137 - error: zigbee.0 (28157) Starting zigbee-herdsman problem : "Error while opening serialport 'Error: Error Resource temporarily unavailable Cannot lock port'"
 2021-06-01 18:19:35.138 - error: zigbee.0 (28157) Failed to start Zigbee
 2021-06-01 18:19:35.139 - error: zigbee.0 (28157) Error herdsman start
2021-05-31 12:14:40.868 - error: zigbee.0 (4983) Starting zigbee-herdsman problem : "network commissioning failed - timed out waiting for nib to settle"
2021-05-31 12:14:40.869 - error: zigbee.0 (4983) Failed to start Zigbee
2021-05-31 12:14:40.870 - error: zigbee.0 (4983) Error herdsman start
zigbee.0    2021-06-01 17:33:17.898 error   (4119) Error herdsman start
 zigbee.0   2021-06-01 17:33:17.898 error   (4119) Failed to start Zigbee
 zigbee.0   2021-06-01 17:33:17.897 error   (4119) Starting zigbee-herdsman problem : "startup failed - configuration-adapter mismatch - see logs above for more information"
 zigbee.0   2021-06-01 17:33:17.897 error   (4119) Re-commissioning your network will require re-pairing of all devices!
 zigbee.0   2021-06-01 17:33:17.896 error   (4119) If you wish to re-commission your network, please remove coordinator backup at /opt/iobroker/iobroker-data/zigbee_0/nvbackup.json.
 zigbee.0   2021-06-01 17:33:17.896 error   (4119) Please update configuration to prevent further issues.
 zigbee.0   2021-06-01 17:33:17.895 error   (4119) - Channel List: configured=11, adapter=
 zigbee.0   2021-06-01 17:33:17.895 error   (4119) - Network Key: configured=01030507090b0d0f00020406080a0c0d, adapter=01030507090b0d0f00020406080a0c0d
 zigbee.0   2021-06-01 17:33:17.894 error   (4119) - Extended PAN ID: configured=00124b001940c090, adapter=0000000000000000
 zigbee.0   2021-06-01 17:33:17.894 error   (4119) - PAN ID: configured=6754, adapter=65535
 zigbee.0   2021-06-01 17:33:17.893 error   (4119) Configuration is not consistent with adapter state/backup!

A.

castorw commented 2 years ago

Hi there @asgothian,

log 1: The case here is pretty clear, configured EPID is kinda non-sense (0xaadddddddddddddd) and the error is pretty clear.

log 2: This may happen if the adapter hangs or refuses to commission network. Is occurrence of this error regular or random? What hardware/firmware is being used?

log 3: This seems to be a very interesting state of adapter's NV and this is the second case I see. Could you provide information on device/firmware being used and a NV dump would be extremely useful (use https://github.com/zigpy/zigpy-znp/blob/dev/zigpy_znp/tools/nvram_read.py). By very interesting state I mean - it shouldn't work as coordinator at all.

Thank you.

Koenkk commented 2 years ago

I'm confused how zigbee.0 2021-06-01 17:33:17.894 error (4119) - PAN ID: configured=6754, adapter=65535 worked previously, since the old code also checked the panID so I would also expect it to reconfigure there. This was done here: https://github.com/Koenkk/zigbee-herdsman/blob/v0.13.94/src/adapter/z-stack/adapter/startZnp.ts#L110 .

@asgothian could you provide the herdsman debug log of the old herdsman version with this adapter?

castorw commented 2 years ago

I'm confused how zigbee.0 2021-06-01 17:33:17.894 error (4119) - PAN ID: configured=6754, adapter=65535 worked previously, since the old code also checked the panID so I would also expect it to reconfigure there. This was done here: https://github.com/Koenkk/zigbee-herdsman/blob/v0.13.94/src/adapter/z-stack/adapter/startZnp.ts#L110 .

@Koenkk not really. The check mentioned by you was against the NV configured PANID. The current check runs against NIB. So it looks like inconsistent NIB, ergo possibly very interesting firmware state.

Koenkk commented 2 years ago

@castorw I see now. I did not confirm this, but I believe the CC2538 (since I've only got reports about this issue with a CC2538) allows starting with a different panID than provided when the provided panID is already occupied. The CC2531 1.2 and CC2652R/CC1352P refuse to start in this case. @Hacker-CB @0x3EC can you confirm this? If this is the case the firmware should be fixed.

kirovilya commented 2 years ago

current firmware for cc2538 starts correctly. it seems to me that old firmware versions behave this way

asgothian commented 2 years ago

Hi there @asgothian,

log 1: The case here is pretty clear, configured EPID is kinda non-sense (0xaadddddddddddddd) and the error is pretty clear.

log 2: This may happen if the adapter hangs or refuses to commission network. Is occurrence of this error regular or random? What hardware/firmware is being used?

I have seen this once. It is constant, on a CC2538 chip. I can query the firmware version from the user.

log 3: This seems to be a very interesting state of adapter's NV and this is the second case I see. Could you provide information on device/firmware being used and a NV dump would be extremely useful (use https://github.com/zigpy/zigpy-znp/blob/dev/zigpy_znp/tools/nvram_read.py). By very interesting state I mean - it shouldn't work as coordinator at all.

Thank you.

What is interesting is that all 3 occurrences are claimed fixed when starting the 1.4.1 iobroker.zigbee adapter, which is why i expect that the configuration is not an issue.

I will see if the affected people can provide respective details as to firmware version et al, but i do not have high hopes there.

That said - fixing these issues is not what i made this issue for. I am more interested in a change in how the herdsman provides error / problem information in order for a client application to actually react to it in a meaningful way.

A.

asgothian commented 2 years ago

I'm confused how zigbee.0 2021-06-01 17:33:17.894 error (4119) - PAN ID: configured=6754, adapter=65535 worked previously, since the old code also checked the panID so I would also expect it to reconfigure there. This was done here: https://github.com/Koenkk/zigbee-herdsman/blob/v0.13.94/src/adapter/z-stack/adapter/startZnp.ts#L110 .

@Koenkk not really. The check mentioned by you was against the NV configured PANID. The current check runs against NIB. So it looks like inconsistent NIB, ergo possibly very interesting firmware state.

I have another case with this situation at this issue. Asking for the requested data. @castorw - can you join the conversation on that issue so we can track down what is going on ?

castorw commented 2 years ago

@asgothian I have joined the issue. Sorry for late reply, I have been pretty unavailable for the last few weeks.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days

SvenEng commented 2 years ago

I am having the same error as in log 2. I run zigbee2mqtt on ubuntu with a CC2531. Deubug log:


  zigbee-herdsman:adapter:zStack:startup:commissioning waiting for NIB to settle                                                                                                              +3s
  serialport/stream _read reading { start: 603, toRead: 64933 } +2ms
  serialport/binding-abstract read +1s
  serialport/bindings/unixRead Starting read +2ms
  serialport/bindings/unixRead read error [Error: EAGAIN: resource temporarily u                                                                                                             navailable, read] {
  errno: -11,
  code: 'EAGAIN',
  syscall: 'read'
} +0ms
  serialport/bindings/unixRead waiting for readable because of code: EAGAIN +0ms
  serialport/bindings/poller Polling for "readable" +3ms
  serialport/bindings/poller error [Error: bad file descriptor] +52ms
  serialport/stream binding.read error [Error: bad file descriptor] +53ms
  serialport/stream disconnected [Error: bad file descriptor] +0ms
  serialport/stream #close +1ms
  serialport/binding-abstract close +54ms
  serialport/stream _read queueing _read for after open +0ms
  serialport/bindings/poller Stopping poller +2ms
  serialport/bindings/poller Destroying poller +1ms
  serialport/stream binding.close finished +2ms
  zigbee-herdsman:adapter:zStack:znp:log Port closed +6s```
castorw commented 2 years ago

@SvenEng This looks more like the adapter resets the USB. Are you running the latest recommended firmware version?

SvenEng commented 2 years ago

My problem was fixed after reflashing the CC2531 with the exact same firmware. Still weird that the problem occured after it worked fine for a couple of weeks.

castorw commented 2 years ago

@SvenEng CC2531 adapters are known to get flash corruption sometimes - this may affect NV stored data, but firmware as well. All you can do is reflash them and restore from backup.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days