Koenkk / zigbee2mqtt

Zigbee 🐝 to MQTT bridge 🌉, get rid of your proprietary Zigbee bridges 🔨
https://www.zigbee2mqtt.io
GNU General Public License v3.0
11.7k stars 1.64k forks source link

Zigbee network down, log errors "Command failed with status '(0x02: INVALID_PARAM)'" #19566

Open Stadicus opened 10 months ago

Stadicus commented 10 months ago

What happened?

Starting yesterday, my extensive Zigbee network (130 devices) suddenly stops working after some time. Z2M itself is working (green) and the coordinator is running well (restarted, reflashed), but devices fail to provide updates (HomeAssistant shows all devices "Last seen 4 hours ago") and the log is full with error messages like the following:

Publish 'set' 'state' to 'XXX' failed: 'Error: Command 0xf4ce369957c8cd8c/7 genOnOff.off({}, {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (SREQ '--> AF - dataRequest - {"dstaddr":11364,"destendpoint":7,"srcendpoint":1,"clusterid":6,"transid":126,"options":0,"radius":30,"len":3,"data":{"type":"Buffer","data":[1,41,0]}}' failed with status '(0x02: INVALID_PARAM)' (expected '(0x00: SUCCESS)'))'

Nothing physical changed in the network, and the coordinator runs via PoE and is far away from any wifi access point. I ran a few Home Assistant services updates lately, but can't quite pinpoint what was affected. I did not yet try to restore the whole setup from a backup.

What did you expect to happen?

I'd expect devices to respond, or some error to show up in the frontend, indicating what could be wrong.

How to reproduce it (minimal and precise)

Restarting everything works without issue and does not show any errors:

A few devices get the "Just now" status in "Last seen" column in the web interface, typically less than 10 devices from 130+. Then errors start pouring in.

Every few complete reboots, the network seems to work normally again, but it will stop functioning after some time. I can't reproduce the "working state" yet.

Zigbee2MQTT version

1.33.2-1

Adapter firmware version

20230507

Adapter

HamGeek POE Zigbee 3.0 Coordinator CC2652P

Debug log

Here's the full debug log including herdsman debugging copied from Docker:

20231104-213118-z2m-logs.txt

Stadicus commented 10 months ago

Another surprising factlet, not sure if relevant: when I restart the Z2M Home Assistant addon, the "Last seen" times in the web interface are not persisted. Even devices that just got detected before restarting are shown as last seen many hours/days ago.

msaario commented 10 months ago

I experienced same or very similar issue after updating to 1.33.2-1. Many devices did no longer respond to commands and I received errors or timeout. After reverting back to version 1.33.1-1, everything responds again.

adekloet commented 10 months ago

Please revert off of firmware version 20230507 (use the older 20221226) firmware 0570 is not stable.

Stadicus commented 10 months ago

ease revert off of firmware version 20230507 (use the older 20221226) firmware 0570 is not stable.

Thanks. I did that and it improves the reliability of the network substantially. When i restart Z2M, all devices respond and the issues I've noticed before with the whole network being unresponsive are gone.

I experienced same or very similar issue after updating to 1.33.2-1

The update to 1.33.2-1 might still cause issues. The original issue came up when I wanted to join a new Zigbee light, and this still causes the network reproducibly to become unresponsive. After a Z2M service restart, the network behaves fine. I then "enable joining" and try to pair the Zigbee light (Moes GU10 spot). This does not work and when I try to "disable joining", the command times out. The whole network starts hanging after this:

...all good so far...
info  2023-11-06 09:16:42: Zigbee: disabling joining new devices.
error 2023-11-06 09:16:53: Request 'zigbee2mqtt/bridge/request/permit_join' failed with error: 'SRSP - AF - dataRequestExt after 6000ms'
debug 2023-11-06 09:16:53: Error: SRSP - AF - dataRequestExt after 6000ms
    at Timeout._onTimeout (/app/node_modules/zigbee-herdsman/src/utils/waitress.ts:64:35)
    at listOnTimeout (node:internal/timers:559:17)
    at processTimers (node:internal/timers:502:7)
info  2023-11-06 09:16:53: MQTT publish: topic 'zigbee2mqtt/bridge/response/permit_join', payload '{"data":{},"error":"SRSP - AF - dataRequestExt after 6000ms","status":"error","transaction":"npsyl-2"}'
error 2023-11-06 09:16:59: Publish 'get' 'state' to 'Büro Lichtschalter' failed: 'Error: Read 0xdc8e95fffe09143b/1 genOnOff(["onOff"], {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":true,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (SRSP - AF - dataRequest after 6000ms)'
debug 2023-11-06 09:16:59: Error: Read 0xdc8e95fffe09143b/1 genOnOff(["onOff"], {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":true,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (SRSP - AF - dataRequest after 6000ms)
    at Timeout._onTimeout (/app/node_modules/zigbee-herdsman/src/utils/waitress.ts:64:35)
    at listOnTimeout (node:internal/timers:559:17)
    at processTimers (node:internal/timers:502:7)
error 2023-11-06 09:17:17: Failed to read state of 'Küche Zigfred' after reconnect (Read 0xf4ce36da720badf7/5 genOnOff(["onOff"], {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":true,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (SRSP - AF - dataRequest after 6000ms))
error 2023-11-06 09:17:23: Publish 'set' 'state' to 'Bad Zigfred' failed: 'Error: Command 0xf4ce3687b7625a0b/6 genLevelCtrl.moveToLevelWithOnOff({"level":0,"transtime":0}, {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (SRSP - AF - dataRequest after 6000ms)'
debug 2023-11-06 09:17:23: Error: Command 0xf4ce3687b7625a0b/6 genLevelCtrl.moveToLevelWithOnOff({"level":0,"transtime":0}, {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (SRSP - AF - dataRequest after 6000ms)
    at Timeout._onTimeout (/app/node_modules/zigbee-herdsman/src/utils/waitress.ts:64:35)
    at listOnTimeout (node:internal/timers:559:17)
    at processTimers (node:internal/timers:502:7)

... many more errors

Full log from startup to manual shutdown of Z2M: log_2023-11-06.09-11-04.txt

@msaario, is there an easy way to revert to an older version of the Home Assistant Z2M in addon?

msaario commented 10 months ago

I restored partially backup of only zigbee2mqtt addon. It should by default create backup after every update. Then I disabled automatic updates on zigbee2mqtt.

Edit: I solved my issue fully by updating coordinator EZSP 12 7.3.1.0 firmware on ZB-GW04 and now latest version works without errors.

Koenkk commented 10 months ago

20230507 is not stable indeed, @Stadicus please continue in https://github.com/Koenkk/Z-Stack-firmware/discussions/478

Stadicus commented 10 months ago

Thanks! As written above, falling back on 20221226 solved some of the issues. I will continue to provide feedback regarding the firmware in the linked issue.


But even with my coordinator now running 20221226I stil lhave problems with the latest addon version 1.33.2-1. It starts normally, gets a few device updates and then hangs with many of the 0x02: INVALID_PARAM errors. I don't get these at all when reverting back to version 1.30.3-1:

I reverted now back to 1.30.3, but I'm unsure what could cause this issue and if a future upgrade might fix it.

Koenkk commented 10 months ago

can you also try z2m versions such that we know the last working version?

Stadicus commented 10 months ago

Sure thing. I'm not sure how to update to a specific, non-latest version, however. Any hints?

Koenkk commented 10 months ago
Stadicus commented 10 months ago

Home Assistant unfortunately doesn't allow switching to specific version

Dang. Let me know if I can do some Docker magic or whatnot. I'm not that bad in figuring stuff out... :)

I think the issue is caused by the spamming of Büro Presence and Wohnzimmer Presence, is the issue fixed when removing those from power?

Unfortunately not. I disconnected both presence sensors before updating, and the 0x02: INVALID_PARAM error was thrown immediately after the update. The network becomes unresponsive 12 seconds after startup. In the 4 hours, with the older version, the network was ok and this error it didn't occur once:

There are two other identical presence sensors, but they are not nearly as chatty.

Koenkk commented 9 months ago

I disconnected both presence sensors before updating, and the 0x02: INVALID_PARAM error was thrown immediately after the update.

That's really interesting, could you provide the herdsman debug logging? (both 1.30.3 and 1.33.2). Then we can see all the serial commands that are send to the adapter.

Stadicus commented 9 months ago

The network was running fine with v1.30.3 the whole day, with herdsman logging enabled. I got a few hours of logs from Docker (zipped, as it's 90+ MB) and couldn't find a single 0x02: INVALID_PARAM error.

I just updated to the latest Z2M version 1.33.2 again, and got the 0x02: INVALID_PARAM error directly after startup. Around these error messages, I don't see many log entries from herdsman, however. I just let it run for a few minutes and at 2023-11-08 00:29:14 I stopped the service and captured some of the shutdown process as well.

Note: the two presence sensors are connected again and don't cause any issues when running v1.30.3.

Stadicus commented 9 months ago

Quick update: early today around 6am my network running with coordinator firmware 20221226 and Z2M version 1.30.3 became unresponsive, with many timeouts (error SRSP - AF - dataRequest after 6000ms, see logfile).

After restart of the coordinator and the service, I now also experience the dreaded 0x02: INVALID_PARAM error with version 1.30.3. So far, I was able to reliably switch between the versions and only get this error on the latest version, but it doesn't seem so clear-cut.

Another restart of the Z2M service (preceeded by a power-cycle of the coordinator) brought back a stable state.

Koenkk commented 9 months ago

I noticed that you are running the coordinator over a network, do you have a possibility to connect it over USB and see if that solves the issue?

Stadicus commented 9 months ago

I tried a Sonoff coordinator before, with shorter and very long USB cables. Due to the placement of Home Assistant in a 19" rack, the position in my home was just not working.

Now I use the Hamgeek coordinator with PoE through a decent Cat.6 cable (no Wifi anywhere near it). It's so much better. But unfortunately I can't get a USB cable to this location, and I don't think the coordinator is even capable of connecting through USB.

Koenkk commented 9 months ago

Can you try with z2m 1.33.2 to:

  1. enter the Docker container (read here how to do it)
  2. execute npm install zigbee-herdsman@0.14.103
  3. Restart the container with docker restart (not from the HA ui!)
  4. Check if the issue is fixed now
Stadicus commented 9 months ago

Your instructions seem straight-forward, but I get the following error:

/bin/sh: npm: not found

I checked the following:

On Home Assistant via SSH, I checked the running Docker container:

docker ps | grep zigbee2mqtt
60fc46b16f37   zigbee2mqtt/zigbee2mqtt-aarch64:1.33.2-1   "/sbin/tini -- /dock…"   9 minutes ago   Up 9 minutes   0.0.0.0:8485->8485/tcp, :::8485->8485/tcp   addon_45df7312_zigbee2mqtt

Trying to understand the image itself, looking up the Dockerfile, I would have expected to see the files configuration.yaml and start.sh in the working directory /app, but I see only the following listing:

/app # ls -la
total 60
drwxr-xr-x    1 root     root          4096 Nov  1 20:37 .
drwxr-xr-x    1 root     root          4096 Nov  9 21:02 ..
-rw-rw-r--    1 root     root         35141 Nov  1 20:15 LICENSE
drwxr-xr-x    6 root     root          4096 Nov  1 20:34 dist
-rw-rw-r--    1 root     root          4044 Nov  1 20:15 index.js
drwxr-xr-x  192 root     root          4096 Nov  1 20:35 node_modules
-rw-rw-r--    1 root     root          3037 Nov  1 20:15 package.json

What am I missing?

Koenkk commented 9 months ago

Can you run apk add npm in the container?

Stadicus commented 9 months ago

That command worked, but without any obvious improvement for Z2M:


This is what I did in the Docker container:

/app $ apk add npm
fetch https://dl-cdn.alpinelinux.org/alpine/v3.16/main/aarch64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.16/community/aarch64/APKINDEX.tar.gz
(1/1) Installing npm (8.10.0-r0)
Executing busybox-1.35.0-r17.trigger
OK: 74 MiB in 49 packages
/app $ npm install zigbee-herdsman@0.14.103
npm WARN EBADENGINE Unsupported engine {
npm WARN EBADENGINE   package: 'zigbee2mqtt-frontend@0.6.142',
npm WARN EBADENGINE   required: { node: '>=18' },
npm WARN EBADENGINE   current: { node: 'v16.20.2', npm: '8.10.0' }
npm WARN EBADENGINE }

added 8 packages, removed 1 package, changed 15 packages, and audited 219 packages in 1m

33 packages are looking for funding
  run `npm fund` for details

found 0 vulnerabilities

Edit: updated logfile, I attached the wrong one.

Koenkk commented 9 months ago

Could you provide the herdsman debug log from starting z2m until the error for : 1.30.3 (no error), 1.33.2 with zigbee-herdsman 0.14.103? I know you provide logging for 1.30.3 in v1.30.3_log_herdsman_2023-11-07.zip but this does not include the startup.

Stadicus commented 9 months ago

Certainly. To cover as much ground as possible, I went through the whole update process in controlled steps.

  1. Starting point: ✅ working network with z2m 1.30.3 and herdsman 0.21
  2. after installing herdsman 0.14 & Docker restart --> ❌ network down with 0x02: INVALID_PARAM z2m-logs-v1.30.3-20231111-1-update-herdsman-0.14.txt
  3. after additional Docker restart --> ❌ network down with 0x02: INVALID_PARAM z2m-logs-v1.30.3-20231111-2-docker-restart.txt
  4. after service restart through Home Assistant --> ✅ working network with z2m 1.30.3 and herdsman 0.14
    z2m-logs-v1.30.3-20231111-3-service-restart.txt
  5. after update to z2m 1.33.2 (herdsman 0.21) through Home Assistant --> ❌ network down with 0x02: INVALID_PARAM z2m-logs-v1.33.2-20231111-1-z2m-update-hersman-0.21.txt
  6. after installing herdsman 0.14 and Docker restart --> ❌ network down with 0x02: INVALID_PARAM z2m-logs-v1.33.2-20231111-2-update-herdsman-0.14.txt
  7. after additional Docker restart --> ❌ network down with 0x02: INVALID_PARAM z2m-logs-v1.33.2-20231111-3-docker-restart.txt
  8. after service restart through Home Assistant --> ❌ network down with 0x02: INVALID_PARAM z2m-logs-v1.33.2-20231111-4-service-restart.txt

I hope that helps to systematically get to the bottom of this issue.

Koenkk commented 9 months ago

Interesting, I didn't think of upgrading the herdsman version in 1.30.3. Can you test the following with 1.30.3:

Stadicus commented 9 months ago

Ah, my bad. I just assumed that v1.30.3 already came with herdsman 0.21, but you're right: it's 0.14 and thus the starting point already was z2m v1.30.3 + herdsman 0.14.

After the whole logging session earlier today, I restored to v1.30.3 / herdsman 0.14 and had a well-running network again until just now.

Now I restarted the service from Home Assistant, and can't get it into a running state any more. Even after many restarts, without changing anything in the Docker container, resetting the coordinator (ESP32 + Zigbee), power-cycling the coordinator, and even restoring the working config from earlier today, the system just hangs with the 0x02: INVALID_PARAM every time...

A full restart of Home Assistant (hardware reboot) resolved the issue and now my network is running well again... Maybe a full hardware reboot would not have necessary, but I directly went for the big gun.

Koenkk commented 9 months ago

Interesting! What after you get it working again go into the container do apk add zigbee-herdsman@0.14.103, restart the container? Does it break again?

If yes, what if you start from the working situation, do a restart of the container, does it break again?

Stadicus commented 9 months ago

I think the required repository is missing:

/app $ apk add zigbee-herdsman@0.14.103
fetch https://dl-cdn.alpinelinux.org/alpine/v3.16/main/aarch64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.16/community/aarch64/APKINDEX.tar.gz
WARNING: The repository tag for world dependency 'zigbee-herdsman@0.14.103' does not exist
ERROR: Not committing changes due to missing repository tags. Use --force-broken-world to override.
Koenkk commented 9 months ago

Sorry i meant npm install instead of apk add

Stadicus commented 9 months ago

I first tried just a plain docker restart. It seems to bring down the network every time, while a full service restart from Home Assistant leads to a working network again. I therefore didn't try to re-install herdsman in the Docker container, as this also requireds a restart.

I tried this several times to see if it's reproducible.

Starting with a working network, a plain docker restart led to a broken network:
z2m-logs-v1.30.3-20231112-2-docker-restart-NWK_TABLE_FULL.txt

A Home Assistant service restart lead to a working network again. I then restarted the Docker container through SSH again, leading only to the INVALID_PARAM error:
z2m-logs-v1.30.3-20231112-3-service-restart-docker-restart.txt

Home Assistant service restart lead to a working network again. I then restarted the Docker container through SSH again, leading again to the INVALID_PARAM error:
z2m-logs-v1.30.3-20231112-4-service-restart-docker-restart.txt

Home Assistant service restart lead to a working network again.

Koenkk commented 9 months ago

I'm really starting to think there is some NV corruption at your coordinator side, when flashing the firmware, did you wipe the entire device? Or do you maybe have another coordinator available for testing?

Stadicus commented 9 months ago

This is how I flashed the coordinator with the older firmware, just clicking on "Start":

image

I've already ordered an identical PoE coordinator to use as a router and can use it as a replacement coordinator.

If I understand the FAQ correctly, both fully erasing the current device (clicking on "Erase"), or swapping out the coordinator with an identical model wouldn't require re-pairing all devices, correct?

Koenkk commented 9 months ago

Ok lets check with the replace coordinator!

or swapping out the coordinator with an identical model wouldn't require re-pairing all devices, correct?

No repairing needed, but makes sure you transfer the ieee address (https://www.zigbee2mqtt.io/guide/adapters/flashing/copy_ieeaddr.html#copying-the-ieee-address-of-an-adapter)

Stadicus commented 9 months ago

Thanks for the info, I'll try with the replacement coordinator as soon as I get it.

For now, I fully erased the existing coordinator, but did not touch the existing firmware (20221226). IIUC, from a config perspective that should amount to the same effect?

Maybe that already helps.

Koenkk commented 9 months ago

Coordinator earse & service restart --> network (more or less) stable, reconnecting

Was this a docker restart or via HA?

Stadicus commented 9 months ago

This was a full restart via Home Assistant. But a while after the network became instable even with the older Z2M version and I was not able to bring it up anymore.

Since then, I completely reset my network and am in the process of building it up again. I'm using more strategically placed repeaters running your repeater firmware. I started with the most current version and with 50+ devices paired haven't run into any problems yet. I'll need a few more days and then I can give a full update.

Stadicus commented 7 months ago

Just a quick update: I'm still struggling with my network. I ordered a Zigstar UZG with the new CC1352P7 chip to test if this helps, but I still face issues with the latest 20230507 firmware. Unfortunately, there's no older "CC1352P7" release available, so I'll probably fall back to the older HamGeek Zigstar router with the CC1352P2 chip.

Shellazo commented 7 months ago

I'm having the same issue. In fact, it started when I was using SkyConnect, a few weeks ago (or a similar issue.) I had a Sonoff ZBDONGLE-P in use as a router. Since I'd heard that the P is more stable, I figured I'd swap to that. The migration was fairly easy. The only issue I had was that most of router devices showed up as "missing routers" after a coordinator_check. Again, I just re-paired everything (which is easy with Z2M, just put coordinator in join mode then remove device and the device pops back up).

In any event, the errors all stopped. Until today. The odd thing is that everything seemingly works, but with the same errors discussed above.

I'm on 1.35.1 and have the 20230507 firmware. Might try downgrading firmware but am shy about repairing the missing routers again (since I just did it).

supaeasy commented 2 months ago

I am really starting to think this has to do with Zigfred Devices. Came to this issue by googling Z2m and Zigfred issues. A few days after I installed a Zigfred Plus my once rock solid network became highly unstable, Lights turning on randomly in maximum dimmed mode etc.. It only gets worse, right now my network is close to unusable. @Stadicus did your issues start with the installation of Zigfred or before?

Stadicus commented 2 months ago

It's hard to tell because I used Zigfred devices from the very beginning. I wanted to provide an update here for a long time, and will do so in a few days.

For now: My network is pretty stable, despite me adding Zigfred devices again over time. But they have the worst uptime and are frequently not reachable, causing a lot of error messages.

supaeasy commented 2 months ago

So do you mean re-pairing of Zigfred devices every other month is needed? I have a Zigfred since April and had to reset it to factory defaults and re-pair it twice since then. Have updated to Firmware 31 recently and hope this helps with stability. At least finally it is possible to not fade in lamps, i hated that this couldn't be changed before.