Koenkk / zigbee2mqtt

Zigbee 🐝 to MQTT bridge 🌉, get rid of your proprietary Zigbee bridges 🔨
https://www.zigbee2mqtt.io
GNU General Public License v3.0
12.07k stars 1.68k forks source link

Hue bulb OTA update causes adapter restart #23852

Open CrazyCoder opened 1 month ago

CrazyCoder commented 1 month ago

What happened?

My issue is similar to #22873.

I saw OTA updates for Hue bulbs and tried to update just one. Every time I try it, after 10–30 seconds, the adapter restarts.

The same issue occurs with 1.40 Z2M release and 1.40.0-dev commit: 6dade8b.

I'm using HA Yellow with ncp-uart-hw-v7.4.3.0-yellow-115200.gbl adapter firmware.

Home Assistant
Core: 2024.8.3
Supervisor: 2024.08.0
Operating: System 13.1
Frontend: 20240809.0

Z2M configuration:

serial:
  port: /dev/ttyAMA1
  baudrate: 115200
  rtscts: false
  adapter: ember
zigbee_herdsman_debug: false

What did you expect to happen?

OTA update to work.

How to reproduce it (minimal and precise)

Try updating Hue bulb.

Zigbee2MQTT version

1.40

Adapter firmware version

7.4.3 [GA]

Adapter

ember

Setup

HA Yellow

Debug log

Full log starting from the moment of the OTA update: update.log.

The tail of the log:

[2024-09-04 17:08:04] debug:  zh:ember:uart:ash: <--- [FRAME type=DATA]
[2024-09-04 17:08:04] debug:  zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=0](ackRx=0 frmTx=1)
[2024-09-04 17:08:04] debug:  zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=1](ackRx=0)
[2024-09-04 17:08:04] debug:  zh:ember:uart:ash: <--- [FRAME type=DATA]
[2024-09-04 17:08:04] debug:  zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=0](ackRx=0 frmTx=1)
[2024-09-04 17:08:04] debug:  zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=1](ackRx=0)
[2024-09-04 17:08:04] debug:  zh:ember:uart:ash: <--- [FRAME type=DATA]
[2024-09-04 17:08:04] debug:  zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=0](ackRx=0 frmTx=1)
[2024-09-04 17:08:04] debug:  zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=1](ackRx=0)
[2024-09-04 17:08:04] debug:  zh:ember:uart:ash: <--- [FRAME type=DATA]
[2024-09-04 17:08:04] debug:  zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=0](ackRx=0 frmTx=1)
[2024-09-04 17:08:04] debug:  zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=1](ackRx=0)
[2024-09-04 17:08:04] debug:  zh:ember:uart:ash: <--- [FRAME type=DATA]
[2024-09-04 17:08:04] debug:  zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=0](ackRx=0 frmTx=1)
[2024-09-04 17:08:04] debug:  zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=1](ackRx=0)
[2024-09-04 17:08:05] debug:  zh:ember:uart:ash: <--- [FRAME type=DATA]
[2024-09-04 17:08:05] debug:  zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=0](ackRx=0 frmTx=1)
[2024-09-04 17:08:05] debug:  zh:ember:uart:ash: Timer expired waiting for ACK for frmTx=1, ackRx=0
[2024-09-04 17:08:05] debug:  zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=1](ackRx=0)
[2024-09-04 17:08:05] debug:  zh:ember:uart:ash: ---> [FRAME type=DATA_RETX frmReTx=0 frmRx=1](ackRx=0 frmTx=1)
[2024-09-04 17:08:05] debug:  zh:ember:uart:ash: <--- [FRAME type=DATA]
[2024-09-04 17:08:05] debug:  zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=0](ackRx=0 frmTx=1)
[2024-09-04 17:08:05] debug:  zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=1](ackRx=0)
[2024-09-04 17:08:05] debug:  zh:ember:uart:ash: <--- [FRAME type=DATA]
[2024-09-04 17:08:05] debug:  zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=0](ackRx=0 frmTx=1)
[2024-09-04 17:08:05] debug:  zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=1](ackRx=0)
[2024-09-04 17:08:05] debug:  zh:ember:uart:ash: <--- [FRAME type=DATA]
[2024-09-04 17:08:05] debug:  zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=0](ackRx=0 frmTx=1)
[2024-09-04 17:08:05] debug:  zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=1](ackRx=0)
[2024-09-04 17:08:05] debug:  zh:ember:uart:ash: <--- [FRAME type=DATA]
[2024-09-04 17:08:05] debug:  zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=0](ackRx=0 frmTx=1)
[2024-09-04 17:08:05] debug:  zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=1](ackRx=0)
[2024-09-04 17:08:08] debug:  zh:ember:uart:ash: <--- [FRAME type=ERROR]
[2024-09-04 17:08:08] error:  zh:ember:uart:ash: Received ERROR from adapter, with code=ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT.
[2024-09-04 17:08:08] error:  zh:ember:uart:ash: ASH disconnected | Adapter status: ASH_NCP_FATAL_ERROR
[2024-09-04 17:08:08] error:  zh:ember:uart:ash: Error while parsing received frame, status=ASH_NCP_FATAL_ERROR.
[2024-09-04 17:08:08] error:  zh:ember: Adapter fatal error: HOST_FATAL_ERROR
[2024-09-04 17:08:08] debug:  zh:controller: Adapter disconnected
[2024-09-04 17:08:08] info:   zh:ember:uart:ash: ASH COUNTERS since last clear:
[2024-09-04 17:08:08] info:   zh:ember:uart:ash:   Total frames: RX=393, TX=494
[2024-09-04 17:08:08] info:   zh:ember:uart:ash:   Cancelled   : RX=0, TX=0
[2024-09-04 17:08:08] info:   zh:ember:uart:ash:   DATA frames : RX=369, TX=137
[2024-09-04 17:08:08] info:   zh:ember:uart:ash:   DATA bytes  : RX=8391, TX=5725
[2024-09-04 17:08:08] info:   zh:ember:uart:ash:   Retry frames: RX=21, TX=3
[2024-09-04 17:08:08] info:   zh:ember:uart:ash:   ACK frames  : RX=1, TX=353
[2024-09-04 17:08:08] info:   zh:ember:uart:ash:   NAK frames  : RX=0, TX=0
[2024-09-04 17:08:08] info:   zh:ember:uart:ash:   nRdy frames : RX=0, TX=0
[2024-09-04 17:08:08] info:   zh:ember:uart:ash:   CRC errors      : RX=0
[2024-09-04 17:08:08] info:   zh:ember:uart:ash:   Comm errors     : RX=0
[2024-09-04 17:08:08] info:   zh:ember:uart:ash:   Length < minimum: RX=0
[2024-09-04 17:08:08] info:   zh:ember:uart:ash:   Length > maximum: RX=0
[2024-09-04 17:08:08] info:   zh:ember:uart:ash:   Bad controls    : RX=0
[2024-09-04 17:08:08] info:   zh:ember:uart:ash:   Bad lengths     : RX=0
[2024-09-04 17:08:08] info:   zh:ember:uart:ash:   Bad ACK numbers : RX=0
[2024-09-04 17:08:08] info:   zh:ember:uart:ash:   Out of buffers  : RX=0
[2024-09-04 17:08:08] info:   zh:ember:uart:ash:   Retry dupes     : RX=21
[2024-09-04 17:08:08] info:   zh:ember:uart:ash:   Out of sequence : RX=0
[2024-09-04 17:08:08] info:   zh:ember:uart:ash:   ACK timeouts    : RX=3
[2024-09-04 17:08:08] info:   zh:ember:uart:ash: Port closed.
[2024-09-04 17:08:08] info:   zh:ember:uart:ash: ======== ASH stopped ========
[2024-09-04 17:08:08] info:   zh:ember:ezsp: ======== EZSP stopped ========
[2024-09-04 17:08:08] info:   zh:ember: ======== Ember Adapter Stopped ========
[2024-09-04 17:08:08] error:  z2m: Adapter disconnected, stopping
CrazyCoder commented 1 month ago

Likely related to #23704.

Koenkk commented 1 month ago

In case you are running z2m in a Docker container, does the following fix it?

docker exec -it CONTAINER_NAME sh
apk add nano
nano /app/node_modules/zigbee-herdsman-converters/lib/ota/common.js
# Change 'const IMAGE_BLOCK_RESPONSE_DELAY = 250;' to 'const IMAGE_BLOCK_RESPONSE_DELAY = 1000;'
# save and exit nano
exit
docker restart CONTAINER_NAME

(CC: @LaurentvdBos)

CrazyCoder commented 1 month ago

@Koenkk I made this change, but it didn't help:

[2024-09-05 07:23:26] debug:    zh:ember:uart:ash: <--- [FRAME type=DATA]
[2024-09-05 07:23:26] debug:    zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=2](ackRx=2 frmTx=3)
[2024-09-05 07:23:26] debug:    zh:ember:uart:ash: Timer expired waiting for ACK for frmTx=3, ackRx=2
[2024-09-05 07:23:26] debug:    zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=7](ackRx=2)
[2024-09-05 07:23:26] debug:    zh:ember:uart:ash: ---> [FRAME type=DATA_RETX frmReTx=2 frmRx=7](ackRx=2 frmTx=3)
[2024-09-05 07:23:26] debug:    zh:ember:uart:ash: <--- [FRAME type=DATA]
[2024-09-05 07:23:26] debug:    zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=2](ackRx=2 frmTx=3)
[2024-09-05 07:23:26] debug:    zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=7](ackRx=2)
[2024-09-05 07:23:26] debug:    zh:ember:uart:ash: <--- [FRAME type=DATA]
[2024-09-05 07:23:26] debug:    zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=2](ackRx=2 frmTx=3)
[2024-09-05 07:23:26] debug:    zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=7](ackRx=2)
[2024-09-05 07:23:26] debug:    zh:ember:uart:ash: <--- [FRAME type=DATA]
[2024-09-05 07:23:26] debug:    zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=2](ackRx=2 frmTx=3)
[2024-09-05 07:23:26] debug:    zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=7](ackRx=2)
[2024-09-05 07:23:26] debug:    zh:ember:uart:ash: <--- [FRAME type=DATA]
[2024-09-05 07:23:26] debug:    zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=2](ackRx=2 frmTx=3)
[2024-09-05 07:23:26] debug:    zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=7](ackRx=2)
[2024-09-05 07:23:28] debug:    zh:ember:uart:ash: <--- [FRAME type=DATA]
[2024-09-05 07:23:28] debug:    zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=2](ackRx=2 frmTx=3)
[2024-09-05 07:23:28] debug:    zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=7](ackRx=2)
[2024-09-05 07:23:28] debug:    zh:ember:uart:ash: <--- [FRAME type=DATA]
[2024-09-05 07:23:28] debug:    zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=2](ackRx=2 frmTx=3)
[2024-09-05 07:23:28] debug:    zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=7](ackRx=2)
[2024-09-05 07:23:28] debug:    zh:ember:uart:ash: <--- [FRAME type=DATA]
[2024-09-05 07:23:28] debug:    zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=2](ackRx=2 frmTx=3)
[2024-09-05 07:23:28] debug:    zh:ember:uart:ash: Timer expired waiting for ACK for frmTx=3, ackRx=2
[2024-09-05 07:23:28] debug:    zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=7](ackRx=2)
[2024-09-05 07:23:28] debug:    zh:ember:uart:ash: ---> [FRAME type=DATA_RETX frmReTx=2 frmRx=7](ackRx=2 frmTx=3)
[2024-09-05 07:23:28] debug:    zh:ember:uart:ash: <--- [FRAME type=DATA]
[2024-09-05 07:23:28] debug:    zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=2](ackRx=2 frmTx=3)
[2024-09-05 07:23:28] debug:    zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=7](ackRx=2)
[2024-09-05 07:23:28] debug:    zh:ember:uart:ash: <--- [FRAME type=DATA]
[2024-09-05 07:23:28] debug:    zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=2](ackRx=2 frmTx=3)
[2024-09-05 07:23:28] debug:    zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=7](ackRx=2)
[2024-09-05 07:23:31] debug:    zh:ember:uart:ash: <--- [FRAME type=ERROR]
[2024-09-05 07:23:31] error:    zh:ember:uart:ash: Received ERROR from adapter, with code=ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT.
[2024-09-05 07:23:31] error:    zh:ember:uart:ash: ASH disconnected | Adapter status: ASH_NCP_FATAL_ERROR
[2024-09-05 07:23:31] error:    zh:ember:uart:ash: Error while parsing received frame, status=ASH_NCP_FATAL_ERROR.
[2024-09-05 07:23:31] error:    zh:ember: Adapter fatal error: HOST_FATAL_ERROR
[2024-09-05 07:23:31] debug:    zh:controller: Adapter disconnected
[2024-09-05 07:23:31] info:     zh:ember:uart:ash: ASH COUNTERS since last clear:
[2024-09-05 07:23:31] info:     zh:ember:uart:ash:   Total frames: RX=2715, TX=3553
[2024-09-05 07:23:31] info:     zh:ember:uart:ash:   Cancelled   : RX=0, TX=0
[2024-09-05 07:23:31] info:     zh:ember:uart:ash:   DATA frames : RX=2695, TX=891
[2024-09-05 07:23:31] info:     zh:ember:uart:ash:   DATA bytes  : RX=61501, TX=45253
[2024-09-05 07:23:31] info:     zh:ember:uart:ash:   Retry frames: RX=15, TX=3
[2024-09-05 07:23:31] info:     zh:ember:uart:ash:   ACK frames  : RX=3, TX=2658
[2024-09-05 07:23:31] info:     zh:ember:uart:ash:   NAK frames  : RX=0, TX=0
[2024-09-05 07:23:31] info:     zh:ember:uart:ash:   nRdy frames : RX=0, TX=0
[2024-09-05 07:23:31] info:     zh:ember:uart:ash:   CRC errors      : RX=0
[2024-09-05 07:23:31] info:     zh:ember:uart:ash:   Comm errors     : RX=0
[2024-09-05 07:23:31] info:     zh:ember:uart:ash:   Length < minimum: RX=0
[2024-09-05 07:23:31] info:     zh:ember:uart:ash:   Length > maximum: RX=0
[2024-09-05 07:23:31] info:     zh:ember:uart:ash:   Bad controls    : RX=0
[2024-09-05 07:23:31] info:     zh:ember:uart:ash:   Bad lengths     : RX=0
[2024-09-05 07:23:31] info:     zh:ember:uart:ash:   Bad ACK numbers : RX=0
[2024-09-05 07:23:31] info:     zh:ember:uart:ash:   Out of buffers  : RX=0
[2024-09-05 07:23:31] info:     zh:ember:uart:ash:   Retry dupes     : RX=15
[2024-09-05 07:23:31] info:     zh:ember:uart:ash:   Out of sequence : RX=0
[2024-09-05 07:23:31] info:     zh:ember:uart:ash:   ACK timeouts    : RX=3
[2024-09-05 07:23:31] info:     zh:ember:uart:ash: Port closed.
[2024-09-05 07:23:31] info:     zh:ember:uart:ash: ======== ASH stopped ========
[2024-09-05 07:23:31] info:     zh:ember:ezsp: ======== EZSP stopped ========
[2024-09-05 07:23:31] info:     zh:ember: ======== Ember Adapter Stopped ========
Koenkk commented 1 month ago

@Nerivec do you think the adapter crashes because too many commands are sent to it?

Nerivec commented 1 month ago

If IMAGE_BLOCK_RESPONSE_DELAY really was set to 1s, it should be more than fine. I've tested the Dongle-E to almost 100 commands in 1.5sec without crash. Seems like wires are getting crossed as to which side is supposed to ACK.

@CrazyCoder How spammy is your network outside of OTA? Any chance you can try with one of the firmware described in this post?

CrazyCoder commented 1 month ago

@Nerivec The presence sensor is updating every 1 second, according to the logs. Other than that, all the other devices (less than 60 total) do not send updates that often. I don't know what to consider "spammy".

It's been working like that for 2+ years with no issues.

Any chance you can try with one of the firmware described in https://github.com/Koenkk/zigbee2mqtt/discussions/23224#discussioncomment-10422347?

I'll give it a shot.

CrazyCoder commented 1 month ago

@Nerivec I flashed nabucasa_yellow_ncp-uart-hw_115200_max_8.0.1.0.gbl and it appears to help. Updated two bulbs with no issues/restarts.

Should I stick to 8.x firmware for now?

Nerivec commented 1 month ago

Can you let it run for a few hours, then try to update a device again (if you have another that you can)?

The max firmware "working" would point to the adapter not being able to get rid of messages (from z2m or from network) fast enough with previous firmware (max has larger limits on that). The presence sensor likely doesn't help, but it shouldn't be a "deal breaker", as long as the rest of the network is more tamed. Either that or the flashing/hard-restart did the adapter some good.

A few options:

CrazyCoder commented 1 month ago

@Nerivec Thanks! Since previously I could cause the adapter to restart right away, I don't think running for a few hours would change anything. I updated a couple more bulbs, and there are three more remaining. Will try updating them later just in case.

jvuotila commented 1 month ago

Same problem here with SkyConnect. Crashing 2-4 times before one firmware update is complete. And I got 18 devices to update. Will take several days...

CrazyCoder commented 1 month ago

With 8.0.1.0 controller firmware I was able to update all the bulbs without crashing. Even updating two at the same time worked.

Nerivec commented 1 month ago

@jvuotila There's a "max" firmware for skyconnect in the link above too, if you want to give that one a try.

For reference, could you guys list the following for your respective networks:

CC: @darkxst

jvuotila commented 1 month ago

For reference, could you guys list the following for your respective networks:

  • Number of end devices
  • Number of routers

End devices: 93 Routers: 49

CrazyCoder commented 1 month ago

@Nerivec 56 devices total, 4 standalone routers (Custom devices (DiY) ti.router) + 18 other routers.

jvuotila commented 1 month ago

@jvuotila There's a "max" firmware for skyconnect in the link above too, if you want to give that one a try.

@Nerivec I have Silicon Labs firmware 7.4.3.0 (ncp-uart-hw-v7.4.3.0-skyconnect-115200.gbl). Seems to be totally different firmware than behind that link. Is this wrong version?

Nerivec commented 1 month ago

@jvuotila This one (extract archive to get the GBL file) is roughly the equivalent of the NabuCasa but with version 8.0.1 and higher configs. 8.0.1 is still "in testing", regular builders have not made that version available yet. You can revert back to 7.4.3 after you are done testing/updating devices if you prefer. You can flash it with https://github.com/Nerivec/ember-zli or https://github.com/NabuCasa/universal-silabs-flasher

jvuotila commented 1 month ago

@Nerivec Thanks, I flashed my coordinator. But I already managed to run all the updates thru earlier today. But now I got different firmware and let's see how it goes with it later.

iaincaradoc commented 1 month ago

Flashed my USB Skyconnect to 8.0.1 max, and disconnected the presence detector.

No change in behavior.

jvuotila commented 1 week ago

@Nerivec now I got new OTA updates and everything seems to work fine. So that firmware seems to be fine for me.