Koenkk / zigbee2mqtt

Zigbee 🐝 to MQTT bridge 🌉, get rid of your proprietary Zigbee bridges 🔨
https://www.zigbee2mqtt.io
GNU General Public License v3.0
12.03k stars 1.67k forks source link

Z2M crash while running OTA update of Philips Hue bulb #22463

Closed Ricc68 closed 3 months ago

Ricc68 commented 5 months ago

What happened?

During the OTA update of a Philips Hue bulb z2m crashed. One thing to note is how slow is the OTA update, wants more than 6 hours to update the light bulb firmware: I wonder if this is normal (no previous experience with ZigBee so I don't know).

What did you expect to happen?

OTA update complete successfully and possibly a faster OTA update as I never seen in my life a device requiring so much time to update. BTW this slow update also happens with the other 2 Sonoff TRVZB valves so the cause does not seem to be a problematic device.

How to reproduce it (minimal and precise)

Run the OTA update of the Philips Hue bulb.

Zigbee2MQTT version

1.37.0

Adapter firmware version

7.4.2 [GA]

Adapter

Sonoff ZBDongle-E

Setup

Add-on on Home Assistant OS, host is a VM on x86-64, dongle uses ember driver

Debug log

[2024-05-05 12:32:04] info: z2m: Update available for 'Lampada del soggiorno' [2024-05-05 12:32:17] info: z2m: Updating 'Lampada del soggiorno' to latest firmware [2024-05-05 12:32:18] info: z2m: Update of 'Lampada del soggiorno' at 0.00% [2024-05-05 12:32:52] info: z2m: Update of 'Lampada del soggiorno' at 0.14%, ≈ 401 minutes remaining [2024-05-05 12:33:25] info: z2m: Update of 'Lampada del soggiorno' at 0.27%, ≈ 411 minutes remaining [2024-05-05 12:34:01] info: z2m: Update of 'Lampada del soggiorno' at 0.51%, ≈ 339 minutes remaining [2024-05-05 12:34:36] info: z2m: Update of 'Lampada del soggiorno' at 0.68%, ≈ 337 minutes remaining [2024-05-05 12:35:09] info: z2m: Update of 'Lampada del soggiorno' at 0.82%, ≈ 346 minutes remaining [2024-05-05 12:35:11] error: zh:ember: Delivery of BROADCAST failed for "65532" [apsFrame={"profileId":0,"clusterId":54,"sourceEndpoint":0,"destinationEndpoint":0,"options":256,"groupId":0,"sequence":154} messageTag=70] [2024-05-05 12:35:11] error: zh:ember: Delivery of BROADCAST failed for "65533" [apsFrame={"profileId":41440,"clusterId":33,"sourceEndpoint":242,"destinationEndpoint":242,"options":256,"groupId":0,"sequence":155} messageTag=48] [2024-05-05 12:35:43] info: z2m: Update of 'Lampada del soggiorno' at 0.95%, ≈ 356 minutes remaining [2024-05-05 12:36:18] info: z2m: Update of 'Lampada del soggiorno' at 1.17%, ≈ 339 minutes remaining [2024-05-05 12:36:57] info: z2m: Update of 'Lampada del soggiorno' at 1.49%, ≈ 307 minutes remaining [2024-05-05 12:37:31] info: z2m: Update of 'Lampada del soggiorno' at 1.66%, ≈ 309 minutes remaining Error: Delivery failed for {"profileId":260,"clusterId":25,"sourceEndpoint":1,"destinationEndpoint":11,"options":4352,"groupId":0,"sequence":0} at EmberOneWaitress.deliveryFailedFor (/app/node_modules/zigbee-herdsman/src/adapter/ember/adapter/oneWaitress.ts:96:31) at EmberAdapter.onMessageSentDeliveryFailed (/app/node_modules/zigbee-herdsman/src/adapter/ember/adapter/emberAdapter.ts:558:30) at Ezsp.emit (node:events:517:28) at Ezsp.ezspMessageSentHandler (/app/node_modules/zigbee-herdsman/src/adapter/ember/ezsp/ezsp.ts:3957:18) at Ezsp.callbackDispatch (/app/node_modules/zigbee-herdsman/src/adapter/ember/ezsp/ezsp.ts:794:18) at Ezsp.tick (/app/node_modules/zigbee-herdsman/src/adapter/ember/ezsp/ezsp.ts:448:22) at listOnTimeout (node:internal/timers:569:17) at processTimers (node:internal/timers:512:7)

Koenkk commented 5 months ago

Could you provide the debug log of this?

See this on how to enable debug logging.

Ricc68 commented 5 months ago

@Koenkk I am attaching a debug log containing a first attempt at updating the Philips Hue which failed almost immediately and a second attempt which is running (not failed yet) but very very slowly. Consider that this bulb is at no more than 2m from the coordinator and LQI is 216 which means it should be well in range for a flawless communication. In the log I also see some occurrences of serial port close -> reopen which is very strange. log.log

Ricc68 commented 5 months ago

@Koenkk In the end I don't know if the update failed or not because z2m crashed with these final messages (not logged in log.log) before it died (I'm attaching the full debug log which by the way contains also the broadcast errors that @Nerivec asked for): log.log

[2024-05-06 20:01:49] debug: zh:ember:queue: Status queue=0 priorityQueue=0. [2024-05-06 20:01:49] debug: zh:ember: ~~~> [ZCL to=10823 apsFrame={"profileId":260,"clusterId":25,"sourceEndpoint":1,"destinationEndpoint":11,"options":4352,"groupId":0,"sequence":0} header={"frameControl":{"reservedBits":0,"frameType":1,"direction":1,"disableDefaultResponse":true,"manufacturerSpecific":false},"manufacturerCode":null,"transactionSequenceNumber":15,"commandIdentifier":5}] [2024-05-06 20:01:49] debug: zh:ember:ezsp: ===> [FRAME: ID=52:"SEND_UNICAST" Seq=104 Len=88] [2024-05-06 20:01:49] debug: zh:ember:uart:ash: ---> [FRAME type=DATA frmTx=0 frmRx=5] [2024-05-06 20:01:49] debug: zh:ember:uart:ash: <--- [FRAME type=DATA] [2024-05-06 20:01:49] debug: zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=1] [2024-05-06 20:01:49] debug: zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=1 frmNum=5] Added to rxQueue [2024-05-06 20:01:49] debug: zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=6] [2024-05-06 20:01:49] debug: zh:ember:ezsp: <=== [FRAME: ID=52:"SEND_UNICAST" Seq=104 Len=7] [2024-05-06 20:01:49] debug: zh:ember:ezsp: ~~~> [SENT type=DIRECT apsSequence=0 messageTag=58 status=SUCCESS] [2024-05-06 20:01:49] debug: zh:ember:uart:ash: <--- [FRAME type=DATA] [2024-05-06 20:01:49] debug: zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=1] [2024-05-06 20:01:49] debug: zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=1 frmNum=6] Added to rxQueue [2024-05-06 20:01:49] debug: zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=7] [2024-05-06 20:01:49] debug: zh:ember:ezsp: <=== [FRAME: ID=63:"MESSAGE_SENT_HANDLER" Seq=104 Len=89] [2024-05-06 20:01:49] debug: zh:ember:ezsp: ezspMessageSentHandler(): callback called with: [type=DIRECT], [indexOrDestination=10823], [apsFrame={"profileId":260,"clusterId":25,"sourceEndpoint":1,"destinationEndpoint":11,"options":4352,"groupId":0,"sequence":0}], [messageTag=58], [status=DELIVERY_FAILED], [messageContents=190f05000b1014010a1f0001e4a10000323714db855d071104c7292206f61ad17b83b26efe9e137a0fe6b579e1686544fc2202acd1c5a6eb678e23e9a4096ba84ad23f] Error: Delivery failed for {"profileId":260,"clusterId":25,"sourceEndpoint":1,"destinationEndpoint":11,"options":4352,"groupId":0,"sequence":0} at EmberOneWaitress.deliveryFailedFor (/app/node_modules/zigbee-herdsman/src/adapter/ember/adapter/oneWaitress.ts:96:31) at EmberAdapter.onMessageSentDeliveryFailed (/app/node_modules/zigbee-herdsman/src/adapter/ember/adapter/emberAdapter.ts:558:30) at Ezsp.emit (node:events:517:28) at Ezsp.ezspMessageSentHandler (/app/node_modules/zigbee-herdsman/src/adapter/ember/ezsp/ezsp.ts:3957:18) at Ezsp.callbackDispatch (/app/node_modules/zigbee-herdsman/src/adapter/ember/ezsp/ezsp.ts:794:18) at Ezsp.tick (/app/node_modules/zigbee-herdsman/src/adapter/ember/ezsp/ezsp.ts:448:22) at listOnTimeout (node:internal/timers:569:17) at processTimers (node:internal/timers:512:7)

Koenkk commented 5 months ago

With crash, do you mean that z2m completely stops after this?

@Nerivec could it be that we are missing an await somewhere? This is a command response imageBlockResponse which has a catch: https://github.com/Koenkk/zigbee-herdsman-converters/blob/0bdd6bf799d131a06a013dc3003eadcd760faf5d/src/lib/ota/common.ts#L499

Ricc68 commented 5 months ago

Yes, with crash I mean z2m completely stops after this. To run z2m again I had to press the Start button.

mundschenk-at commented 5 months ago

With the latest Hue FW updates becoming available, I have about 50 devices to update. Each is super slow and often times out after an hour or so, so you need to start the update again. Unfortunately, there is also no way to schedule multiple updates in series (marking several devices to update eventually brings down the network due to congestion).

Nerivec commented 5 months ago

It must be something specific to Hue that is causing trouble. I had a report, not too long ago, of almost 20 Inovelli devices being updated (sequentially of course 😁) without issue on an ember network with over 300 devices; that would indicate the underlaying system (OTA & driver) is more than fine. Of course, that kind of stress on the network is entirely dependent on the quality of the network, routes taken during OTA (any subpar router involved), anything spamming the network at the same time (like those bad Tuya devices), etc...

mundschenk-at commented 5 months ago

To me it looks like the stability issue has something to with last_seen being enabled as per https://github.com/Koenkk/zigbee2mqtt/issues/19531 - Z2M publishes MQTT state message every few hundred milliseconds, several times per second (!) when last_seen is enabled. I can't see anything less than a second as a useful granularity, probably even 5 seconds would be enough.

With only one update at a time (if that's a hard requirement, Z2M should enforce it), updating the Hue lights seems to be stable for me today, but still quite slow (about an hour for one light, will have to check the logs for better statistics). What I did notice with a quick look at the debug logs is a consistent discrepancy between the requested and sent chunk size:

[2024-05-29 15:32:24] debug:    zhc:ota:common: Request offsets: fileOffset=179206 pageOffset=0                 dataSize=64
[2024-05-29 15:32:24] debug:    zhc:ota:common: Payload offsets: start=179206 end=179256 dataSize=50
mundschenk-at commented 5 months ago

I also get this a lot in the logs, not sure if this is expected:

[2024-05-29 15:49:03] debug:    zh:controller: Skipping command 'imageBlockRequest' because it is missing from the lookup
mundschenk-at commented 5 months ago

Sorry @Nerivec, I may have muddied the waters here - I am using zStack, not Ember. Should I open a separate issue?

Nerivec commented 5 months ago

No on the contrary, it confirms this is a more "generic" issue.

The first dataSize should be maximumDataSize (the value received), the second one is the value determined/used by Z2M. The 'skipping' is because OTA is handled separately, it doesn't have any impact.

At ~50 bytes per message, and the average firmware being 250KB (some Hue appear to be double that), that's a lot of messages, plus it is throttled to avoid crashes, so timeframe for update is unfortunately pretty long; best scenario is unlikely to be less than 20 minutes.

mundschenk-at commented 5 months ago

Would it be possible to increase the chunk size to the requested 64 bytes? I know there was a similar discussion in https://github.com/Koenkk/zigbee-herdsman-converters/pull/6193 and it would a appear that the 50 bytes is more or less an arbitrary limit (i.e. it probably should not be significantly higher due to network congestion, but a power-of-two value like 64 would at least seem more natural too me).

It's been some time since I had a Hue Bridge in use, but FW updates always seemed much quicker to me then. (Though it is possible that this only seemed that way because upgrading was "fire and forget", it would update all devices automatically once you gave it the go.)

Nerivec commented 5 months ago

From the comments in code, it seems the 50 was chosen because higher values often result in instabilities. Also have to consider deeply nested routes and the cost involved. The throttling that Z2M does probably is higher than what the Hue Bridge does, hence the possibly longer timeframe. It's a static value at the moment (250ms), of course a variable one would be better, but the problem is with supporting the myriad of setups/networks/devices... 😅

thargy commented 4 months ago

I have 170 Hue Devices (FML), I would probably be willing to kill for the option to update all (even if it did them sequentially).

As it is, I often have 10-20 on the go at a time, and they usually take 2-3 hours each. I've not seen any real problems (though Z2m can respond slowly to clicks on the UI).

Nerivec commented 4 months ago

@mundschenk-at Any chance you can run custom code to test an OTA refactor? If yes, find me on zigbee2mqtt Discord.

mundschenk-at commented 4 months ago

@Nerivec Sure, I've pinged you on Discord.

Nerivec commented 4 months ago

I know @mundschenk-at had some pretty good results with the new OTA refactoring. Had some good feedback on large ember networks too. So I'm going to tag this. @Ricc68 not sure if you can test latest dev branch, otherwise it will be in July release. Let me know then how it goes.

Ricc68 commented 4 months ago

@Nerivec can't easily test the dev branch as I'm on HAOS. Will report back in July with the next release.

mundschenk-at commented 4 months ago

@Ricc68 You could switch to the Zigbee2MQTT Edge add-on.

Ricc68 commented 4 months ago

@Nerivec I have installed the Edge branch (latest dev) but the OTA update keeps failing. The good news is that z2m is not crashing anymore. It says the device didn't respond to OTA request, but the strange thing is that this device always responded to OTA requests before and now is not responding anymore, I have also tried to power cycle the bulb, just in case, but no joy (btw, the bulb is not responding also to the on/off commands):

[2024-06-15 18:18:42] debug: z2m:mqtt: Received MQTT message on 'zigbee2mqtt/bridge/request/device/ota_update/update' with data '{"id":"Lampada del soggiorno","transaction":"rpjz7-1"}' [2024-06-15 18:18:42] info: z2m: Updating 'Lampada del soggiorno' to latest firmware [2024-06-15 18:18:42] debug: zh:controller:endpoint: ZCL command 0x001788010bd3aeb7/11 genBasic.read(["dateCode","swBuildId"], {"timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":true,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false,"sendPolicy":"immediate"}) [2024-06-15 18:18:42] debug: zh:ember:queue: Status queue=0 priorityQueue=0. [2024-06-15 18:18:42] debug: zh:ember: ~~~> [ZCL to=6105 apsFrame={"profileId":260,"clusterId":0,"sourceEndpoint":1,"destinationEndpoint":11,"options":4416,"groupId":0,"sequence":0} header={"frameControl":{"reservedBits":0,"frameType":0,"direction":0,"disableDefaultResponse":true,"manufacturerSpecific":false},"manufacturerCode":null,"transactionSequenceNumber":7,"commandIdentifier":0}] [2024-06-15 18:18:42] debug: zh:ember:ezsp: ===> [FRAME: ID=52:"SEND_UNICAST" Seq=56 Len=28] [2024-06-15 18:18:42] debug: zh:ember:uart:ash: ---> [FRAME type=DATA frmTx=0 frmRx=2] [2024-06-15 18:18:42] debug: zh:ember:uart:ash: <--- [FRAME type=DATA] [2024-06-15 18:18:42] debug: zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=1] [2024-06-15 18:18:42] debug: zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=1 frmNum=2] Added to rxQueue [2024-06-15 18:18:42] debug: zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=3] [2024-06-15 18:18:42] debug: zh:ember:ezsp: <=== [FRAME: ID=52:"SEND_UNICAST" Seq=56 Len=7] [2024-06-15 18:18:42] debug: zh:ember:ezsp: ~~~> [SENT type=DIRECT apsSequence=216 messageTag=6 status=SUCCESS] [2024-06-15 18:18:47] debug: zh:ember:uart:ash: <--- [FRAME type=DATA] [2024-06-15 18:18:47] debug: zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=1] [2024-06-15 18:18:47] debug: zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=1 frmNum=3] Added to rxQueue [2024-06-15 18:18:47] debug: zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=4] [2024-06-15 18:18:47] debug: zh:ember:ezsp: <=== [FRAME: ID=63:"MESSAGE_SENT_HANDLER" Seq=56 Len=29] [2024-06-15 18:18:47] debug: zh:ember:ezsp: ezspMessageSentHandler(): callback called with: [type=DIRECT], [indexOrDestination=6105], [apsFrame={"profileId":260,"clusterId":0,"sourceEndpoint":1,"destinationEndpoint":11,"options":4160,"groupId":0,"sequence":216}], [messageTag=6], [status=DELIVERY_FAILED], [messageContents=10070006000040] [2024-06-15 18:18:47] debug: zh:controller:endpoint: Error: ZCL command 0x001788010bd3aeb7/11 genBasic.read(["dateCode","swBuildId"], {"timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":true,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false,"sendPolicy":"immediate"}) failed (Delivery failed for {"profileId":260,"clusterId":0,"sourceEndpoint":1,"destinationEndpoint":11,"options":4160,"groupId":0,"sequence":216}) [2024-06-15 18:18:47] debug: zhc:ota:common: Updating '0x001788010bd3aeb7' (LWA017) to latest [2024-06-15 18:18:47] debug: zhc:ota:common: Using endpoint '11' [2024-06-15 18:18:47] debug: zh:controller:endpoint: CommandResponse 0x001788010bd3aeb7/11 genOta.imageNotify({"payloadType":0,"queryJitter":100}, {"timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":true,"direction":1,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false,"sendPolicy":"immediate"}) [2024-06-15 18:18:47] debug: zh:ember:queue: Status queue=0 priorityQueue=0. [2024-06-15 18:18:47] debug: zh:ember: ~~~> [ZCL to=6105 apsFrame={"profileId":260,"clusterId":25,"sourceEndpoint":1,"destinationEndpoint":11,"options":4352,"groupId":0,"sequence":0} header={"frameControl":{"reservedBits":0,"frameType":1,"direction":1,"disableDefaultResponse":true,"manufacturerSpecific":false},"manufacturerCode":null,"transactionSequenceNumber":8,"commandIdentifier":0}] [2024-06-15 18:18:47] debug: zh:ember:ezsp: ===> [FRAME: ID=52:"SEND_UNICAST" Seq=57 Len=26] [2024-06-15 18:18:47] debug: zh:ember:uart:ash: ---> [FRAME type=DATA frmTx=1 frmRx=4] [2024-06-15 18:18:47] debug: zh:ember:uart:ash: <--- [FRAME type=DATA] [2024-06-15 18:18:47] debug: zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=2] [2024-06-15 18:18:47] debug: zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=2 frmNum=4] Added to rxQueue [2024-06-15 18:18:47] debug: zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=5] [2024-06-15 18:18:47] debug: zh:ember:ezsp: <=== [FRAME: ID=52:"SEND_UNICAST" Seq=57 Len=7] [2024-06-15 18:18:47] debug: zh:ember:ezsp: ~~~> [SENT type=DIRECT apsSequence=217 messageTag=7 status=SUCCESS] [2024-06-15 18:18:48] debug: zh:ember:uart:ash: <--- [FRAME type=DATA] [2024-06-15 18:18:48] debug: zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=2] [2024-06-15 18:18:48] debug: zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=2 frmNum=5] Added to rxQueue [2024-06-15 18:18:48] debug: zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=6] [2024-06-15 18:18:48] debug: zh:ember:ezsp: <=== [FRAME: ID=63:"MESSAGE_SENT_HANDLER" Seq=57 Len=27] [2024-06-15 18:18:48] debug: zh:ember:ezsp: ezspMessageSentHandler(): callback called with: [type=DIRECT], [indexOrDestination=6105], [apsFrame={"profileId":260,"clusterId":25,"sourceEndpoint":1,"destinationEndpoint":11,"options":4096,"groupId":0,"sequence":217}], [messageTag=7], [status=DELIVERY_FAILED], [messageContents=1908000064] [2024-06-15 18:19:47] debug: z2m: Update of 'Lampada del soggiorno' failed (Error: Device didn't respond to OTA request) [2024-06-15 18:19:47] info: z2m:mqtt: MQTT publish: topic 'zigbee2mqtt/Lampada del soggiorno', payload '{"brightness":200,"linkquality":216,"power_on_behavior":"previous","state":"OFF","update":{"installed_version":16785162,"latest_version":16786434,"state":"available"},"update_available":null}' [2024-06-15 18:19:47] info: z2m:mqtt: MQTT publish: topic 'zigbee2mqtt/bridge/response/device/ota_update/update', payload '{"data":{"id":"Lampada del soggiorno"},"error":"Update of 'Lampada del soggiorno' failed (Device didn't respond to OTA request)","status":"error","transaction":"rpjz7-1"}' [2024-06-15 18:19:47] error: z2m: Update of 'Lampada del soggiorno' failed (Device didn't respond to OTA request) [2024-06-15 18:19:47] debug: z2m: Error: Device didn't respond to OTA request at requestOTA (/app/node_modules/zigbee-herdsman-converters/src/lib/ota/common.ts:379:15) at Object.updateToLatest (/app/node_modules/zigbee-herdsman-converters/src/lib/ota/common.ts:525:40) at OTAUpdate.onMQTTMessage (/app/lib/extension/otaUpdate.ts:272:41) at EventEmitter.wrappedCallback (/app/lib/eventBus.ts:174:17)

Nerivec commented 4 months ago

@Ricc68 Looks like your issue lies elsewhere (not OTA). Try bringing the device in question close to the coordinator, and re-pair the device to it directly. Then see if the device behaves better, you may have a router in-between the two that's making troubles. As long as you can't send simple commands to the device without encountering problems (stable network), no point even trying OTA, if it doesn't fail at first, it will likely fail at some point (it takes many, many messages to OTA successfully).

Ricc68 commented 4 months ago

@Nerivec understand your point, but the bulb is 1.5m away from the dongle-e and additionally the bulb is the only router in my small ZigBee network. True that there's lot of WiFi pollution from others and in addition the dongle is near my WiFi router but it was working fairly well before installing the dev branch. I will try to move the antenna further away.

Ricc68 commented 4 months ago

@Nerivec I have moved the dongle 1m further away from the WiFi router and things greatly improved. In 30 mins it updated the bulb and I don't see errors anymore. The good thing, other than having found the way to make my network more reliable, is that with the dev build a failed update is not crashing z2m anymore which I think is the true goal of this topic. And thanks for having me double check my setup which seems now to have solved the root cause of the failures (finger crossed).