Koenkk / zigbee2mqtt

Zigbee 🐝 to MQTT bridge 🌉, get rid of your proprietary Zigbee bridges 🔨
https://www.zigbee2mqtt.io
GNU General Public License v3.0
11.5k stars 1.63k forks source link

Constant crash after random duration / #23013

Open gjaegy opened 4 weeks ago

gjaegy commented 4 weeks ago

What happened?

Hi,

I'm having big troubles getting Z2M to work in a stable manner. I'm using HAOS on a Raspberry 3B+, along with a Sonoff Zigbee Dongle-E device.

I'm using the new "ember" Z2M driver.

I initially updated the firmware using a standard release, then, after some investigation, switched to the "ember-nohw" branch of the "https://github.com/darkxst/silabs-firmware-builder/tree/ember-nohw/firmware_builds/zbdonglee" repository (tried both 7.4.1 & 7.4.3) with hardware flow control disabled in the config (rtscts: false), but I can't get Z2M to run a full day without crashing - it always crash after some time (usually a few hours, sometimes a day, but that not often).

I'm really not sure what else I can try at this stage...

Below my current config, and the log for the latest crash today.

Thanks for any help !!

homeassistant: true
advanced:
  network_key:
    - 74
    - 141
    - 155
    - 90
    - 4
    - 251
    - 100
    - 156
    - 74
    - 138
    - 129
    - 167
    - 203
    - 206
    - 248
    - 245
  pan_id: 51979
  ext_pan_id:
    - 252
    - 218
    - 94
    - 101
    - 106
    - 106
    - 206
    - 31
  homeassistant_legacy_entity_attributes: false
  legacy_api: false
  legacy_availability_payload: false
  log_syslog:
    app_name: Zigbee2MQTT
    eol: /n
    host: localhost
    localhost: localhost
    path: /dev/log
    pid: process.pid
    port: 514
    protocol: udp4
    type: '5424'
  channel: 20
  log_level: debug
mqtt:
  server: mqtt://core-mosquitto:1883
  user: addons
  password: Hie8Eiboo4leePh1iethiem6eevietheKaeg1tiu7Fie4xee0feoCh3unge5rieB
frontend:
  port: 8099
serial:
  port: >-
    /dev/serial/by-id/usb-ITEAD_SONOFF_Zigbee_3.0_USB_Dongle_Plus_V2_20231220153641-if00
  adapter: ember
  baudrate: 115200
  rtscts: false
device_options:
  legacy: false
devices:
  '0x70b3d52b60079bb8':
    friendly_name: PriseTuya1
  '0xe0798dfffebcaa0e':
    friendly_name: SwitchSonoffChaudiere
  '0x7cb03eaa0a091207':
    friendly_name: PriseOsram3
  '0x7cb03eaa0a09439d':
    friendly_name: PriseOsram2
  '0x7cb03eaa0a092cf9':
    friendly_name: PriseOsram1
  '0x00158d0009e06cf6':
    friendly_name: InterrupteurMi1
  '0x00158d0005d29824':
    friendly_name: ZLinky
    tarif: Standard - TEMPO
    kWh_precision: 3
    measurement_poll_interval: 60
    linky_mode: standard
    energy_phase: single_phase
    production: auto
    homeassistant: {}
    legacy: false
    optimistic: true
    qos: 1

What did you expect to happen?

No response

How to reproduce it (minimal and precise)

No response

Zigbee2MQTT version

1.38.0-1

Adapter firmware version

7.4.1

Adapter

Sonoff Dongle-E

Setup

Raspberry 3B+ with HAOS (2024.6.1)

Debug log

[2024-06-11 11:31:59] debug:    zh:ember:uart:ash: Waiting for RSTACK... 0/2500
[2024-06-11 11:31:59] debug:    zh:ember:uart:ash: Waiting for RSTACK... 100/2500
[2024-06-11 11:31:59] debug:    zh:ember:uart:ash: Waiting for RSTACK... 200/2500
[2024-06-11 11:31:59] debug:    zh:ember:uart:ash: Waiting for RSTACK... 300/2500
[2024-06-11 11:31:59] debug:    zh:ember:uart:ash: Waiting for RSTACK... 400/2500
[2024-06-11 11:31:59] debug:    zh:ember:uart:ash: Waiting for RSTACK... 500/2500
[2024-06-11 11:31:59] debug:    zh:ember:uart:ash: Waiting for RSTACK... 600/2500
[2024-06-11 11:31:59] debug:    zh:ember:uart:ash: Waiting for RSTACK... 700/2500
[2024-06-11 11:31:59] debug:    zh:ember:uart:ash: Waiting for RSTACK... 800/2500
[2024-06-11 11:32:00] debug:    zh:ember:uart:ash: Waiting for RSTACK... 900/2500
[2024-06-11 11:32:00] debug:    zh:ember:uart:ash: Waiting for RSTACK... 1000/2500
[2024-06-11 11:32:00] debug:    zh:ember:uart:ash: <--- [FRAME type=RSTACK]
[2024-06-11 11:32:00] info:     zh:ember:uart:ash: ======== ASH connected ========
[2024-06-11 11:32:02] debug:    zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=0]
[2024-06-11 11:32:02] info:     zh:ember:uart:ash: ======== ASH started ========
[2024-06-11 11:32:02] info:     zh:ember:ezsp: ======== EZSP started ========
[2024-06-11 11:32:02] debug:    zh:ember:ezsp: ===> [FRAME: ID=0:"VERSION" Seq=0 Len=4]
[2024-06-11 11:32:03] debug:    zh:ember:uart:ash: ---> [FRAME type=DATA frmTx=0 frmRx=0]
[2024-06-11 11:32:06] debug:    zh:ember:uart:ash: <--- [FRAME type=DATA]
[2024-06-11 11:32:07] debug:    zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=1]
[2024-06-11 11:32:07] debug:    zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=1 frmNum=0] Added to rxQueue
[2024-06-11 11:32:07] debug:    zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=1]
[2024-06-11 11:32:07] debug:    zh:ember:ezsp: <=== [FRAME: ID=0:"VERSION" Seq=0 Len=7]
[2024-06-11 11:32:08] debug:    zh:ember: NCP info: EZSPVersion=13 StackType=2 StackVersion=29744
[2024-06-11 11:32:08] debug:    zh:ember:ezsp: ===> [FRAME: ID=170:"GET_VALUE" Seq=1 Len=6]
[2024-06-11 11:32:08] debug:    zh:ember:uart:ash: ---> [FRAME type=DATA frmTx=1 frmRx=1]
[2024-06-11 11:32:11] debug:    zh:ember:uart:ash: <--- [FRAME type=DATA]
[2024-06-11 11:32:12] debug:    zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=1]
[2024-06-11 11:32:13] debug:    zhc:lixee: zlinky config: standard, single_phase, false, TEMPO
[2024-06-11 11:32:20] debug:    zh:controller:endpoint: ZCL command 0x00158d0005d29824/1 liXeePrivate.read(["activeEnergyOutD02","currentDate","currentIndexTarif","currentPrice"], {"timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":true,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false})
[2024-06-11 11:32:23] debug:    zh:ember:queue: Status queue=0 priorityQueue=0.
[2024-06-11 11:32:31] debug:    zh:ember:ezsp: =x=> [FRAME: ID=170:"GET_VALUE" Seq=1 Len=6] Error: Error: timed out after 14400ms
[2024-06-11 11:32:32] error:    zh:ember:ezsp: ERROR Transaction failure; status=undefined. Last: [FRAME: ID=170:"GET_VALUE" Seq=1 Len=6].
[2024-06-11 11:32:33] error:    zh:ember: !!! NCP FATAL ERROR reason=undefined. ATTEMPTING RESET... !!!
[2024-06-11 11:32:33] info:     zh:ember:queue: Request dispatching stopped; queue=1 priorityQueue=0
[2024-06-11 11:32:33] info:     zh:ember:uart:ash: ASH COUNTERS since last clear:
[2024-06-11 11:32:34] info:     zh:ember:uart:ash:   Total frames: RX=3, TX=5
[2024-06-11 11:32:35] info:     zh:ember:uart:ash:   Cancelled   : RX=0, TX=0
[2024-06-11 11:32:35] info:     zh:ember:uart:ash:   DATA frames : RX=1, TX=2
[2024-06-11 11:32:35] info:     zh:ember:uart:ash:   DATA bytes  : RX=7, TX=10
[2024-06-11 11:32:35] info:     zh:ember:uart:ash:   Retry frames: RX=1, TX=0
[2024-06-11 11:32:39] info:     zh:ember:uart:ash:   ACK frames  : RX=0, TX=2
[2024-06-11 11:32:41] info:     zh:ember:uart:ash:   NAK frames  : RX=0, TX=0
[2024-06-11 11:32:41] info:     zh:ember:uart:ash:   nRdy frames : RX=0, TX=0
[2024-06-11 11:32:41] info:     zh:ember:uart:ash:   CRC errors      : RX=0
[2024-06-11 11:32:41] info:     zh:ember:uart:ash:   Comm errors     : RX=0
[2024-06-11 11:32:41] info:     zh:ember:uart:ash:   Length < minimum: RX=0
[2024-06-11 11:32:41] info:     zh:ember:uart:ash:   Length > maximum: RX=0
[2024-06-11 11:32:41] info:     zh:ember:uart:ash:   Bad controls    : RX=0
[2024-06-11 11:32:41] info:     zh:ember:uart:ash:   Bad lengths     : RX=0
[2024-06-11 11:32:41] info:     zh:ember:uart:ash:   Bad ACK numbers : RX=0
[2024-06-11 11:32:41] info:     zh:ember:uart:ash:   Out of buffers  : RX=0
[2024-06-11 11:32:41] info:     zh:ember:uart:ash:   Retry dupes     : RX=1
[2024-06-11 11:32:41] info:     zh:ember:uart:ash:   Out of sequence : RX=0
[2024-06-11 11:32:41] info:     zh:ember:uart:ash:   ACK timeouts    : RX=0
[2024-06-11 11:32:42] error:    zh:ember: Failed to reset and init NCP. Error
[2024-06-11 11:32:43] debug:    zh:controller: Adapter disconnected
[2024-06-11 11:32:43] info:     zh:ember:queue: Request dispatching stopped; queue=1 priorityQueue=0
[2024-06-11 11:32:43] info:     zh:ember:uart:ash: ASH COUNTERS since last clear:
[2024-06-11 11:32:43] info:     zh:ember:uart:ash:   Total frames: RX=3, TX=5
[2024-06-11 11:32:43] info:     zh:ember:uart:ash:   Cancelled   : RX=0, TX=0
[2024-06-11 11:32:43] info:     zh:ember:uart:ash:   DATA frames : RX=1, TX=2
[2024-06-11 11:32:43] info:     zh:ember:uart:ash:   DATA bytes  : RX=7, TX=10
[2024-06-11 11:32:43] info:     zh:ember:uart:ash:   Retry frames: RX=1, TX=0
[2024-06-11 11:32:43] info:     zh:ember:uart:ash:   ACK frames  : RX=0, TX=2
[2024-06-11 11:32:43] info:     zh:ember:uart:ash:   NAK frames  : RX=0, TX=0
[2024-06-11 11:32:43] info:     zh:ember:uart:ash:   nRdy frames : RX=0, TX=0
[2024-06-11 11:32:43] info:     zh:ember:uart:ash:   CRC errors      : RX=0
[2024-06-11 11:32:43] info:     zh:ember:uart:ash:   Comm errors     : RX=0
[2024-06-11 11:32:43] info:     zh:ember:uart:ash:   Length < minimum: RX=0
[2024-06-11 11:32:43] info:     zh:ember:uart:ash:   Length > maximum: RX=0
[2024-06-11 11:32:43] info:     zh:ember:uart:ash:   Bad controls    : RX=0
[2024-06-11 11:32:43] info:     zh:ember:uart:ash:   Bad lengths     : RX=0
[2024-06-11 11:32:43] info:     zh:ember:uart:ash:   Bad ACK numbers : RX=0
[2024-06-11 11:32:43] info:     zh:ember:uart:ash:   Out of buffers  : RX=0
[2024-06-11 11:32:43] info:     zh:ember:uart:ash:   Retry dupes     : RX=1
[2024-06-11 11:32:43] info:     zh:ember:uart:ash:   Out of sequence : RX=0
[2024-06-11 11:32:43] info:     zh:ember:uart:ash:   ACK timeouts    : RX=0
[2024-06-11 11:32:48] error:    zh:ember:uart:ash: Failed to close serial port Error: Port is not open.
[2024-06-11 11:32:51] info:     zh:ember:uart:ash: ======== ASH stopped ========
[2024-06-11 11:32:51] info:     zh:ember:ezsp: ======== EZSP stopped ========
[2024-06-11 11:32:51] info:     zh:ember: ======== Ember Adapter Stopped ========
[2024-06-11 11:32:52] error:    z2m: Adapter disconnected, stopping
[2024-06-11 11:32:55] debug:    z2m: Saving state to file /config/zigbee2mqtt/state.json
[2024-06-11 11:33:00] info:     z2m:mqtt: MQTT publish: topic 'zigbee2mqtt/bridge/state', payload '{"state":"offline"}'
[2024-06-11 11:33:04] info:     z2m:mqtt: Disconnecting from MQTT server
[2024-06-11 11:33:07] info:     z2m: Stopping zigbee-herdsman...
[2024-06-11 11:33:08] debug:    zh:controller:database: Writing database to '/config/zigbee2mqtt/database.db'
[2024-06-11 11:33:10] info:     z2m: Stopped zigbee-herdsman
[2024-06-11 11:33:12] info:     z2m: Stopped Zigbee2MQTT

/app/node_modules/winston/node_modules/readable-stream/lib/_stream_writable.js:264
  var er = new ERR_STREAM_WRITE_AFTER_END();
           ^
Error: write after end
    at writeAfterEnd (/app/node_modules/winston/node_modules/readable-stream/lib/_stream_writable.js:264:12)
    at DerivedLogger.Writable.write (/app/node_modules/winston/node_modules/readable-stream/lib/_stream_writable.js:300:21)
    at DerivedLogger.log (/app/node_modules/winston/lib/winston/logger.js:252:14)
    at Logger.log (/app/lib/util/logger.ts:164:25)
    at Logger.error (/app/lib/util/logger.ts:169:14)
    at EmberAdapter.onNcpNeedsResetAndInit (/app/node_modules/zigbee-herdsman/src/adapter/ember/adapter/emberAdapter.ts:1518:20)
Nerivec commented 4 weeks ago

Dongle-E does not support hardware flow control, so using ember-nohw or main branch, it will be the same firmware. That ember-nohw branch is only relevant for adapters that support hardware flow control (Skyconnect, Yellow, etc).

The log you posted shows lixee trying to read from the device even though z2m hasn't started yet, so there is definitely something very wrong there. It's also taking way too long for messages to come back, based on those timestamps.

PS: The ZLinky is known to be a terrible spammer, which can cause major issues in already-unstable networks. You may want to try running a clean Z2M install without pairing the ZLinky, first, to see if the issue disappears._

gjaegy commented 4 weeks ago

Thanks a lot for your support, much appreciated.

I've already rebooted the Pi (power off) but that doesn't prevent the issue to appear again, sometimes quite quickly after reboot.

I'm going to try your suggestions one by one. I've now plug the dongle through a power usb hub, let's see if that helps.

In the case the issue persists, I'll then disconnect the ZLinky - it would be annoying if that was the cause as th'he Zlinky was the primary reason I set up HA 😂

The zlinky is in the same room as the Pi, so that part of the network shouldn't be that unstable (i'm not a zigbee expert but I understood it's a kind of mesh/P2P network - the ZLinky should be directly connected to the Pi dongle.

Understood about flow control - the amount of bad information one can find in internet forums is crazy... So I guess I can remove the 'rtscts' line in my config, is that correct (and install the main branch version of the driver)?

Nerivec commented 4 weeks ago

So I guess I can remove the 'rtscts' line in my config, is that correct (and install the main branch version of the driver)?

rtscts defaults to false if not in the config, so yes, you can remove it, or leave it to false, same thing. You can keep the firmware til next update, since it's the same anyway.

As for the ZLinky, if you can't fix the problem with the suggestions, removing it from the network will at least give us an indication if it is the cause or not. Then we can go from there.

Can you provide the debug log from right after a clean reboot of HAOS (not from after a Z2M crash/restart)?

gjaegy commented 4 weeks ago

It seems I haven't experienced any crash for the last 48 hours, which is clearly a record.

What I've done so far:

While I'm surpised there would have been any under-power issue (I've checked the Pi under-power extension installed in HA and it didn't report any low-power periods), another aspect of that change is that I've slightly move the antenna to make it more "visible", so maybe the reception is a bit better now (who knows).

I suspect the ZLinky to have been misconfigured with a high update rate, however, I initially increase the update rate when I bought it, but since it didn't change anything, I rolled back the change - possibly it wasn't applied correctly, hence the "reconfigure" operation fixed that ?

I don't know, really.

For information, before I made the two changes mentioned above, HA start process often failed when the PI was rebooted (it crashed before start was completed). To solve that, I had to unplug the dongle-e, reboot the Pi, then plug the dongle-e once fully started, and start Z2M at that point. Could that be a hint power was an issue ?

I would suggest to leave that ticket open another week or so, I will keep you updated and try something else if needed.

I'll send you the log you asked for (but possibly, start is OK now).

gjaegy commented 4 weeks ago

log.log Attached the log file generated after a fresh restart of HA. Let me know if that's what you asked for !

thomasGomes commented 3 weeks ago

Hello,

I have the same behavior and i have a ZLinky too... After updated my dongle weeks ago to last stable version, the crashes happens less often. But 2 days ago the major part of my Zigbee devices was unavailable again... I think it is the ZLinky device which fall the first each time.

I will try to see if i can reconfigure it to reduce the update rate. For information, before using Zigbee2MQTT, i used Zigbee Home Automation without any problem (i made the change for a compatibility device).

thomasGomes commented 3 weeks ago

@gjaegy, can you share your configuration for the ZLinky and the process to reconfigure it?

gjaegy commented 3 weeks ago

Hi Thomas,

I've rolled back to the default configuration, and simply increased the precision of the output to 3 digits.

However, Z2M crashed this morning, after 2 days without crashing. Attached the log.

log (2).log

The only change I can think of: I've put the Linky cover back on the device, hence covering the ZLinky under a thin 1-2mm sheet of plastic. Could that decrease the signal and cause the transmission to fail ?

So, my problem is still unsolved I'm affraid :/

thomasGomes commented 3 weeks ago

OK, so we have the same configuration. My device is covered too but my RaspberryPi with HA is very close (under 30 centimeters). And to notice that the ZLinky is not the only device which doesn't provide data (some others, not all, can be unavailable in HA). When this happens, i don't succeed to find related logs...

Nerivec commented 3 weeks ago

If it can help, check out this thread, a few users had some past troubles with the ZLinky and tried various changes: https://github.com/Koenkk/zigbee2mqtt/issues/21198#issuecomment-1929344726

@thomasGomes Seems you have some other problem if devices are unavailable. Check your routers, see if one is causing trouble. You can use https://nerivec.github.io/z2m-ember-helper/ for a quick overview of your network based on a log file (the longer the runtime, the better the statistics).

@gjaegy Seems Z2M is forced to restart way too often because the ZLinky is bombarding it with retry messages for no reason (it eventually can't keep up), but the crash at the end appears unrelated:

[2024-06-14 08:46:40] error:    z2m:mqtt: MQTT error: getaddrinfo EAI_AGAIN core-mosquitto
[2024-06-14 08:46:56] error:    z2m:mqtt: Not connected to MQTT server!

Looks like some trouble on your other network (can't resolve the IP of the core-mosquitto hostname).

gjaegy commented 3 weeks ago

Hmmm I see, weird that getaddrinfo() fails. MQTT server is running on the same device (the Raspberry Pi) within HA. Why would that call fail then, any idea ?

I’ll contact the ZLinky author, maybe he can fix that “spam” issue in the case there is an issue.