Koenkk / zigbee2mqtt

Zigbee 🐝 to MQTT bridge 🌉, get rid of your proprietary Zigbee bridges 🔨
https://www.zigbee2mqtt.io
GNU General Public License v3.0

Zigbee2MQTT service stops working - FATAL ERROR #14853

Closed (pauloon closed this issue 1 year ago)

pauloon commented 2 years ago

What happened?

After the last update, I noticed that Z2M stops with a fatal error, out of the blue.

I'm running HASS.IO, fully updated, on an i5 machine with 8 GB of memory and a 128 GB SSD.

What did you expect to happen?

No response

How to reproduce it (minimal and precise)

Just leave it running. After one or two days it stops working (service drops).

Zigbee2MQTT version

1.28.2 commit: unknown

Adapter firmware version

20220219

Adapter

SONOFF USB Dongle

Debug log

I did not have the debug log active when this happened. Posting normal log.

...
Zigbee2MQTT:info 2022-11-07 11:45:15: MQTT publish: topic 'zigbee2mqtt/Smart Plug 15', payload '{"child_lock":"UNLOCK","current":0.04,"energy":6.45,"indicator_mode":"off/on","last_seen":"2022-11-07T11:45:13-03:00","linkquality":102,"power":0,"power_outage_memory":"restore","state":"ON","update":{"state":"idle"},"update_available":false}'

<--- Last few GCs --->

[7:0x7fa21c7993c0] 61013533 ms: Mark-sweep 2044.2 (2085.3) -> 2042.2 (2085.3) MB, 2087.8 / 0.0 ms (average mu = 0.133, current mu = 0.010) allocation failure scavenge might not succeed
[7:0x7fa21c7993c0] 61015639 ms: Mark-sweep 2044.3 (2085.3) -> 2042.2 (2085.3) MB, 2082.8 / 0.0 ms (average mu = 0.074, current mu = 0.011) allocation failure scavenge might not succeed

<--- JS stacktrace --->

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
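For anyone reading the `Last few GCs` lines: as I understand V8's trace format, each Mark-sweep entry lists the used heap (with the committed heap in parentheses) before and after the collection, followed by the pause time. Below is a minimal Python sketch that pulls those numbers out of the first trace line above. The regex and the helper name `parse_mark_sweep` are my own, not part of Zigbee2MQTT or Node.js.

```python
import re

# Matches the size part of a V8 "Mark-sweep" trace line:
#   Mark-sweep <used> (<committed>) -> <used> (<committed>) MB, <pause> / ... ms
MARK_SWEEP = re.compile(
    r"Mark-sweep (\d+(?:\.\d+)?) \((\d+(?:\.\d+)?)\) -> "
    r"(\d+(?:\.\d+)?) \((\d+(?:\.\d+)?)\) MB, (\d+(?:\.\d+)?) /"
)


def parse_mark_sweep(trace: str) -> list[dict]:
    """Return one dict per Mark-sweep entry found in the trace text."""
    entries = []
    for m in MARK_SWEEP.finditer(trace):
        before, _, after, committed, pause_ms = map(float, m.groups())
        entries.append({
            "used_before_mb": before,
            "used_after_mb": after,
            "committed_mb": committed,
            "freed_mb": round(before - after, 1),
            "pause_ms": pause_ms,
        })
    return entries


# First Mark-sweep line from the log above.
trace = (
    "[7:0x7fa21c7993c0] 61013533 ms: Mark-sweep 2044.2 (2085.3) -> 2042.2 (2085.3) MB, "
    "2087.8 / 0.0 ms (average mu = 0.133, current mu = 0.010) allocation failure "
    "scavenge might not succeed"
)

for entry in parse_mark_sweep(trace):
    print(entry)
```

In this trace a collection that pauses the process for about two seconds frees roughly 2 MB of a heap of about 2 GB, which is the situation the "Ineffective mark-compacts near heap limit" message describes.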

eshchar commented 1 year ago

I'm having this issue with Zigbee2MQTT too. Here is the last part of the logs:

'{"battery":100,"linkquality":212,"occupancy":true,"power_outage_count":26,"voltage":3035}'
Zigbee2MQTT:info 2023-02-19 20:58:49: MQTT publish: topic 'zigbee2mqtt/Motion sensor', payload '{"battery":100,"linkquality":212,"occupancy":false,"power_outage_count":26,"voltage":3035}'
Zigbee2MQTT:info 2023-02-19 21:08:48: MQTT publish: topic 'zigbee2mqtt/Button', payload '{"action":null,"battery":100,"click":null,"linkquality":200,"power_outage_count":637,"voltage":3042}'
Zigbee2MQTT:info 2023-02-19 21:32:56: MQTT publish: topic 'zigbee2mqtt/Motion sensor', payload '{"battery":100,"linkquality":212,"occupancy":false,"power_outage_count":26,"voltage":3035}'
Zigbee2MQTT:error 2023-02-19 22:24:14: Not connected to MQTT server!
Zigbee2MQTT:error 2023-02-19 22:24:41: Not connected to MQTT server!
Zigbee2MQTT:error 2023-02-19 22:25:40: Not connected to MQTT server!
Zigbee2MQTT:error 2023-02-19 22:26:52: Not connected to MQTT server!
Zigbee2MQTT:error 2023-02-19 22:28:08: Not connected to MQTT server!
Zigbee2MQTT:error 2023-02-19 22:29:33: Not connected to MQTT server!
Zigbee2MQTT:error 2023-02-19 22:31:35: Not connected to MQTT server!
Zigbee2MQTT:error 2023-02-19 22:33:37: Not connected to MQTT server!
Zigbee2MQTT:error 2023-02-19 22:35:17: Not connected to MQTT server!
Zigbee2MQTT:error 2023-02-19 22:38:38: Not connected to MQTT server!

<--- Last few GCs --->

[8:0x7fb8e1fe33c0] 46065449 ms: Mark-sweep 1851.4 (2008.8) -> 1835.4 (2002.0) MB, 3565.4 / 0.7 ms (average mu = 0.174, current mu = 0.143) allocation failure scavenge might not succeed
[8:0x7fb8e1fe33c0] 46068097 ms: Mark-sweep 1851.1 (2002.0) -> 1835.3 (1999.5) MB, 2564.2 / 0.8 ms (average mu = 0.116, current mu = 0.032) allocation failure scavenge might not succeed

<--- JS stacktrace --->

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory

I have only 2 Zigbee devices, the Xiaomi motion sensor and button, connected to a Sonoff hub with ZHA ZBBridge Tasmota version 12.4.0(zbbridge).

oversc0re commented 1 year ago

An issue identical to eshchar's happened to me recently. After a week of normal operation, the MQTT connection dropped and an out-of-memory error occurred. I am running Zigbee2MQTT on very limited resources (RPi B), installed with Docker via docker-compose. Should we open a new ticket for this?

segdy commented 1 year ago

It seems I am having a similar issue which could be related to https://github.com/Koenkk/zigbee2mqtt/issues/12732

I am running zigbee2mqtt on a RPi Zero W which worked flawlessly for weeks. Just yesterday, after I updated the MQTT server on a different machine this started to happen (seemingly).

I am running with herdsman debug enabled but see no obvious issues. Before this happens, the node process runs at 100% CPU and is extremely laggy and unresponsive. After ~30 min it is killed with this message:

  zigbee-herdsman:controller:endpoint Request Queue (0x94deb8fffe7bc6f1/1): send checkinRsp request immediately (sendWhen=immediate) +3ms
  zigbee-herdsman:adapter:zStack:unpi:parser --- parseNext [254,23,68,129,0,0,32,0,241,41,1,1,0,21,0,170,0,124,0,0,3,9,58,0,65,229,28,97,254,5,69,196,68,68,1,198,44,111,254,23,68,129,0,0,32,0,68,68,1,1,0,76,0,51,48,124,0,0,3,9,12,0,198,44,28,49,254,23,68,129,0,0,32,0,249,239,1,1,0,91,0,128,62,124,0,0,3,9,98,0,108,155,28,254,254,5,69,196,161,26,1,62,232,232,254,23,68,129,0,0,32,0,161,26,1,1,0,54,0,30,128,124,0,0,3,9,78,0,62,232,28,19,254,23,68,129,0,0,32,0,178,110,1,1,0,51,0,137,163,124,0,0,3,9,45,0,178,110,29,173,254,5,69,196,37,6,1,108,155,81,254,5,69,196,241,41,1,65,229,249,254,23,68,129,0,0,32,0,37,6,1,1,0,87,0,100,243,124,0,0,3,9,89,0,108,155,28,213,254,23,68,129,0,0,32,0,241,41,1,1,0,21,0,178,245,124,0,0,3,9,59,0,65,229,28,141,254,5,69,196,68,68,1,198,44,111,254,23,68,129,0,0,32,0,68,68,1,1,0,76,0,219,38,125,0,0,3,9,13,0,198,44,28,207,254,23,68,129,0,0,32,0,249,239,1,1,0,91,0,40,53,125,0,0,3,9,99,0,108,155,28,93,254,5,69,196,161,26,1,62,232,232,254,23,68,129,0,0,32,0,161,26,1,1,0,54,0,117,118,125,0,0,3,9,79,0,62,232,28,142,254,23,68,129,0,0,32,0,178,110,1,1,0,54,0,32,153,125,0,0,3,9,46,0,178,110,29,57,254,5,69,196,37,6,1,108,155,81,254,5,69,196,241,41,1,65,229,249,254,23,68,129,0,0,32,0,37,6,1,1,0,87,0,180,233,125,0,0,3,9,90,0,108,155,28,29,254,23,68,129,0,0,32,0,241,41,1,1,0,21,0,122,235,125,0,0,3,9,60,0,65,229,28,93,254,5,69,196,68,68,1,198,44,111,254,23,68,129,0,0,32,0,68,68,1,1,0,76,0,34,30,126,0,0,3,9,14,0,198,44,28,14] +183ms
  zigbee-herdsman:adapter:zStack:unpi:parser --> parsed 23 - 2 - 4 - 129 - [0,0,32,0,241,41,1,1,0,21,0,170,0,124,0,0,3,9,58,0,65,229,28] - 97 +2ms
  zigbee-herdsman:adapter:zStack:znp:AREQ <-- AF - incomingMsg - {"groupid":0,"clusterid":32,"srcaddr":10737,"srcendpoint":1,"dstendpoint":1,"wasbroadcast":0,"linkquality":21,"securityuse":0,"timestamp":8126634,"transseqnumber":0,"len":3,"data":{"type":"Buffer","data":[9,58,0]}} +176ms
  zigbee-herdsman:controller:log Received 'zcl' data '{"frame":{"Header":{"frameControl":{"frameType":1,"manufacturerSpecific":false,"direction":1,"disableDefaultResponse":false,"reservedBits":0},"transactionSequenceNumber":58,"manufacturerCode":null,"commandIdentifier":0},"Payload":{},"Command":{"ID":0,"parameters":[],"name":"checkin"}},"address":10737,"endpoint":1,"linkquality":21,"groupID":0,"wasBroadcast":false,"destinationEndpoint":1}' +136ms
  zigbee-herdsman:controller:device:log check-in from 0x8cf681fffe2a1662: declining fast-poll +135ms
  zigbee-herdsman:controller:endpoint Command 0x8cf681fffe2a1662/1 genPollCtrl.checkinRsp({"startFastPolling":false,"fastPollTimeout":0}, {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) +123ms
  zigbee-herdsman:controller:endpoint Request Queue (0x8cf681fffe2a1662/1): send checkinRsp request immediately (sendWhen=immediate) +2ms
  zigbee-herdsman:adapter:zStack:unpi:parser --- parseNext [254,5,69,196,68,68,1,198,44,111,254,23,68,129,0,0,32,0,68,68,1,1,0,76,0,51,48,124,0,0,3,9,12,0,198,44,28,49,254,23,68,129,0,0,32,0,249,239,1,1,0,91,0,128,62,124,0,0,3,9,98,0,108,155,28,254,254,5,69,196,161,26,1,62,232,232,254,23,68,129,0,0,32,0,161,26,1,1,0,54,0,30,128,124,0,0,3,9,78,0,62,232,28,19,254,23,68,129,0,0,32,0,178,110,1,1,0,51,0,137,163,124,0,0,3,9,45,0,178,110,29,173,254,5,69,196,37,6,1,108,155,81,254,5,69,196,241,41,1,65,229,249,254,23,68,129,0,0,32,0,37,6,1,1,0,87,0,100,243,124,0,0,3,9,89,0,108,155,28,213,254,23,68,129,0,0,32,0,241,41,1,1,0,21,0,178,245,124,0,0,3,9,59,0,65,229,28,141,254,5,69,196,68,68,1,198,44,111,254,23,68,129,0,0,32,0,68,68,1,1,0,76,0,219,38,125,0,0,3,9,13,0,198,44,28,207,254,23,68,129,0,0,32,0,249,239,1,1,0,91,0,40,53,125,0,0,3,9,99,0,108,155,28,93,254,5,69,196,161,26,1,62,232,232,254,23,68,129,0,0,32,0,161,26,1,1,0,54,0,117,118,125,0,0,3,9,79,0,62,232,28,142,254,23,68,129,0,0,32,0,178,110,1,1,0,54,0,32,153,125,0,0,3,9,46,0,178,110,29,57,254,5,69,196,37,6,1,108,155,81,254,5,69,196,241,41,1,65,229,249,254,23,68,129,0,0,32,0,37,6,1,1,0,87,0,180,233,125,0,0,3,9,90,0,108,155,28,29,254,23,68,129,0,0,32,0,241,41,1,1,0,21,0,122,235,125,0,0,3,9,60,0,65,229,28,93,254,5,69,196,68,68,1,198,44,111,254,23,68,129,0,0,32,0,68,68,1,1,0,76,0,34,30,126,0,0,3,9,14,0,198,44,28,14] +94ms
  zigbee-herdsman:adapter:zStack:unpi:parser --> parsed 5 - 2 - 5 - 196 - [68,68,1,198,44] - 111 +2ms
  zigbee-herdsman:adapter:zStack:znp:AREQ <-- ZDO - srcRtgInd - {"dstaddr":17476,"relaycount":1,"relaylist":[11462]} +96ms
  zigbee-herdsman:adapter:zStack:unpi:parser --- parseNext [254,23,68,129,0,0,32,0,68,68,1,1,0,76,0,51,48,124,0,0,3,9,12,0,198,44,28,49,254,23,68,129,0,0,32,0,249,239,1,1,0,91,0,128,62,124,0,0,3,9,98,0,108,155,28,254,254,5,69,196,161,26,1,62,232,232,254,23,68,129,0,0,32,0,161,26,1,1,0,54,0,30,128,124,0,0,3,9,78,0,62,232,28,19,254,23,68,129,0,0,32,0,178,110,1,1,0,51,0,137,163,124,0,0,3,9,45,0,178,110,29,173,254,5,69,196,37,6,1,108,155,81,254,5,69,196,241,41,1,65,229,249,254,23,68,129,0,0,32,0,37,6,1,1,0,87,0,100,243,124,0,0,3,9,89,0,108,155,28,213,254,23,68,129,0,0,32,0,241,41,1,1,0,21,0,178,245,124,0,0,3,9,59,0,65,229,28,141,254,5,69,196,68,68,1,198,44,111,254,23,68,129,0,0,32,0,68,68,1,1,0,76,0,219,38,125,0,0,3,9,13,0,198,44,28,207,254,23,68,129,0,0,32,0,249,239,1,1,0,91,0,40,53,125,0,0,3,9,99,0,108,155,28,93,254,5,69,196,161,26,1,62,232,232,254,23,68,129,0,0,32,0,161,26,1,1,0,54,0,117,118,125,0,0,3,9,79,0,62,232,28,142,254,23,68,129,0,0,32,0,178,110,1,1,0,54,0,32,153,125,0,0,3,9,46,0,178,110,29,57,254,5,69,196,37,6,1,108,155,81,254,5,69,196,241,41,1,65,229,249,254,23,68,129,0,0,32,0,37,6,1,1,0,87,0,180,233,125,0,0,3,9,90,0,108,155,28,29,254,23,68,129,0,0,32,0,241,41,1,1,0,21,0,122,235,125,0,0,3,9,60,0,65,229,28,93,254,5,69,196,68,68,1,198,44,111,254,23,68,129,0,0,32,0,68,68,1,1,0,76,0,34,30,126,0,0,3,9,14,0,198,44,28,14] +5ms
  zigbee-herdsman:adapter:zStack:unpi:parser --> parsed 23 - 2 - 4 - 129 - [0,0,32,0,68,68,1,1,0,76,0,51,48,124,0,0,3,9,12,0,198,44,28] - 49 +2ms
  zigbee-herdsman:adapter:zStack:znp:AREQ <-- AF - incomingMsg - {"groupid":0,"clusterid":32,"srcaddr":17476,"srcendpoint":1,"dstendpoint":1,"wasbroadcast":0,"linkquality":76,"securityuse":0,"timestamp":8138803,"transseqnumber":0,"len":3,"data":{"type":"Buffer","data":[9,12,0]}} +7ms

<--- Last few GCs --->

[1278:0x4583a58]  2596805 ms: Mark-sweep 120.0 (129.3) -> 119.5 (129.0) MB, 8820.0 / 0.0 ms  (average mu = 0.092, current mu = 0.023) allocation failure; scavenge might not succeed
[1278:0x4583a58]  2597020 ms: Scavenge 120.2 (129.0) -> 119.7 (129.0) MB, 13.1 / 0.0 ms  (average mu = 0.092, current mu = 0.023) allocation failure;
[1278:0x4583a58]  2597095 ms: Scavenge 120.2 (129.0) -> 119.7 (129.3) MB, 4.5 / 0.0 ms  (average mu = 0.092, current mu = 0.023) allocation failure;

<--- JS stacktrace --->

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
Aborted
zigbee2mqtt@rfpi:~$

@Koenkk any advice?

zigbee2mqtt@rfpi:~$ node --version
v18.15.0

I updated to the latest version (via git pull and npm ci) but the issue still occurs...
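The `unpi:parser` lines in the debug output above print each frame as `length - type - subsystem - command - [payload] - checksum`. Here is a rough Python sketch that decodes the first frame of the `parseNext` buffer. It assumes the usual Z-Stack UNPI layout (start byte 254, length, two command bytes, payload, XOR checksum). This is my reading of the log, not code taken from zigbee-herdsman, and `parse_unpi_frame` is a made-up name.

```python
from functools import reduce

SOF = 0xFE  # start-of-frame marker (254 in the log)


def parse_unpi_frame(buf: list[int]) -> dict:
    """Decode the first UNPI frame at the start of buf and verify its checksum."""
    if buf[0] != SOF:
        raise ValueError("buffer does not start with SOF (254)")
    length, cmd0, cmd1 = buf[1], buf[2], buf[3]
    payload = buf[4:4 + length]
    fcs = buf[4 + length]
    # The checksum is the XOR of every byte between SOF and the checksum itself.
    expected = reduce(lambda a, b: a ^ b, buf[1:4 + length])
    return {
        "length": length,
        "type": cmd0 >> 5,         # upper 3 bits: 2 is logged as AREQ
        "subsystem": cmd0 & 0x1F,  # lower 5 bits: 4 is logged as AF, 5 as ZDO
        "command": cmd1,
        "payload": payload,
        "fcs": fcs,
        "fcs_ok": fcs == expected,
    }


# First frame from the parseNext buffer in the log above.
frame = [254, 23, 68, 129, 0, 0, 32, 0, 241, 41, 1, 1, 0, 21, 0, 170, 0,
         124, 0, 0, 3, 9, 58, 0, 65, 229, 28, 97]

print(parse_unpi_frame(frame))
```

The decoded fields match the `--> parsed 23 - 2 - 4 - 129 - [...] - 97` line, and payload bytes 241, 41 read little-endian give 10737, the `srcaddr` in the AREQ line that follows it.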

Koenkk commented 1 year ago

Can you provide the herdsman debug logging from the start of z2m until the crash, and tell me the time at which the CPU jumps to 100%?

See https://www.zigbee2mqtt.io/guide/usage/debug.html on how to enable the herdsman debug logging. Note that this is only logged to STDOUT and not to log files.

segdy commented 1 year ago

Thanks @Koenkk. One issue: the CPU is at 100% (sometimes a bit less) from the very beginning. z2m is extremely laggy, and after ~30 min it dies with a message like the one above.

Note it has been working flawlessly for weeks and the issue randomly showed up 2 days ago. Apart from upgrading mosquitto, I also upgraded the firmware of some devices (mainly Ikea blinds and plugs). Is there a way to isolate the issue without having to rebuild my entire Zigbee network (>>50 devices)?

I have created a full log (a few MB; z2m ran for ~55 min at 100% CPU before dying with the message above). What is the best way to provide you with the log file (in case it's even useful)? Since it contains lots of private info, I'd prefer not to upload it where it's publicly accessible, if at all possible.

Koenkk commented 1 year ago

You can send it on telegram (@koenkk)

segdy commented 1 year ago

@Koenkk Thanks! I tried installing Telegram, but I get an error message when registering with my phone number (I'm not using Telegram yet). Is it (hopefully) also OK to just use this link?

https://www.dropbox.com/s/xdsbv9mdn1pvzc9/debug.log.gz?dl=0

I'll remove the file once you have downloaded it.

Thank you very much!!

Koenkk commented 1 year ago

This is the normal z2m debug logging, I need the herdsman debug logging.

See https://www.zigbee2mqtt.io/guide/usage/debug.html on how to enable the herdsman debug logging. Note that this is only logged to STDOUT and not to log files.

segdy commented 1 year ago

Oh indeed. It was on the screen but got lost in the file. I guess you meant to say “Note that this is only logged to STDERR and not to log files.”

I added “2>&1|tee…” and it goes into the file now. I’m repeating…
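The redirect matters because, as far as I know, the `debug` npm package that prints the herdsman lines writes to STDERR by default, so a plain `> file` or `| tee` only sees STDOUT. Here is a minimal Python sketch of the same idea as `2>&1 | tee`. The helper name `run_and_tee` and the log file name are hypothetical, and the child command is a stand-in that writes to both streams, not Zigbee2MQTT itself.

```python
import os
import subprocess
import sys
import tempfile


def run_and_tee(cmd: list[str], log_path: str) -> int:
    """Run cmd with stderr merged into stdout, echoing each line and saving it to log_path."""
    with open(log_path, "w", encoding="utf-8") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # same effect as the shell's 2>&1
            text=True,
        )
        for line in proc.stdout:
            sys.stdout.write(line)  # still visible on screen, like tee
            log.write(line)
        return proc.wait()


# Stand-in child process that writes one line to each stream.
child = [sys.executable, "-c",
         "import sys; print('to stdout'); print('to stderr', file=sys.stderr)"]
log_file = os.path.join(tempfile.gettempdir(), "z2m-debug-example.log")
exit_code = run_and_tee(child, log_file)
```

Without `stderr=subprocess.STDOUT` only the `to stdout` line would reach the log file, which is the same effect as the herdsman output appearing on screen but not in the file.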

segdy commented 1 year ago

Ok, now the full log is here: https://www.dropbox.com/s/qymi5hlssci74qg/debug.log.gz?dl=0

Koenkk commented 1 year ago

Can you try https://github.com/Koenkk/zigbee2mqtt/issues/17923 ? I see the IKEA blinds are checking-in a lot.
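One way to see which devices check in most often is to count the `check-in from <address>` lines in the herdsman debug output. Below is a small Python sketch. The function name `count_check_ins` is made up, and the two sample lines are copied from the log excerpt earlier in this thread, so the count here is only 1.

```python
import re
from collections import Counter

# IEEE address as printed by herdsman, e.g. 0x8cf681fffe2a1662
CHECK_IN = re.compile(r"check-in from (0x[0-9a-f]{16})")


def count_check_ins(lines) -> Counter:
    """Count poll-control check-ins per device address in herdsman debug output."""
    counts = Counter()
    for line in lines:
        m = CHECK_IN.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts


sample = [
    "zigbee-herdsman:controller:device:log check-in from 0x8cf681fffe2a1662: declining fast-poll +135ms",
    "zigbee-herdsman:controller:endpoint Request Queue (0x8cf681fffe2a1662/1): send checkinRsp request immediately (sendWhen=immediate) +2ms",
]

for address, n in count_check_ins(sample).most_common():
    print(address, n)
```

Any iterable of lines works, so an open log file can be passed in directly. The addresses at the top of the output are the ones checking in most often.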

segdy commented 1 year ago

Thanks, this sounds very much like it! (The issue started exactly after I did these upgrades.)

I have done

git fetch origin dev
git checkout latest-dev
git pull
npm ci

which I believe should get me the patch (version shown as "1.31.0-dev commit: fd1622b") but the CPU is still at 100%...

Koenkk commented 1 year ago

Can you re-configure your blinds and make sure that succeeds? (yellow refresh button in the z2m frontend -> device page)

segdy commented 1 year ago

Thanks for the suggestion.

Ok I actually removed the batteries from all the IKEA blinds. That should make them silent, right?

But yeah, sadly it’s still at 100%…

Turns out the startup took 10min. CPU is down to normal now (3-5% for node). Yaaaaaay!!!

Regarding the dev branch, I’m actually not sure whether I really got it. I think the documentation is outdated. I posted a question here: https://github.com/Koenkk/zigbee2mqtt/discussions/17938

But it might just be worth waiting for the fix to reach the master branch. How long do you think this will take roughly? Days, weeks, months?

Koenkk commented 1 year ago

I will create a hotfix release today.

OlegKarasik commented 1 year ago

Hi @Koenkk,

Looks like the same issue reproduces with Perenio PEHPL0X plugs. As soon as I add them to the network, they report a lot of information; eventually the MQTT connection is lost and Z2M crashes with an out-of-memory error.