Koenkk / zigbee2mqtt

Zigbee 🐝 to MQTT bridge 🌉, get rid of your proprietary Zigbee bridges 🔨
https://www.zigbee2mqtt.io
GNU General Public License v3.0
11.82k stars 1.64k forks source link

Zigbee2MQTT service stops working - FATAL ERROR #14853

Closed pauloon closed 1 year ago

pauloon commented 1 year ago

What happened?

After last update I've noticed Z2M is stopping service with a fatal error, out of the blue.

I'm using HASS.IO all updated, in a i5 machine with 8 Gb do memory and 128 GB SSD.

What did you expect to happen?

No response

How to reproduce it (minimal and precise)

Just leave it running. After one or two days it stops working (service drops).

Zigbee2MQTT version

1.28.2 commit: unknown

Adapter firmware version

20220219

Adapter

SONOFF USB Dongle

Debug log

I did not have the debug log active when this happened. Posting normal log. ... Zigbee2MQTT:info 2022-11-07 11:45:15: MQTT publish: topic 'zigbee2mqtt/Smart Plug 15', payload '{"child_lock":"UNLOCK","current":0.04,"energy":6.45,"indicator_mode":"off/on","last_seen":"2022-11-07T11:45:13-03:00","linkquality":102,"power":0,"power_outage_memory":"restore","state":"ON","update":{"state":"idle"},"update_available":false}' <--- Last few GCs ---> [7:0x7fa21c7993c0] 61013533 ms: Mark-sweep 2044.2 (2085.3) -> 2042.2 (2085.3) MB, 2087.8 / 0.0 ms (average mu = 0.133, current mu = 0.010) allocation failure scavenge might not succeed [7:0x7fa21c7993c0] 61015639 ms: Mark-sweep 2044.3 (2085.3) -> 2042.2 (2085.3) MB, 2082.8 / 0.0 ms (average mu = 0.074, current mu = 0.011) allocation failure scavenge might not succeed <--- JS stacktrace ---> FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory

pauloon commented 1 year ago

Hi,

Just had the same problem again. Attached the log file: log.txt

Please, help. Thanks, Paulo.

pauloon commented 1 year ago

I reinstalled Z2M all from zero and today it happened again:

2022-11-09T17:11:10.996Z zigbee-herdsman:adapter:zStack:znp:AREQ <-- AF - incomingMsg - {"groupid":0,"clusterid":61184,"srcaddr":55137,"srcendpoint":1,"dstendpoint":1,"wasbroadcast":0,"linkquality":61,"securityuse":0,"timestamp":11657102,"transseqnumber":0,"len":10,"data":{"type":"Buffer","data":[9,72,2,0,193,1,4,0,1,0]}} 2022-11-09T17:11:10.996Z zigbee-herdsman:controller:log Received 'zcl' data '{"frame":{"Header":{"frameControl":{"frameType":1,"manufacturerSpecific":false,"direction":1,"disableDefaultResponse":false,"reservedBits":0},"transactionSequenceNumber":72,"manufacturerCode":null,"commandIdentifier":2},"Payload":{"seq":49408,"dpValues":[{"dp":1,"datatype":4,"data":{"type":"Buffer","data":[0]}}]},"Command":{"ID":2,"parameters":[{"name":"seq","type":33},{"name":"dpValues","type":1011}],"name":"dataReport"}},"address":55137,"endpoint":1,"linkquality":61,"groupID":0,"wasBroadcast":false,"destinationEndpoint":1}' Zigbee2MQTT:debug 2022-11-09 14:11:10: Received Zigbee message from 'Presença 2', type 'commandDataReport', cluster 'manuSpecificTuya', data '{"dpValues":[{"data":{"data":[0],"type":"Buffer"},"datatype":4,"dp":1}],"seq":49408}' from endpoint 1 with groupID 0 <--- Last few GCs ---> [7:0x7f45b57b73c0] 84373002 ms: Scavenge (reduce) 2042.8 (2083.1) -> 2041.9 (2083.1) MB, 2.5 / 0.0 ms (average mu = 0.105, current mu = 0.019) allocation failure [7:0x7f45b57b73c0] 84373008 ms: Scavenge (reduce) 2043.0 (2083.1) -> 2042.1 (2083.1) MB, 2.7 / 0.0 ms (average mu = 0.105, current mu = 0.019) allocation failure [7:0x7f45b57b73c0] 84373015 ms: Scavenge (reduce) 2043.0 (2083.1) -> 2042.1 (2083.3) MB, 2.6 / 0.0 ms (average mu = 0.105, current mu = 0.019) allocation failure <--- JS stacktrace ---> FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory

@Koenkk ... can you, please, help?

Thank you. Paulo.

Koenkk commented 1 year ago

What node version are you using?

pauloon commented 1 year ago

What node version are you using?

Thanks for helping me. Not sure what you mean by "node". Sorry for my ignorance. Are you talking about NodeRed ? I'm using version 13.5.2.

Follows attached a diagnostic file with some info about my system. config_entry-mqtt-5f9bbf81bfa29cb6923246ba774884c4.json.txt

Please, let me know.

Thanks, Paulo.

Koenkk commented 1 year ago

I mean node.js (try running node --version on your system)

pauloon commented 1 year ago

I mean node.js (try running node --version on your system)

Hi @Koenkk , I tried and it does not work: image

Maybe because I'm using Hass.io and NodeRed?

Thanks, Paulo.

Koenkk commented 1 year ago

how are you running z2m, via the addon?

pauloon commented 1 year ago

how are you running z2m, via the addon?

Dear @Koenkk , Yes, I've installed it via Add-On. image

See if this helps... image

Thanks, Paulo.

pauloon commented 1 year ago

If I turn off the watchdog it stops after a few hours. Just happened again... We are in a group of more than 2.000 HA users and more people are getting this too. Looks like Hass.io users...

pauloon commented 1 year ago

@Koenkk .... Look! With help from some friends, i think I got the node.js version: image

Will wait for your instructions.

Thanks, Paulo.

Koenkk commented 1 year ago

Can you provide the herdsman debug logging, from starting z2m till the error?

pauloon commented 1 year ago

Hi @Koenkk ,

Of course. Is that the one I get from Settings > System > Logs > Z2M, copy and save? image

Is there a way to access former logs? I'm waiting for it to crash again to get the log.

Thanks, Paulo.

Koenkk commented 1 year ago

I have not seen this screen before, how big is the log? (does it include everything from start)

pauloon commented 1 year ago

I have not seen this screen before, how big is the log? (does it include everything from start)

Dear @Koenkk , This is the screen I mentioned before. I'm not sure if it includes all of it. I just grabbed all of its contents so you can see: log.txt Please let me know if there are any other ways to get the logs. I could get the logs from the former crashes.

I'm still waiting so it crashes again but since I'm not home these days maybe it "cooled down" a little.

Thanks, Paulo.

Koenkk commented 1 year ago

These are not the herdsman debug logging. (they contain zigbee-herdsman like in https://github.com/Koenkk/zigbee2mqtt/issues/14853#issuecomment-1312524794).

See https://www.zigbee2mqtt.io/guide/usage/debug.html#home-assistant-os-supervised-addon how to get the logging.

pauloon commented 1 year ago

These are not the herdsman debug logging. (they contain zigbee-herdsman like in #14853 (comment)).

See https://www.zigbee2mqtt.io/guide/usage/debug.html#home-assistant-os-supervised-addon how to get the logging.

Dear @Koenkk ,

I GOT ANOTHER CRASH! I copied the entire log: log2.txt

Please, let me know.

Thanks, Paulo.

Koenkk commented 1 year ago

can you turn off Espelho 2 4gang and see if the error still occurs?

pauloon commented 1 year ago

can you turn off Espelho 2 4gang and see if the error still occurs?

I will remove it and restart Z2M, ok?

Thanks, Paulo.

Koenkk commented 1 year ago

Ok 👍

XanderTenBoden commented 1 year ago

I seem to have the same issue here (I opened an issue at https://github.com/Koenkk/zigbee2mqtt/issues/14992 before finding this one). I'm also using node version v16.17.1, so perhaps it has something to do with that indeed? I'm still trying to find the logs.

By the way, for me it happens much more often, sometimes 4 to 5 times on one evening.

pauloon commented 1 year ago

Ok 👍

I got another crash today, so I guess it is not "Espelho 2 4gang" fault. Now I reseted it again and I'm testing one more.

XanderTenBoden commented 1 year ago

Mine crashed as well. The Z2M logs gave me the following error:

Error: SRSP - AF - dataRequestExt after 6000ms
    at Timeout._onTimeout (/app/node_modules/zigbee-herdsman/src/utils/waitress.ts:64:35)
    at listOnTimeout (node:internal/timers:559:17)
    at processTimers (node:internal/timers:502:7)

These are not the herdsman debug logging. (they contain zigbee-herdsman like in #14853 (comment)).

See https://www.zigbee2mqtt.io/guide/usage/debug.html#home-assistant-os-supervised-addon how to get the logging.

I am not able to retrieve the herdsman logs using this method, because the docker container just seems to crash in my case. So when I run docker ps, there isn't a z2m one running after the crash.

pauloon commented 1 year ago

Dear @Koenkk ,

After I removed the Espelho 2 4gang, it is still happening. Follows the last log extracted: log4.txt

Please, let me know.

Thanks, Paulo.

XanderTenBoden commented 1 year ago

@pauloon does the docker container also completely stop in your case?

pauloon commented 1 year ago

@pauloon does the docker container also completely stop in your case?

Hi @XanderTenBoden I think the docker is crashing too. One time that I ran docker ps and didn't see it. I'm going to let it happen again, and check to make sure.

Thanks, Paulo.

XanderTenBoden commented 1 year ago

@pauloon What I did is turning off the watchdog and then wait for it to stop working. As soon as that did, all my zigbee devices in lovelace turned to "unavailable", and the zigbee2mqtt add-on wasn't running anymore.

It seems like the issue has something to do with the z2m docker eating too much RAM. I've enabled the memory sensors and this is what I'm seeing in the history:

https://tenboden.com/ram.png

I do have a lot of devices connected, about 90. How many do you have? It might explain why the crashes are more frequently for me if you have less than that connected? 🤔

pauloon commented 1 year ago

@pauloon What I did is turning off the watchdog and then wait for it to stop working. As soon as that did, all my zigbee devices in lovelace turned to "unavailable", and the zigbee2mqtt add-on wasn't running anymore.

It seems like the issue has something to do with the z2m docker eating too much RAM. I've enabled the memory sensors and this is what I'm seeing in the history:

https://tenboden.com/ram.png

I do have a lot of devices connected, about 90. How many do you have? It might explain why the crashes are more frequently for me if you have less than that connected? 🤔

@XanderTenBoden , I have 51 devices. I'm also keeping watchdog disabled, otherwise it restarts Z2M. I'll try to check the memory history, same as you did.

Thanks, Paulo.

Koenkk commented 1 year ago

@pauloon I also see Presença 4 and Presença 2 are spamming a lot, can you also try to remove those?

pauloon commented 1 year ago

@pauloon I also see Presença 4 and Presença 2 are spamming a lot, can you also try to remove those?

Hi @Koenkk ,

I will try turn them off, to stop the traffic. These are mmWave presence sensors, which I have four. I know they push a lot of data from one expose called "target_distance" that pushes every second, but ONLY when there is someone being detected. As far as I understood it is not possible to turn off this reporting. I added this expose in the "filters" so it reduces MQTT data (I don't use it). I'm guessing these are not guilty, because the crashes that happened a few days ago I was traveling, and they were reporting much less data.

I'll keep you posted.

Thanks, Paulo.

pauloon commented 1 year ago

@Koenkk ,

Just got another crash. Very fast, this time.

Follows the log: log5.txt

I just had restarted HA and Z2M after turning off the sensors.

@XanderTenBoden , Indeed, the contained crashed too. But I did not get the same memory spique that you got. image

Please, let me know, Thanks, Paulo.

XanderTenBoden commented 1 year ago

@pauloon how do you get these logs?

When I use the docker logs CONTAINER_ID > log.txt 2>&1 command I only get like 15 lines max.

pauloon commented 1 year ago

@pauloon how do you get these logs?

When I use the docker logs CONTAINER_ID > log.txt 2>&1 command I only get like 15 lines max.

@XanderTenBoden ,

Settings > Add-Ons > Z2M > Logs image

Rgds, Paulo.

XanderTenBoden commented 1 year ago

Mine stopped responding as we speak, however the container is still running. Logs I got now are these:

[19:00:24] INFO: Preparing to start...
[19:00:25] INFO: Socat not enabled
[19:00:27] INFO: Starting Zigbee2MQTT...
Zigbee2MQTT:warn  2022-11-17 19:00:33: `permit_join` set to  `true` in configuration.yaml.
Zigbee2MQTT:warn  2022-11-17 19:00:33: Allowing new devices to join.
Zigbee2MQTT:warn  2022-11-17 19:00:33: Set `permit_join` to `false` once you joined all devices.
Zigbee2MQTT:error 2022-11-17 19:00:59: Publish 'set' 'state' to 'front_door_light' failed: 'Error: Command 0x00158d00032cadb0/1 genOnOff.off({}, {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (Timeout - 39508 - 1 - 6 - 6 - 11 after 10000ms)'
Zigbee2MQTT:error 2022-11-17 19:01:02: Publish 'set' 'brightness' to 'living_room_desk_led_strip' failed: 'Error: Command 0xa4c138e0bfda4062/1 genLevelCtrl.moveToLevelWithOnOff({"level":254,"transtime":0}, {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (Data request failed with error: 'No network route' (205))'
Zigbee2MQTT:error 2022-11-17 19:01:24: Publish 'set' 'color' to 'living_room_desk_led_strip' failed: 'Error: Command 0xa4c138e0bfda4062/1 lightingColorCtrl.moveToColor({"transtime":0,"colorx":21168,"colory":21561}, {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (Data request failed with error: 'No network route' (205))'
Zigbee2MQTT:error 2022-11-17 20:52:08: Publish 'set' 'state' to 'office_spots' failed: 'Error: Command 0x00158d0006371aab/1 genOnOff.on({}, {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (Timeout - 40507 - 1 - 148 - 6 - 11 after 10000ms)'
Zigbee2MQTT:error 2022-11-17 21:53:40: Publish 'set' 'state' to 'office_spots' failed: 'Error: Command 0x00158d0006371aab/1 genOnOff.on({}, {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (Timeout - 40507 - 1 - 153 - 6 - 11 after 10000ms)'
Zigbee2MQTT:error 2022-11-17 21:54:20: Publish 'set' 'state' to 'office_spots' failed: 'Error: Command 0x00158d0006371aab/1 genOnOff.on({}, {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (Timeout - 40507 - 1 - 164 - 6 - 11 after 10000ms)'
XanderTenBoden commented 1 year ago

Mine crashed as well. The Z2M logs gave me the following error:

Error: SRSP - AF - dataRequestExt after 6000ms
    at Timeout._onTimeout (/app/node_modules/zigbee-herdsman/src/utils/waitress.ts:64:35)
    at listOnTimeout (node:internal/timers:559:17)
    at processTimers (node:internal/timers:502:7)

These are not the herdsman debug logging. (they contain zigbee-herdsman like in #14853 (comment)). See https://www.zigbee2mqtt.io/guide/usage/debug.html#home-assistant-os-supervised-addon how to get the logging.

I am not able to retrieve the herdsman logs using this method, because the docker container just seems to crash in my case. So when I run docker ps, there isn't a z2m one running after the crash.

I'm looking at this error now that I had yesterday, and when googling on that I get a result on this page that seems to crash on the exact same line (waitress.ts:64:35): https://github.com/Koenkk/zigbee2mqtt/issues/9136#issuecomment-944098300

Not sure if I'm onto something though, but perhaps it's worth mentioning?

XanderTenBoden commented 1 year ago

I've got a new crash and log here:

log-z2m-xander-18-11-2022.txt

@pauloon I seem to have a lot less debugging data. How is your Z2M debugging config set up?

pauloon commented 1 year ago

I've got a new crash and log here:

log-z2m-xander-18-11-2022.txt

@pauloon I seem to have a lot less debugging data. How is your Z2M debugging config set up?

Dear @XanderTenBoden ,

I guess is because I have these parameters configured (Herdsman Debug Active):

Settings > Add-Ons > Z2M > Configuration tab image

And in my Z2M "configuration.yaml":

advanced: log_level: debug

Hope this helps.

Thanks, Paulo.

XanderTenBoden commented 1 year ago

Ah, That checkbox was hidden in my case, I turned it to "on".

XanderTenBoden commented 1 year ago

It crashed again. I do have a little bit more logging now, but since the docker container crashed again, I can only retrieve the last lines that are shown in the Z2M add-on's interface in HA:

https://tenboden.com/log-z2m-xander-18-11-2022-2.txt

@Koenkk do you notice anything strange in here?

pauloon commented 1 year ago

@Koenkk @XanderTenBoden Just got another crash...

The memory got the behaviour you said before: image

Follows the log: log6.txt

The container also crashed. The mmWave Tuya presence sensors were disabled (I don't believe it is a device specific problem).

Please, let me know if I can help anyway.

Thanks, Paulo.

pauloon commented 1 year ago

It really do looks like there is a bug that makes something consume memory until it crashes: image

XanderTenBoden commented 1 year ago

Mine just crashed 2 minutes (!) after restarting it.

log-z2m-xander-19-11-2022.txt

5 minutes later: Aaannnd again. Basically, my Home Assistant is completely unusable by now 🥲

pauloon commented 1 year ago

@Koenkk ... Please, help us. I have other fellows that are suffering the same with the HassOS. Memory use keeps building up: image

Probably it is a glitch or bug to cause this. If you need remote access to my HA to investigate, we can arrange that.

Thanks, Paulo.

XanderTenBoden commented 1 year ago

@pauloon what I've also noticed is that before actually crashing the docker container, my devices stop responding to any interaction. In other words: Z2M is still 'running', but lights can't be toggled and stuff like that. Do you see the same behaviour?

pauloon commented 1 year ago

@pauloon what I've also noticed is that before actually crashing the docker container, my devices stop responding to any interaction. In other words: Z2M is still 'running', but lights can't be toggled and stuff like that. Do you see the same behaviour?

I didn't notice that, but it might be possible. It took a while for me to check the Z2M after noticing that it had crashed. Maybe it was "frozen" before crashing. Not sure.

Paulo.

pauloon commented 1 year ago

PS: I'm been talking to people and almost everyone has watchdog on, so I guess there's more people being impacted by this problem, but they just don't notice. Here, if I leave watchdog on, I don't notice anything because I guess it gets restarted quickly.

Thanks, Paulo.

XanderTenBoden commented 1 year ago

PS: I'm been talking to people and almost everyone has watchdog on, so I guess there's more people being impacted by this problem, but they just don't notice. Here, if I leave watchdog on, I don't notice anything because I guess it gets restarted quickly.

Thanks, Paulo.

Yes maybe. I found out about the issue because of the non-responsiveness I mentioned before. And because some automations that were sending TTS messages to my Google Home tend to trigger when restarting Z2M.

pauloon commented 1 year ago

Memory is coming to the limit... image

Let's see if it crashes again....

pauloon commented 1 year ago

Just crashed again.... image

log7.txt

Container crashed too...

pauloon commented 1 year ago

Memory seems to keep building up constantly: image

Now I'm making another test... I've made a new install of HassOS on another computer, pretty much clean, just MQTT, M2Z and Node-Red. The memory seems to have the same behaviour, but in a much slower pace (maybe because I've added only two devices?): image

Could be it an indication of a bug with the HassOS and the Z2M add-on?

XanderTenBoden commented 1 year ago

@pauloon it seems like i've resolved the issue here.

The issue was that I had some power plugs with consumption-statistics that were pushing these statistics every second. After changing this to once every 10 seconds, my Z2M no longer seems to crash and the memory is at a table level.