Koenkk / zigbee2mqtt

Zigbee 🐝 to MQTT bridge 🌉, get rid of your proprietary Zigbee bridges 🔨
https://www.zigbee2mqtt.io
GNU General Public License v3.0

Z2M going down by itself without an apparent reason #19371

Open nicola-spreafico opened 10 months ago

nicola-spreafico commented 10 months ago

What happened?

I have Z2M installed in my Home Assistant instance, running on a Raspberry Pi 3.

As the USB dongle I have a "SONOFF ZigBee 3.0 USB Dongle Plus, TI CC2652P", flashed with the coordinator firmware https://github.com/Koenkk/Z-Stack-firmware/blob/master/coordinator/Z-Stack_3.x.0/bin/CC1352P2_CC2652P_launchpad_coordinator_20230507.zip

Right now I have paired 4 Aqara door sensors, 2 Aqara temperature sensors and 3 smart plugs.

Z2M is shutting down by itself without an apparent reason. When the watchdog was enabled you can see that the process was restarted once every hour (screenshot attached).

Now I have disabled the watchdog and enabled zigbee_herdsman_debug. Here is the log of the last start; you can see at the end that no error at all is listed, but the process went down anyway: log.txt

To stay on top of this I keep the watchdog enabled for sure, but this is not a permanent solution, as I need to discover why this is happening. I also checked the HA logs around the time of the last entry in the Z2M log, but even there I'm only getting some errors related to other components, nothing that could explain why the process is going down.

You can see here that the data became unavailable at 14:01:13 (screenshot attached).

The last entry in the Z2M log is:

debug 2023-10-21 14:00:33: Received Zigbee message from 'piano1_cucina_microonde_presa', type 'attributeReport', cluster 'genBasic', data '{"65506":54,"65508":0,"appVersion":192}' from endpoint 1 with groupID 0

And there is nothing relevant in the HA logs around that timestamp, only errors related to other components:

ValueError: Sensor sensor.download_mbps has device class 'None', state class 'None' unit 'Mbps' and suggested precision 'None' thus indicating it has a numeric value; however, it has the non-numeric value: 'In attesa' (<class 'str'>)
2023-10-21 13:51:10.516 ERROR (MainThread) [homeassistant.components.template.template_entity] TemplateError('ValueError: Template error: as_timestamp got invalid input 'unknown' when rendering template '{% if (state_attr('sensor.fritzbox','uptime'))  == none %}
  In attesa
{% else %}
  {{ as_timestamp(state_attr('sensor.fritzbox','uptime')) | timestamp_custom('%d/%m/%Y') }}
{% endif %}' but no default was specified') while processing template 'Template<template=({% if (state_attr('sensor.fritzbox','uptime'))  == none %}
  In attesa
{% else %}
  {{ as_timestamp(state_attr('sensor.fritzbox','uptime')) | timestamp_custom('%d/%m/%Y') }}
{% endif %}) renders=6>' for attribute '_attr_native_value' in entity 'sensor.ultimo_riavvio_fritz'
2023-10-21 13:51:11.986 ERROR (MainThread) [homeassistant.helpers.template] Template variable error: 'value_json' is undefined when rendering '{{ value_json.last_update_statistiche_fritzbox }}'
2023-10-21 13:51:11.996 ERROR (MainThread) [homeassistant.helpers.template] Template variable error: 'value_json' is undefined when rendering '{{ value_json.last_update_centro_controllo_elettrodomestici }}'
2023-10-21 13:51:12.024 ERROR (MainThread) [homeassistant.helpers.template] Template variable error: 'value_json' is undefined when rendering '{{ value_json.last_update_centro_controllo_elettrodomestici }}'
2023-10-21 13:51:12.039 ERROR (MainThread) [homeassistant.helpers.template] Template variable error: 'value_json' is undefined when rendering '{{ value_json.last_update_centro_controllo_energia }}'
2023-10-21 13:51:12.068 ERROR (MainThread) [homeassistant.helpers.template] Template variable error: 'value_json' is undefined when rendering '{{ value_json.last_update_centro_controllo_elettrodomestici }}'
2023-10-21 13:51:12.139 ERROR (MainThread) [homeassistant.helpers.template] Template variable error: 'value_json' is undefined when rendering '{{ value_json.last_update_centro_controllo_elettrodomestici }}'
2023-10-21 14:01:28.639 ERROR (MainThread) [homeassistant.components.hassio.handler] Timeout on /core/stats request
2023-10-21 14:01:28.662 ERROR (MainThread) [homeassistant.components.hassio] Error fetching hassio data: Error on Supervisor API: 

Any help will be much appreciated, as at this time I don't even know why it is going down.

What did you expect to happen?

The process keeps running.

How to reproduce it (minimal and precise)

Unknown cause at the moment

Zigbee2MQTT version

Zigbee2MQTT Current version: 1.33.1-1

Adapter firmware version

Coordinator type: zStack3x0 | Coordinator revision: 20230507

Adapter

SONOFF ZigBee 3.0 USB Dongle Plus, TI CC2652P Coordinator

Debug log

No response

passy05 commented 10 months ago

You should downgrade the firmware of your coordinator (see https://github.com/Koenkk/Z-Stack-firmware/discussions/478). Unfortunately for me it did not completely solve my problem (recently devices have been unresponsive or barely responsive in the evening), but I no longer get timeout problems on Z2M restart.

nicola-spreafico commented 10 months ago

@passy05 I disabled most of the packages I have in the packages folder (basically I renamed the yaml files with a different extension) and it now seems to work; at least it has now been 24h and it hasn't stopped yet. I'm not sure what kind of hint this may provide, but I enabled all the debug logs I could think of and there is no trace at all of any suspicious event that might stop the process from running.

nicola-spreafico commented 10 months ago

Hello @Koenkk, my problem is slightly different from the others: I'm not getting actual disconnections from the devices; instead the Z2M instance is being rebooted every 60 minutes, which is too precise to be a coincidence.

You can see from here that at every boot it creates a new log folder (screenshot attached), but the problem is that I'm not getting any error in the logs; the file simply ends with the last action and then it creates the new folder.

Here are my latest details: Zigbee2MQTT version 1.33.1 (commit: unknown), Coordinator type zStack3x0, Coordinator revision 20230923.

Mind that I was previously using firmware 20230507; I tried to update, but the result is the same. The process gets restarted every time only because I have the watchdog monitoring it; otherwise the process would stay down because of this problem.

I have no real idea what the problem is here; any suggestions?

Koenkk commented 10 months ago

@nicola-spreafico from what I can see this indeed doesn't happen because of a z2m crash; z2m is simply killed (externally). How are you running z2m? Maybe your system runs out of memory?
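A quick first check, assuming shell access to the host and standard procps tools (a minimal sketch, not an HAOS-specific recipe):

```sh
# Overall memory and swap usage right now
free -h

# Sample memory once every 10 seconds; watch the "free" and "si/so" columns
# around the time of the hourly kill to catch a spike
vmstat 10
```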

nicola-spreafico commented 10 months ago

@Koenkk let me give you some additional information

I'm running a Raspberry Pi 3 with HAOS, and Z2M is installed as an add-on. The total memory is 1 GB.

(screenshot attached) If it were a memory problem I would expect the crashes to be more randomly distributed, whereas here there is a very regular 60-minute cadence (give or take a few seconds).

Right now the memory usage seems normal (screenshot attached).

Koenkk commented 10 months ago

I guess the HA supervisor then kills the add-on for some reason; you should check those logs to debug further.
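For reference, a minimal way to look for such kills, assuming shell access to the HAOS host (the `ha` CLI is the standard HAOS console command; the time window is illustrative):

```sh
# Supervisor logs, where add-on stops/restarts are recorded
ha supervisor logs

# Kernel messages mentioning the OOM killer over the last day
journalctl -k --since "1 day ago" | grep -i -E "oom|killed process"
```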

nicola-spreafico commented 10 months ago

@Koenkk you can see here that in the time period when Z2M was rebooted (10:59) the memory monitor appears to be missing data, given the straight line: (screenshot attached)

nicola-spreafico commented 10 months ago

@Koenkk after another deep dive into the logs, I finally found something that may be of help:

Nov 01 13:56:07 homeassistant kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/docker-1b2f46f996d7235aef6ebaaba5fdc477dd3d9e887f766051966bf42e8006bf8f.scope,task=node,pid=538034,uid=0
Nov 01 13:56:07 homeassistant kernel: Out of memory: Killed process 538034 (node) total-vm:324124kB, anon-rss:56172kB, file-rss:4924kB, shmem-rss:0kB, UID:0 pgtables:368kB oom_score_adj:200
Nov 01 13:56:07 homeassistant kernel: audit: type=1334 audit(1698846966.368:1924): prog-id=482 op=UNLOAD
Nov 01 13:56:05 homeassistant systemd[1]: docker-1b2f46f996d7235aef6ebaaba5fdc477dd3d9e887f766051966bf42e8006bf8f.scope: A process of this unit has been killed by the OOM killer.
Nov 01 13:56:06 homeassistant systemd[1]: docker-1b2f46f996d7235aef6ebaaba5fdc477dd3d9e887f766051966bf42e8006bf8f.scope: Deactivated successfully.
Nov 01 13:56:06 homeassistant systemd[1]: docker-1b2f46f996d7235aef6ebaaba5fdc477dd3d9e887f766051966bf42e8006bf8f.scope: Consumed 2min 50.186s CPU time.
Nov 01 13:56:14 homeassistant dockerd[537]: time="2023-11-01T13:56:13.915143999Z" level=info msg="ignoring event" container=1b2f46f996d7235aef6ebaaba5fdc477dd3d9e887f766051966bf42e8006bf8f module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Nov 01 13:56:16 homeassistant kernel: audit: type=1325 audit(1698846976.331:1925): table=nat:390 family=2 entries=1 op=nft_unregister_rule pid=545378 subj=unconfined comm="iptables"
Nov 01 13:56:16 homeassistant kernel: audit: type=1300 audit(1698846976.331:1925): arch=c00000b7 syscall=211 success=yes exit=512 a0=3 a1=7fe376fbf8 a2=0 a3=1 items=0 ppid=537 pid=545378 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="iptables" exe="/usr/sbin/xtables-nft-multi" subj=unconfined key=(null)

Can this be of help? It sounds like memory was depleted and then some processes were killed. But I'm not sure whether Z2M was the process that filled the entire memory (cause) or whether it was shut down because no memory was left (effect).

(screenshot attached)

For sure, something that occurs exactly once every 60 minutes makes you think...
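One detail worth noting in the log above is `oom_score_adj:200` on the killed `node` process: the kernel picks its OOM victim by `oom_score`, and a positive adjustment makes a process a preferred target, so the add-on can be killed even if it is not the one filling the memory. A minimal sketch to inspect this, assuming shell access to the host (selecting the process via `pidof node` is illustrative and may match more than one process):

```sh
# Badness score and adjustment of the first matching node process
pid=$(pidof node | awk '{print $1}')
cat /proc/$pid/oom_score      # higher = more likely to be killed
cat /proc/$pid/oom_score_adj  # bias applied by the container runtime
```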

Koenkk commented 10 months ago

Could it be that another process starts every 60 min and uses all the memory? Try stopping some other services and see if the issue persists; anyway, this doesn't look like a z2m issue.
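A simple way to see which container's memory grows over the hour, assuming shell access to the host (the output path `/tmp/mem.log` is illustrative):

```sh
# Snapshot per-container memory usage once a minute for an hour
for i in $(seq 60); do
  date >> /tmp/mem.log
  docker stats --no-stream --format "{{.Name}}\t{{.MemUsage}}" >> /tmp/mem.log
  sleep 60
done
```

The container whose usage climbs right before each kill is the likely culprit.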

gitaday commented 9 months ago

Hello,

Here is my understanding:

Problems:

P1) The Z2M of @nicola-spreafico crashes every 60 minutes because of an out-of-memory condition.
P2) The root cause of the out-of-memory condition is a scheduled maintenance task in the Linux kernel that needs some extra memory to run, so the kernel decides to kill the best candidate for killing... which happens to be Z2M. Reference: the Linux Out of Memory killer.

Solutions:

1) Buy new hardware with more RAM. 2 GB should be fine, but the more the better.
2) Disable/delete most of the add-ons and integrations on your rpi3. If you are not convinced by this solution, give it a try for 48 hours and you will see that Z2M won't crash anymore.
3) To be confirmed: maybe there is a way to tune the Linux memory-management behaviour (https://www.kernel.org/doc/gorman/html/understand/understand016.html). You could maybe:
3.1) modify the frequency (i.e. every 12 hours instead of every 60 minutes), or
3.2) modify the criteria for the memory thresholds; right now the threshold looks to be somewhere between 80% and 90% (see the sketch below).
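For 3.2, a minimal sketch of the kind of tuning meant here, assuming a stock Linux host where sysctl values are writable (HAOS may not persist these across reboots; the numbers are illustrative, not recommendations):

```sh
# Keep a larger emergency reserve so the kernel starts reclaiming earlier (value in kB)
sysctl -w vm.min_free_kbytes=16384

# Swap more aggressively before the OOM killer has to step in (0-100)
sysctl -w vm.swappiness=80
```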

Take care!

nicola-spreafico commented 9 months ago

Hello @gitaday , thank you for your reply.

Since my last message I used Black Friday to buy new hardware, a mini PC with 16 GB of RAM, so that I can install multiple VMs and LXC containers. I also separated Z2M from HA. 1 GB of total RAM was in any case not enough to be expandable in the future.

(screenshots attached)

Since then it has been running smoothly. I had to reboot recently due to a configuration change, but besides this it has been running without a crash since 2023-12-06 20:08:12.

bartplessers commented 8 months ago

Seems I have the same problem here. Running:

Hardware RPI3 B Rev 1.2

CC2652 (Electrolama zig-a-zig-ah! (zzh!)) Firmware: CC2652R_coordinator_20221226

Home Assistant Core 2024.1.2
Supervisor 2023.12.0
Operating System 11.3
Frontend 20240104.0

Zigbee2MQTT Current version: 1.35.1-1

Before: I was running HA on DietPi, supervised. Everything worked smoothly. Now: I switched to HAOS and the problems started: random crashes of Z2M without any (debug) logs in Z2M.

I will activate "Watchdog" to see if the same time pattern occurs

bartplessers commented 8 months ago

Mmmmm. Seems to be more frequent here... (screenshot 2024-01-09_11-17-15 attached)

ottoketney commented 8 months ago

I have the same behaviour

Zigbee2MQTT version: 1.35.1
Coordinator type: EZSP v9
Coordinator revision: 7.1.1.0 build 273

gitaday commented 6 months ago

Hello,

@bartplessers and @ottoketney ,

You should give a try to this fourth solution (workaround), following the 3 solutions in my previous post:

  4. Monitor and then increase your swap size (see the sketch below). Reference: https://www.youtube.com/watch?v=IvtnCB-I4dw&t=24s&ab_channel=kian
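For reference, a minimal sketch of adding a swap file on a generic Debian-style host (HAOS manages its own storage, so the path and size here are illustrative):

```sh
# Check current swap, then add and enable a 1 GB swap file
swapon --show
fallocate -l 1G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
```

To make it survive a reboot you would also add an entry for it to /etc/fstab.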

Bye!

bartplessers commented 6 months ago

Thanx @gitaday! But meanwhile I switched to a NUC, Proxmox and HAOS. No problems since then.

But thanx for your suggestion.

Kind regards Bartplessers

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 30 days