MQTT keepalive timeouts

p4r4ll4x commented 4 weeks ago

What happened?

I had to reboot my server a few days ago and started seeing my devices becoming unavailable in Home Assistant. I had a very stable network without issues before the reboot.

After checking the logs, the devices become unavailable because MQTT keepalive timeouts a few times per hour:

[2024-10-17 07:01:13] error:    z2m: MQTT error: Keepalive timeout
[2024-10-17 07:01:14] error:    z2m: Not connected to MQTT server!
[2024-10-17 07:01:14] error:    z2m: Cannot send message: topic: 'zigbee2mqtt/plug_desk_speakers', payload: '{"child_lock":"UNLOCK","countdown":0,"current":0.07,"energy":68.61,"indicator_mode":"off/on","last_seen":"2024-10-17T07:01:14+02:00","linkquality":69,"power":0,"power_outage_memory":"restore","state":"ON","update":{"installed_version":67,"latest_version":67,"state":"idle"},"update_available":false,"voltage":226}
[2024-10-17 07:01:16] error:    z2m: Not connected to MQTT server!

[2024-10-17 07:20:48] error: z2m: MQTT error: Keepalive timeout

[2024-10-17 08:55:07] error:    z2m: MQTT error: Keepalive timeout
[2024-10-17 08:55:07] error:    z2m: Not connected to MQTT server!
[2024-10-17 08:55:07] error:    z2m: Cannot send message: topic: 'zigbee2mqtt/plug_desk_screens', payload: '{"child_lock":"UNLOCK","countdown":0,"current":0.62,"elapsed":5884,"energy":246.77,"indicator_mode":"off/on","last_seen":"2024-10-17T08:55:07+02:00","linkquality":36,"power":78,"power_outage_memory":"restore","state":"ON","update":{"installed_version":67,"latest_version":67,"state":"idle"},"update_available":false,"voltage":231}
[2024-10-17 08:55:08] error:    z2m: Not connected to MQTT server!
[2024-10-17 08:55:08] error:    z2m: Cannot send message: topic: 'zigbee2mqtt/plug_cam_woonkamer', payload: '{"child_lock":"UNLOCK","countdown":0,"current":0,"elapsed":5932,"energy":240.21,"indicator_mode":"off/on","last_seen":"2024-10-17T08:55:08+02:00","linkquality":36,"power":0,"power_outage_memory":"restore","state":"OFF","update":{"installed_version":67,"latest_version":67,"state":"idle"},"update_available":false,"voltage":230}

I have debug logging on and don't really see other messages right before the MQTT issues, sometimes the following error pop up:

[2024-10-17 07:04:30] debug:    zh:zstack: Data confirm error (0xa4c138222bad1e06:39508,225,0)
[2024-10-17 07:04:48] debug:    zh:zstack: Data confirm error (0x84fd27fffec22d77:12517,205,0)
[2024-10-17 07:04:48] debug:    zh:controller:endpoint: Error: ZCL command 0x84fd27fffec22d77/1 genBasic.read(["zclVersion"], {"timeout":10000,"disableResponse":false,"disableRecovery":true,"disableDefaultResponse":true,"direction":0,"reservedBits":0,"writeUndiv":false}) failed (Data request failed with error: 'No network route' (205))

Mosquitto is running on Home Assistant as an addon, Z2M is running in docker on another virtualised guest, on the same proxmox host. I have other services (DSMR, tasmota, frigate,...) using MQTT, those services are not showing issues.

I run Gatus and i added a TCP check on port 1883 to my MQTT server every second and i consequently get 1ms response times. I also added a check for https to Z2M and get 1000ms to 2000ms reponse times and sometimes timeouts (more than 10 seconds).

Any stept i can take to troubleshoot this issue?

What did you expect to happen?

No response

How to reproduce it (minimal and precise)

No response

Zigbee2MQTT version

1.40.2 commit: e06848d

Adapter firmware version

20210708

Adapter

zzh! CC2652R

Setup

Proxmox

Debug log

No response

mra282 commented 4 weeks ago

I can confirm I am having the exact same issue after rebooting host system. No config changes made just a restart to replace ups.

camlin commented 3 weeks ago

For info, I just upgraded from 1.35.1 to 1.40.2 and I immediately started to see the same issue.

Just in case it helps, my coordinator is a Sonoff ZBDongle E (EZSP v8) and the MQTT server is the version embedded in a Firebrick FB2900 (V2.01.018).

As soon as I reverted my zigbee2MQTT image back to 1.35.1 these issues disappeared again...

camlin commented 3 weeks ago

Complement of info. Since I reverted the MQTT connection has remained stable (17 hours, as seen by the MQTT server):

With 1.40.2, it was not lasting 2 minutes... The current connection is also currently fully functional (at least I am getting all of my Zigbee2MQTT events. Admittedly, in my current setup, I only receive. I do not send anything. I have been wondering if that could accidentally trip the keep alive. But I would have expected it to send ping messages if there is not enough traffic).

camlin commented 3 weeks ago

For info, I also tried with 1.40.0 same problem...

Will probably need to look at the code diff at some point to try and narrow down when the issue (or incompatibility with the way some MQTT servers handle/negotiate timeouts?) was introduced.

Dizzybacon commented 2 weeks ago

Any success with this? I have the same issue on 1.40.2 downgraded to 1.35.1 and it's still very flaky. Nothing stands out in the logs, just MQTT time out.

camlin commented 2 weeks ago

Any success with this? I have the same issue on 1.40.2 downgraded to 1.35.1 and it's still very flaky. Nothing stands out in the logs, just MQTT time out.

Interesting, I have had no timeout issues with 1.35.1. In fact, I just checked and my MQTT server is reporting that the connection has been up for 109 hours, which matches the last time I restarted the client.

Now it could well depend on the MQTT server and its setup but, if that's really the case, then it will be a harder problem to solve.

p4r4ll4x commented 2 weeks ago

I also investigated this issue further and was able to resolve it by moving to a Proxmox host with less load. Although this server had enough free resources, i don't really know what the issue was.

It seems that after the update the connection to MQTT is less stable and is easily influenced by a higher load. It could also be that after the update Z2M needs more resources, as the web interface also showed hickups and slow loading. I didn't had the issue before and the load on this server hasn't really changed.

Also TCP reponse times to my MQTT server/service that was on the same Proxmox host was always around 1ms, also during the issues with Z2M.

As we have now 4 users reporting this behaviour, i don't think it has to do with my setup specifically.

camlin commented 2 weeks ago

Hmmm this one is getting vicious and I am honestly starting to wonder if it is even related to the Z2M itself.

On my "production" Z2M (a Jetway embedded PC) it works with 1.35.1 (but fails with 1.40.2).

On my "test" Z2M (a local run of the Z2M source code from Git using node) it fails with 1.35.1 too...

Now I did a Wireshark (TCP packet capture) and I noticed that the ping request was working fine with IPv4 clients but was failing when the clients were using IPv6... And from what I can see it is a bug with my MQTT Broker (Firebrick FB2900 firmware V2.01.019): Its does respond to the ping but with an empty packet (just an ACK), which is incorrect. With IPv4 clients I can see the proper MQTT PingResp message.

So, at present, and for me, this is an issue with my broker and not Z2M... (and only if the both the server and client happens to be using IPv6).

I suspect that's the core of the issue here: it is likely a mix of multiple problems / env setup constraints. It is still possible that the Z2M version could play a role (i.e. does 1.35.1 prefers IPv4 when later versions prefers IPv6? Running it in a docker or barebones can also affect this as dockers are often set as IPv4 only).

But, in the end, I have to say my personal feeling is that Z2M is probably not at all at fault here (and what I could see of its behaviour on the packet capture was impeccable).

MPowerData commented 2 weeks ago

Anyone found a solution? I am experiencing the same issue after rebooting my host system...

camlin commented 2 weeks ago

Anyone found a solution? I am experiencing the same issue after rebooting my host system...

I found a solution: if I switch my MQTT server to Mosquitto then, in my case, the problems disappear, even when using 1.40.2...

So, I would tend to think that, here, the problems are more related to "what is the MQTT server?"...

I am not denying that 1.40.2 seems to make it worse, but it could just be an issue of the packets being bigger (or something like it) and bringing the MQTT server to the tipping point?

I know in my case I was using an embedded MQTT server (Provided by a Firebrick FB 2900), and I know that server is very aggressive in trying to keep the MQTT network under control (for example it will discard messages that have payloads bigger than 3k).

But I know that, as soon as I switched to Mosquitto (i.e. the only change I did was to edit the server: mqtt://<server> in the Z2M config file) then it immediately started to work without any timeout issues...

So I think we need to start listing the MQTT Server being used and find the common point that is causing them to misbehave.

camlin commented 2 weeks ago

Anyone found a solution? I am experiencing the same issue after rebooting my host system...

I found a solution: if I switch my MQTT server to Mosquitto then, in my case, the problems disappear, even when using 1.40.2...

So, I would tend to think that, here, the problems are more related to "what is the MQTT server?"...

I am not denying that 1.40.2 seems to make it worse, but it could just be an issue of the packets being bigger (or something like it) and bringing the MQTT server to the tipping point?

I know in my case I was using an embedded MQTT server (Provided by a Firebrick FB 2900), and I know that server is very aggressive in trying to keep the MQTT network under control (for example it will discard messages that have payloads bigger than 3k).

But I know that, as soon as I switched to Mosquitto (i.e. the only change I did was to edit the server: mqtt://<server> in the Z2M config file) then it immediately started to work without any timeout issues...

So I think we need to start listing the MQTT Server being used and find the common point that is causing them to misbehave.

And, yes, in the end, it seemed to have little to do with IPv4 vs IPv6. I did manage to get the issue with IPv4 too. Again could be an issue of difference of packet size being the tipping point.

But that should not have been a problem. Which is why that points to the MQTT server being at fault, which seems to be confirmed in my case as, as soon as I switched to Mosquitto, it magically all worked again...

MPowerData commented 2 weeks ago

Thanks for the update. I am already running Mosquitto for a long time without any issues. Both z2m and mosquitto run on a docker container. This has been stable for years... Strange.

camlin commented 2 weeks ago

BTW, @p4r4ll4x I am still trying to reconcile your scenario and mine. You can confirm you are still using Mosquitto as your MQTT server? (and only that?).

camlin commented 2 weeks ago

Thanks for the update. I am already running Mosquitto for a long time without any issues. Both z2m and mosquitto run on a docker container. This has been stable for years... Strange.

Ok, thanks for confirming @MPowerData . It is probably one of these crazy problems where there are several issues overlapping each others 😩

On my side I am trying to contact the maker of my MQTT server and check with them what they think of the bug.

And I have also switched to Mosquitto for the time being, so I will keep an eye to see if it starts to misbehave too.

p4r4ll4x commented 2 weeks ago

BTW, @p4r4ll4x I am still trying to reconcile your scenario and mine. You can confirm you are still using Mosquitto as your MQTT server? (and only that?).

I always used and still use the same Mosquitto server (HA Add-on 6.4.1 running Mosquitto 2.0.18) I can's say when this add-on was updated as i don't see the backup file anymore (cleaned up).

MPowerData commented 2 weeks ago

@Koenkk , any ideas on this issue?

billyinferno commented 2 weeks ago

Found same issue after upgrade to 1.40.2, and I already using Mosquitto for my MQTT services. Both running in docker, and now if I go refresh HA, it will showed unavailable for a moment, until the MQTT connection is restarted by Z2M.

camlin commented 1 week ago

I would be curious. What was the last version that worked before people had this problem? Me it was 1.35.1 but that's only because I went straight from 1.35.1 to 1.40.2...

I also manually tried 1.40.0 and it did have the problem (for me at least). So it probably would be interesting to find out the version that last worked for people (would hopefully help narrow the problem down).

In my case I am only having the issue when using a Firebrick FB2900 as my MQTT broker (with 1.40.x. It worked fine with 1.35.1).

But strangely enough I have had no problems when I use Mosquitto as my MQTT broker (even with 1.40.x). But some of you have (and I think most of them have the Mosquitto broker configured by HA). I am using the default Mosquitto (no-auth) config provided with the Mosquitto Docker but I am starting to suspect some config parameters are perhaps triggering that problem.

I have been trying to compare packet dumps when using 1.35.1 and when 1.40.2 and, as far as I can tell, one of the big difference is that the payloads are a lot bigger. In particular with 1.40.2 there is a bridge/definitions message that seems quite big (120KB in my case) and it does not seem to be sent with 1.35.1.

Basically I am suspecting/wondering that recent Z2M payloads have changed (as part of its normal evolution) but that something is tripping some broker depending of their config...

Koenkk commented 1 week ago

What is the last working z2m version? Could be due to a dependency upgrade (https://github.com/mqttjs/MQTT.js)

Koenkk / zigbee2mqtt