custom-components / remote_homeassistant

Links multiple home-assistant instances together
Apache License 2.0

Intermittent "heartbeat failed" errors #292

Open larry-glz opened 3 months ago

larry-glz commented 3 months ago

hi,

i have a master and slave connected via VPN, both on 4.1. The VPN is fairly solid, with the occasional ping or packet loss.

recently i've experienced heartbeat failures:

This error originated from a custom integration.

Logger: custom_components.remote_homeassistant
Source: custom_components/remote_homeassistant/__init__.py:497
integration: Remote Home-Assistant (documentation, issues)
First occurred: June 25, 2024 at 8:11:10 PM (327 occurrences)
Last logged: 10:29:48 AM

heartbeat failed

i want to tie this to the HA update from 2024.5.x to 2024.6.x, but i'm not entirely sure. When the heartbeat fails, it tends to slow down the master HA instance until it recovers. the best way i know it's happening is that the HA App will hang/crash when the heartbeat fails. So i wonder:

Thanks,

jaym25 commented 3 months ago

@larry-glz Sounds like intermittent internet/vpn connection drops. Heartbeat failure is common unless your connection is extremely solid. I get heartbeat failed errors occasionally and just ignore them if the system is working. I've never seen the hang/crash issue you're experiencing even when the remote instance is offline. There also could be some issues with this component and VPN setups. We will need some help to tackle that issue if this is the case... Check out issue - Remote Home Assistant over VPN #283

Heartbeat failed could possibly be changed to a warning and we may do that in a future release.

larry-glz commented 3 months ago

This is a hunch, but it seems like the HA app is interfering somehow. Any time I detect the heartbeat failures (by checking logs), it’s when I just used the HA app (iOS). Is there any log I can capture? I wonder if the websocket changes made to the app a couple of releases ago (2024.4) are contributing to the failures or conflicting with this component? Or maybe the HA app crashing is a symptom?

jaym25 commented 3 months ago

@larry-glz @lukas-hetzenecker I'm on the latest HA Release 2024.6.4 and the Android HA app and not experiencing these problems at all. It sounds to me like the iOS HA App is causing the slowdowns and the RHA heartbeat failures are a symptom of that independent problem. No idea what logs would help except changing RHA logs to debug in the logger integration. Checking HA for iOS app issues would be another avenue. From what you're saying, I'm thinking it's possible you could disable this component and have the same problems minus the heartbeat warnings. If not, we can only hope somebody reading this can help us out, because this is way beyond my knowledge base... If you look at the recent closed pull requests, I jumped in to try to save this component because I enjoy it so much and had it working so well for myself, but I'm not the original developer and not that well versed on some of these issues. Keep this thread updated with anything you find in case somebody reading it can give us a hand...
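
For reference, switching the RHA logs to debug through the logger integration might look roughly like the sketch below in `configuration.yaml`. The logger name is the one shown in the log entry earlier in this thread; the `default: warning` level is only an assumption for illustration and should be set to whatever the instance already uses.

```yaml
# Sketch only: raise this component's log level to debug.
logger:
  # Assumed baseline level for everything else; adjust to taste.
  default: warning
  logs:
    # Logger name as reported in the "heartbeat failed" log entry.
    custom_components.remote_homeassistant: debug
```

Debug output can be verbose, so it is probably worth reverting this once a few heartbeat failures have been captured.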

larry-glz commented 3 months ago

i'll research if anyone is having HA App issues similar to what i'm seeing. thanks for picking up the project btw. i didn't want it to be mothballed...

dinan5 commented 2 months ago

I've been having these issues too and I've been trying to isolate the cause. I am wondering if certain errors from components on the remote node could be "pausing" the remote instance and therefore not responding to the ping from the RHA client? As an example I've seen some correlation between timeout errors on both Eufy and Orbit B-Hyve. Very frequently, when I get Heartbeat Failed on the RHA client, if I look on the remote node I see errors from one of these components right beforehand. Any chance there is some error recovery code in HA that is single threaded?

Update - I've deleted Eufy from all instances except the remote instance and I am using RHA to obtain status on the Eufy device from the remote instance. So far, no more errors. More to come....

Next update - been running clean all day with one exception, when I received the following error message on my remote instance: Error from stream worker: Error demuxing stream (ERRORTYPE_5, I/O error, rtsp://:@xx.xx.,xxxx.xxx:xxx/Streaming/Channels/xxx)

... at which time I also got a Heartbeat Failed message. I'm starting to really think there is something single threaded in error handling in HA.

jaym25 commented 2 months ago

Heartbeat failure has been changed to a Warning. Heartbeat failures warn you of problems communicating with the Remote HA for whatever reason. Some people may not want to see these failures and now they can use logger to limit logged items, from this component, to errors or greater. This way they will not see any heartbeat failures while still being notified of errors.
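
A minimal sketch of what that logger setting could look like in `configuration.yaml`, assuming the logger name shown in the log entry at the top of this thread. The `default: warning` level is an assumption for illustration, not something this component requires.

```yaml
# Sketch only: hide heartbeat warnings from this component, keep its errors.
logger:
  # Assumed baseline level for everything else; adjust to taste.
  default: warning
  logs:
    # Only messages at error level or above are logged for RHA.
    custom_components.remote_homeassistant: error
```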

dinan5 commented 2 months ago

Thank you!