custom-components / healthchecksio

Update and display the status of your healthchecks.io checks.
MIT License

Sensors report false Disconnected state #37

Closed Mincka closed 1 year ago

Mincka commented 1 year ago

Describe the bug

The sensors often report a "Disconnected" state and come back 5 minutes later. They seem to skip a ping. However, there is no skipped ping and no grace period.

Here's a comparison between this manual config (with a 300-second update interval) and this integration:

image

image

image

Maybe the 10-second timeout is occasionally too short.

Still, I don't see why the result would differ between the HA check and the others. In my case, it mainly affects the HA check.

Debug log

I enabled debug logging in the integration but I did not see anything relevant.

ludeeus commented 1 year ago

If 10 s is not enough, you have something blocking on your instance. Based on your result, it is probably something that runs every 10 min.

Mincka commented 1 year ago

The timeout was just a guess, and it may not be related at all. How would you explain that a simple REST sensor with the same 300-second interval never reports erroneous values?

I had a look at the code and I think you are doing the same thing as the REST sensor: you get the results for all probes from the checks attribute of the response. So if a single request failed, the Disconnected status would affect all probes, not only HA or Proxmox.

I checked the HA logbook and logs, and there's nothing that points to something being blocked or disconnected.

There's no regular time pattern. It happened 5 times between midnight and 09:00, and not at all for the rest of the day.

image

Mincka commented 1 year ago

OK, I have something interesting (see image). For the two events in red, I see a shift of one minute but no "grace".

Exact logs:

2023-03-07T16:50:38.794827+00:00 2023-03-07T16:56:08.757711+00:00

2023-03-07T17:41:10.234040+00:00 2023-03-07T17:46:40.158711+00:00

2023-03-07T18:51:40.307883+00:00 2023-03-07T18:57:10.323993+00:00

For some reason, the ping is exactly 30 seconds late each time.
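The 30-second figure can be checked directly from the timestamps. The following is a minimal sketch, assuming each pair of timestamps above is two consecutive pings of the same check and that the expected period is the 5-minute (300 s) interval; the names `PING_PAIRS`, `EXPECTED_INTERVAL`, and `lateness_seconds` are made up for illustration.

```python
from datetime import datetime

# Ping timestamps copied from the logs above, grouped as (previous, following).
# Assumption: each pair is two consecutive pings of the same check.
PING_PAIRS = [
    ("2023-03-07T16:50:38.794827+00:00", "2023-03-07T16:56:08.757711+00:00"),
    ("2023-03-07T17:41:10.234040+00:00", "2023-03-07T17:46:40.158711+00:00"),
    ("2023-03-07T18:51:40.307883+00:00", "2023-03-07T18:57:10.323993+00:00"),
]

EXPECTED_INTERVAL = 300.0  # seconds, i.e. the 5-minute period


def lateness_seconds(previous: str, following: str) -> float:
    """Return how far past the expected interval the following ping arrived."""
    gap = datetime.fromisoformat(following) - datetime.fromisoformat(previous)
    return gap.total_seconds() - EXPECTED_INTERVAL


for previous, following in PING_PAIRS:
    late = lateness_seconds(previous, following)
    print(f"{previous} -> {following}: {late:+.2f} s late")
```

Each pair comes out roughly 30 seconds over the 300-second interval, which is what pushes the check briefly into its grace period.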

So I think the grace period is triggered, but my other tracker doesn't see it because it rarely polls exactly during those 30 seconds of grace.

Any idea why the component is updated with a delay of 30 seconds from time to time? It seems random. Not so important anyway since there's the grace period. I have an interval of 5 minutes and a grace period of 5 minutes, so the binary_sensor should not consider it "Disconnected" while we are in the grace period.
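To make the timing concrete, here is a rough sketch of how a check moves between "up", "grace", and "down". It assumes the usual model in which a check is "up" until one period after its last ping, then "grace" for the length of the grace period, then "down"; `check_status` is a hypothetical helper, not code from healthchecks.io or from this integration.

```python
from datetime import datetime, timedelta, timezone


def check_status(last_ping: datetime, now: datetime,
                 period: timedelta, grace: timedelta) -> str:
    """Classify a check from its last ping time (assumed model, see above)."""
    if now < last_ping + period:
        return "up"
    if now < last_ping + period + grace:
        return "grace"
    return "down"


# Illustration with a 5-minute period and a 5-minute grace period.
last_ping = datetime(2023, 3, 7, 16, 50, 38, tzinfo=timezone.utc)
five_minutes = timedelta(minutes=5)

# 15 seconds after the ping was due: late, but still inside the grace period.
polled_at = last_ping + timedelta(minutes=5, seconds=15)
print(check_status(last_ping, polled_at, five_minutes, five_minutes))  # grace
```

With a ping that arrives 30 seconds late, any poll that lands inside that 30-second window sees "grace" rather than "up".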

Since it's a binary_sensor, we need to choose whether to consider the service up while it is in the grace period. I suggest changing the logic. The two options are: consider it "Disconnected" whenever the status != "up", or consider it "Connected" when the status == "up" or "grace". https://github.com/custom-components/healthchecksio/blob/main/custom_components/healthchecksio/binary_sensor.py#L49

I am testing the second option and will report back here. Anyway, if you don't agree that it should be changed and you won't accept a PR, you can close this issue. Thank you.
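The two options can be sketched as follows. This is only an illustration of the proposed mapping, not the integration's actual code; the function names `is_connected_strict` and `is_connected_lenient` are hypothetical, and the status strings are assumed to be the ones the checks report ("up", "grace", "down").

```python
def is_connected_strict(status: str) -> bool:
    """First option: anything other than "up" counts as Disconnected."""
    return status == "up"


def is_connected_lenient(status: str) -> bool:
    """Second option: a check in its grace period still counts as Connected."""
    return status in ("up", "grace")


for status in ("up", "grace", "down"):
    print(status, is_connected_strict(status), is_connected_lenient(status))
```

The only status on which the two functions differ is "grace", which is exactly the case that produces the short "Disconnected" blips.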

hugalafutro commented 1 year ago

I have occasionally been hitting the same issue on some of my checks, without any recurring or traceable pattern. It would be OK for days, then randomly report offline every 5 minutes for half a day.

I implemented the PR @Mincka made and can confirm it fixes the issue.