FR: Add Diagnostic Entities on Scanners

agittins commented 4 days ago

From message from Lash-L — Today at 2:13 PM

Thoughts on adding a device for each scanner and then adding some entities to it? I see on a lot of issues you do a lot of debugging diagnostics. What if some of those attributes became easier for users to see and parse, e.g. "Average update interval"?

I'm dead keen on this, but haven't worked out how/what exactly yet, so let's try now.... DRAFT FOR DISCUSSION

These would all probably be on a 1 minute or longer update cycle.

[ ] Proxy Avg Update Interval (or peak avg update interval?)
- answers: What's the best this proxy can do at catching every ad from a device?
- interpretation: 0-1 Excellent, 1-2 Great, 3 Moderate
- search all (recent) hist_intervals for the given proxy, and find the one with the lowest mean().
- report the mean/min/max of that interval set.
- this works by identifying the device that makes this scanner look the best it can, avoiding benchmarking against devices with long intervals or lossy signal paths.
- this is "aliased" by Bermuda's ~1-second update rate, so a proxy can't do better than that.
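A minimal sketch of the "lowest mean interval" search described above, not Bermuda's actual code. It assumes the recent intervals for one proxy are available as a plain dict keyed by device address; that layout, and the name `best_interval_stats`, are hypothetical.

```python
from statistics import mean


def best_interval_stats(hist_intervals_by_device):
    """Pick the device whose recent intervals have the lowest mean.

    hist_intervals_by_device: assumed layout, {device_address: [seconds, ...]}
    for a single proxy. Returns None when no device has any intervals.
    """
    # Ignore devices with no recorded intervals; mean() would raise on them.
    candidates = {dev: iv for dev, iv in hist_intervals_by_device.items() if iv}
    if not candidates:
        return None
    best = min(candidates, key=lambda dev: mean(candidates[dev]))
    intervals = candidates[best]
    return {
        "device": best,
        "mean": mean(intervals),
        "min": min(intervals),
        "max": max(intervals),
    }


# Made-up sample data: one chatty device, one slow one, one with no history.
sample = {
    "aa:bb": [1.0, 1.0, 2.0],
    "cc:dd": [5.0, 6.0, 7.0],
    "ee:ff": [],
}
stats = best_interval_stats(sample)
```

Here the chatty device `aa:bb` is chosen, so the slow advertiser `cc:dd` does not drag the proxy's score down.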
[ ] Proxy Reporting Stats
- How well does the proxy forward ads to HA? Do we always have fresh data from the proxy every second?
- Replace the stale_updates count with a hist_interval_updates list of [fresh, stale] pairs. On each update cycle we increment fresh or stale, depending on whether the proxy has given us any new data. If the update is fresh and the current stale count is not zero, we first insert a new pair into the list. The list then holds pairs of contiguous fresh/stale update counts.
- Entities:
- [ ] Proxy Update Loss (%) = 1 - (sum(fresh) / sum(fresh+stale))
  - how often this proxy fails to provide fresh data. ESPHome should be 0%. Shelly should be 33% or 0%, depending on whether its rate-limiting is synchronous across all devices or not.
- [ ] Proxy Avg Outage Duration (s) = mean(stale)
- [ ] Proxy Avg Outage Frequency (Hz) = mean(fresh) / sum(fresh+stale)
  - multiply by seconds in a day for outages/day
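The fresh/stale bookkeeping above could be sketched roughly as follows. The function names are hypothetical, and the sketch assumes one call per update cycle. It also assumes that only non-zero stale runs count as outages when averaging, which is one possible reading of mean(stale). Durations come out in update cycles, which is roughly seconds at a ~1-second update rate.

```python
from statistics import mean


def record_update(runs, is_fresh):
    """Record one update cycle in a list of [fresh, stale] run pairs (in place)."""
    if not runs:
        runs.append([0, 0])
    if is_fresh:
        # A fresh update after a stale run starts a new pair.
        if runs[-1][1] != 0:
            runs.append([0, 0])
        runs[-1][0] += 1
    else:
        runs[-1][1] += 1


def update_loss(runs):
    """Fraction of cycles without fresh data: 1 - sum(fresh) / sum(fresh + stale)."""
    total = sum(fresh + stale for fresh, stale in runs)
    if total == 0:
        return None
    return 1 - sum(fresh for fresh, _ in runs) / total


def avg_outage_duration(runs):
    """Mean length of the stale runs, in update cycles (assumption: skip zeros)."""
    outages = [stale for _, stale in runs if stale > 0]
    return mean(outages) if outages else 0.0


# Made-up sequence of cycles: F F S F F F S S F
runs = []
for is_fresh in [True, True, False, True, True, True, False, False, True]:
    record_update(runs, is_fresh)
```

After this sequence `runs` is `[[2, 1], [3, 2], [1, 0]]`: six fresh cycles out of nine, and two outages lasting one and two cycles.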

I think we could do something very similar to the proxy stats for devices. It's a little trickier in that "outages" are legitimate for devices, because they sometimes leave home, while proxies aren't expected to. But by trimming the lists so that sum(fresh)+sum(stale) stays below a certain time limit, we get a good "recent stats as of now" measurement, and HA's history of that entity shows its variation over time, so you'd see your phone performing well, but then doing "poorly" for a few hours because you were out at work, etc.

So for devices, we check if any proxy has a fresh update for us and update fresh/stale accordingly.
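As a rough sketch of the device case: freshness is decided across all proxies, and the list of [fresh, stale] pairs is trimmed to a recent window. The names `device_is_fresh` and `trim_runs`, the per-proxy timestamp dict, and the choice to drop whole pairs from the oldest end are all assumptions made for illustration.

```python
def device_is_fresh(last_seen_by_proxy, cycle_start):
    """True if any proxy reported this device at or after the cycle start.

    last_seen_by_proxy: assumed layout, {proxy_id: timestamp_in_seconds}.
    """
    return any(stamp >= cycle_start for stamp in last_seen_by_proxy.values())


def trim_runs(runs, max_cycles):
    """Drop the oldest [fresh, stale] pairs until the total fits the window.

    The newest pair is always kept, so the list can still exceed max_cycles
    if a single run is longer than the window.
    """
    while len(runs) > 1 and sum(fresh + stale for fresh, stale in runs) > max_cycles:
        runs.pop(0)
    return runs


# Made-up data: nine recorded cycles, trimmed to a six-cycle window.
trimmed = trim_runs([[2, 1], [3, 2], [1, 0]], max_cycles=6)
seen = device_is_fresh({"proxy_a": 10.0, "proxy_b": 12.5}, cycle_start=12.0)
```

Dropping whole pairs keeps the bookkeeping simple at the cost of a window that is only approximately the requested length.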

Something to keep in mind is that proxy entries in the devices{} dict will also be metadevices in future.

agittins commented 4 days ago

It would also be great if the diagnostics could include an n-most-recent list of each of these sensors too.

Either by storing them into something in Bermuda or by having the diagnostics do a query against recorder for the data.

The goal here is that someone can work out their own issues between the exposed entities and the docs, or they can share screenshots, or they can upload a diagnostics file.

At some point I'd like to write a bot/workflow for issues that will parse out useful information when diagnostics are included in the ticket, which should make triage quicker. (Raised #365)

Lash-L commented 4 days ago

What do you think about a binary sensor to go along with this data? Either one to encompass all of it or one for each entity.

Basically healthy or unhealthy.

That way the user can immediately see - Oh something is wrong with this proxy, let me dive in and figure out what.

Maybe also a FAQ for each entity with troubleshooting steps, e.g. if Proxy Avg Update Interval > 3 then try:

1., 2., 3., etc.

agittins commented 4 days ago

Yeah, great idea! I was thinking about using the "Repairs" feature, which allows using URIs to provide solutions. I don't know how annoying that might get for this sort of thing, but we could have a Button entity to "Check for Repairs" so that it only creates Repairs when the user asks for help, perhaps prompted by the Health indicator going off.

But yes a simple binary that makes an easy automation target might be a great idea.
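One way the healthy/unhealthy decision could be derived from the stats discussed in this thread, as a plain-Python sketch with no Home Assistant wiring. The 3-second interval limit comes from the example earlier in the thread; the 0.5 loss limit and all function names are placeholders, not agreed values.

```python
def health_problems(avg_update_interval_s, update_loss,
                    max_interval_s=3.0, max_loss=0.5):
    """Return the names of the stats that are outside their limits.

    max_loss is a placeholder threshold chosen only for this sketch.
    """
    problems = []
    if avg_update_interval_s > max_interval_s:
        problems.append("avg_update_interval")
    if update_loss > max_loss:
        problems.append("update_loss")
    return problems


def is_healthy(avg_update_interval_s, update_loss):
    """Single boolean suitable for a binary sensor or an automation trigger."""
    return not health_problems(avg_update_interval_s, update_loss)


# Made-up readings for two proxies.
good = is_healthy(1.2, 0.0)
bad_reasons = health_problems(4.0, 0.6)
```

Returning the list of failing stats alongside the boolean would let a FAQ or Repairs entry point at the specific entity that needs attention.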

agittins / bermuda

FR: Add Diagnostic Entities on Scanners #363