Open KarlZ opened 2 years ago
Headline: SetEvaluationTimeInSeconds does appear to control both the polling on every server (even staging slots) and the page refresh.
Details: But in reality, it controls the polling, and the page refresh "goes along for the ride". If you refresh the page, it will just show the last poll result. We have too many "readiness" (Configuration) health checks to be running them all the time. We did split out the "liveness" checks and have a very small number of those that execute in <50ms.
I do still wish that I could just enjoy the nice UI (1 in the original comment) without having to use the history. I would like it if that healthchecks-ui page could be configured to just call healthz ONLY when I visit that page.
For now, we've removed HealthCheckUI.
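The "call healthz ONLY when I visit that page" behaviour asked for above can be sketched as a pull model, as opposed to a background poller. This is a minimal illustration, not how HealthChecks UI is implemented: the class and function names are hypothetical, the fetch function is a stub standing in for an HTTP GET of healthz, and the payload shape (a top-level status plus named entries) is an assumption.

```python
import json
from typing import Callable


def render_health(fetch: Callable[[], str]) -> str:
    """Fetch the health payload once and format it as plain text."""
    report = json.loads(fetch())
    lines = [f"Overall: {report['status']}"]
    for name, entry in sorted(report.get("entries", {}).items()):
        lines.append(f"  {name}: {entry['status']}")
    return "\n".join(lines)


class OnDemandStatusPage:
    """Hypothetical page that calls the health endpoint only when viewed."""

    def __init__(self, fetch: Callable[[], str]) -> None:
        self._fetch = fetch
        self.calls = 0  # number of health requests actually issued

    def view(self) -> str:
        # No timer and no stored history: one request per page visit.
        self.calls += 1
        return render_health(self._fetch)


def fake_healthz() -> str:
    # Assumed payload shape; stands in for a real GET of the healthz endpoint.
    return json.dumps({
        "status": "Unhealthy",
        "entries": {
            "sql": {"status": "Healthy"},
            "config": {"status": "Unhealthy"},
        },
    })


page = OnDemandStatusPage(fake_healthz)
print(page.view())
```

With this shape, an unvisited page generates no health requests at all, which is the trade-off the comment is asking for: no history, but no background load either.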
Agreed, out of the box, it does too much. Everything it does, it does very well. But it would be nice to be able to (via config settings), choose between real-time and as-needed monitoring, and data capture vs persistence.
I totally agree with this assessment. Our servers are heavily impacted by the history polling, since the executed query is something like this....
exec sp_executesql N'SELECT [t].[Id], [t].[DiscoveryService], [t].[LastExecuted], [t].[Name], [t].[OnStateFrom], [t].[Status], [t].[Uri], [h].[Id], [h].[Description], [h].[HealthCheckExecutionId], [h].[Name], [h].[On], [h].[Status], [h0].[Id], [h0].[Description], [h0].[Duration], [h0].[HealthCheckExecutionId], [h0].[Name], [h0].[Status], [h0].[Tags]
FROM (
SELECT TOP(2) [e].[Id], [e].[DiscoveryService], [e].[LastExecuted], [e].[Name], [e].[OnStateFrom], [e].[Status], [e].[Uri]
FROM [health].[Executions] AS [e]
WHERE [e].[Name] = @__configuration_Name_0
) AS [t]
LEFT JOIN [health].[HealthCheckExecutionHistories] AS [h] ON [t].[Id] = [h].[HealthCheckExecutionId]
LEFT JOIN [health].[HealthCheckExecutionEntries] AS [h0] ON [t].[Id] = [h0].[HealthCheckExecutionId]
ORDER BY [t].[Id], [h].[Id], [h0].[Id]',N'@__configuration_Name_0 nvarchar(500)',@__configuration_Name_0=N'Services'
... and every 30 seconds it loads tens of thousands of records into RAM, even when nobody is using the UI. IMHO this is pretty problematic, since we are generating a lot of workload (on the DB and in the RAM of the server) for nothing. The UI is an amazing feature, but this limitation is preventing us from using it, and unfortunately we had to disable it.
+1, this query is effectively a cross join. E.g., when you have 10K rows in the history table and 15 records in the other table, it fetches 150K rows every 30 seconds per user who is viewing the status page.
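The row multiplication described above follows from both LEFT JOINs keying off the same parent Id: every history row for an execution is paired with every entry row for that execution. The sketch below reproduces the effect with Python's built-in sqlite3 module. The table and column names are simplified stand-ins for the real schema, and the row counts are the ones quoted in the comment.

```python
import sqlite3


def joined_row_count(histories: int, entries: int) -> int:
    """Count the rows produced by two sibling LEFT JOINs on one parent row."""
    con = sqlite3.connect(":memory:")
    # Simplified stand-in schema (assumption): one parent, two child tables.
    con.executescript("""
        CREATE TABLE Executions (Id INTEGER PRIMARY KEY, Name TEXT);
        CREATE TABLE Histories (Id INTEGER PRIMARY KEY, ExecutionId INTEGER);
        CREATE TABLE Entries (Id INTEGER PRIMARY KEY, ExecutionId INTEGER);
        INSERT INTO Executions (Id, Name) VALUES (1, 'Services');
    """)
    con.executemany(
        "INSERT INTO Histories (ExecutionId) VALUES (?)", [(1,)] * histories
    )
    con.executemany(
        "INSERT INTO Entries (ExecutionId) VALUES (?)", [(1,)] * entries
    )
    # Same join shape as the generated query: both children join on t.Id.
    (count,) = con.execute("""
        SELECT COUNT(*)
        FROM Executions AS t
        LEFT JOIN Histories AS h ON t.Id = h.ExecutionId
        LEFT JOIN Entries AS h0 ON t.Id = h0.ExecutionId
    """).fetchone()
    con.close()
    return count


print(joined_row_count(histories=10_000, entries=15))  # 150000
```

Loading the two child collections with separate queries would return 10,000 + 15 rows instead of 10,000 x 15; whether the library can be configured to do that is not something this thread establishes.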
Headline: Is there a way that I can prevent the polling that is used for recording history? I really just want the nice UI (which is great here!). But now there is a lot more traffic on the site as this polling takes place.
Details: It seems like there are two user stories that appear to be tightly coupled:
1. A nice UI for reading the results of healthz.
2. A recorded history of changes from Healthy to Unhealthy and vice versa, so I understand my Uptime.
I have other systems that monitor the history by hitting the healthz page. If something is wrong, our team gets an email. I then want to go to the healthchecks-ui page so that, as a human, I can easily read the status. I was quite surprised when the request count went up so much as a result of this. I have 6 servers, each with two slots. So now I have 12 slots making requests that are being recorded in a single Application Insights instance. Our monitoring systems monitor the production slots and we're not concerned with the -staging slots. But I would actually like to turn off that polling on all sites and leave the history to my other systems. Using .SetEvaluationTimeInSeconds does not impact this polling but rather only seems to affect the polling when a user is on the healthchecks-ui page. It seems that about every 30 seconds each slot will poll to record the history if there is a change. I realize that I lose the advantage of knowing the history on a more granular level (each individual HealthCheck vs. the rollup). But our systems on Azure are rarely down.
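To put a number on the extra traffic, here is a rough back-of-the-envelope calculation. It assumes each of the 12 slots issues exactly one health request per 30-second interval and nothing else; the real count depends on how many endpoints each slot polls.

```python
def polls_per_day(slots: int, interval_seconds: int) -> int:
    """Requests per day if every slot polls once per interval."""
    seconds_per_day = 24 * 60 * 60
    return slots * (seconds_per_day // interval_seconds)


# 6 servers x 2 slots, one poll every 30 seconds (figures from the post above).
print(polls_per_day(slots=12, interval_seconds=30))  # 34560
```

Under those assumptions, turning off polling on the six -staging slots alone would halve the figure.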