Open 389-ds-bot opened 4 years ago
Comment from firstyear (@Firstyear) at 2020-05-07 05:17:19
I think there is a server uptime variable in cn=monitor we could read, and if that's less than 30 mins we can say "this may not yet be accurate" or similar?
I'd probably say 90% and 80% are the thresholds for green/amber? But it's hard to know what's right here; there are many factors.
Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/51071
Issue Description
With the existence of autotuning, many admins are not checking whether the caches are optimally tuned. Autotuning provides a much better minimum default for the cache sizes, but it is not fully optimized. The server itself cannot do this as it doesn't know how the system is being used, etc. So an admin needs to take manual action and adjust the sizes based on the actual availability of resources. Adding a "performance check" to healthcheck would be beneficial. This check would just look at the various cache hit ratios and report warnings based on these values. For example, a cache hit ratio of less than 80% should report a warning (something like that).
The challenge is that when you first start the server, the ratios are at zero. We really should only check the cache hit ratios once the server has been up and running and/or the caches are fully primed. All this information is available in our monitors (cache stats, server uptime, etc), but when do we say it's okay to check the ratios? After 1 hour, 6 hours? Or when the entry caches are filled? This might not be so straightforward. My point is that we need to reduce the risk of a false positive if we add this type of health check to the tool.
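To make the warm-up question concrete, here is a minimal sketch of an uptime gate. It assumes the monitor exposes the server start time and current time as UTC GeneralizedTime strings (for example `20200507051719Z`). The function names and the one-hour default are hypothetical placeholders, not existing healthcheck code, and the right window is exactly what still needs to be decided.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical warm-up window; the issue leaves the right value open (1 hour? 6 hours?).
MIN_UPTIME = timedelta(hours=1)


def parse_generalized_time(value):
    """Parse a UTC GeneralizedTime string such as '20200507051719Z'."""
    parsed = datetime.strptime(value, "%Y%m%d%H%M%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def caches_ready(start_time, current_time, min_uptime=MIN_UPTIME):
    """Return True once the server has been up for at least min_uptime.

    start_time and current_time are assumed to be the raw string values
    read from the monitor; how they are fetched is out of scope here.
    """
    uptime = parse_generalized_time(current_time) - parse_generalized_time(start_time)
    return uptime >= min_uptime
```

With a gate like this, the check could skip the hit ratio report (or downgrade it to an informational "this may not yet be accurate" note) whenever `caches_ready(...)` returns False. A gate based on how full the entry caches are would have the same shape, just with a different input.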
The other issue is deciding what cache hit ratio percentages should generate warnings. For example:
95% or higher = Green, no warning
85 - 95% = Amber
< 85% = Red
This is a bit on the high end, but what percentage should trigger a warning? This should be discussed among the team.
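As a starting point for that discussion, a rough sketch of the threshold logic might look like the following. The function name and the severity labels are hypothetical, and the defaults simply mirror the 95/85 example from this issue. Both thresholds are parameters so that other proposals can be tried without changing the code.

```python
def classify_hit_ratio(ratio, green=95.0, amber=85.0):
    """Map a cache hit ratio (a percentage, 0-100) to a severity label.

    The default thresholds are the example values from this issue,
    not agreed numbers.
    """
    if not 0.0 <= ratio <= 100.0:
        raise ValueError("hit ratio must be between 0 and 100, got %r" % (ratio,))
    if ratio >= green:
        return "green"  # no warning
    if ratio >= amber:
        return "amber"
    return "red"
```

For example, `classify_hit_ratio(90)` gives "amber" with the defaults above, while `classify_hit_ratio(90, green=90, amber=80)` gives "green" under the lower 90/80 thresholds suggested in the comments.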