catalyst / moodle-tool_heartbeat

Moodle health checks for load balancers / nagios
https://moodle.org/plugins/tool_heartbeat

Exempting specific checks #169

Open brendanheywood opened 7 months ago

brendanheywood commented 7 months ago

We want to create the ability to add extra config in heartbeat which applies on top of all the status checks and if an override is in place then it:

brendanheywood commented 7 months ago

Had a chat with @matthewhilton, and I think the cleanest design is that overrides only ever happen at the check level, which will make the logic for handling them much easier. But this kinda moves the problem, in that we still have some checks, like 'are there any slow tasks', which are really checking a whole bunch of things, and when they fail it could be because of any type of task.

So the solution I have in mind is that various checks (only the ones in heartbeat) conditionally declare multiple checks, one for each class of issue. So let's say that a site is green and there are 100 tasks and they are all good: then there is 1 check and it is green.

Now let's say that 3 types of task start to fail. Then we will see the one original check, which says 97 tasks are good, and 3 new extra checks which say 'task foo is broken', 'task bar is broken', 'task blah is broken', and now we can address each of them individually. In other words, the main check will never actually fail; it will only spawn failing checks. It also means the logic of looking for failing tasks needs to be moved back into lib.php (or called from there) rather than living inside the result object. A little weird, but I think it's OK for this situation, as it's a fast query and it only moves the perf hit a bit earlier.
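To make the idea concrete, here is a rough language-agnostic sketch in Python rather than Moodle PHP. It is not the plugin's code: `Check`, `declare_task_checks`, and the `task:<name>` id scheme are all hypothetical names made up for illustration. It only shows the shape of "one summary check that stays green, plus one extra check per failing task".

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    """Hypothetical stand-in for a declared status check."""
    id: str
    status: str  # "ok" or "error"
    summary: str


def declare_task_checks(task_failing: dict[str, bool]) -> list[Check]:
    """Declare one always-green summary check plus one check per failing task.

    task_failing maps a task name to True when that task is failing.
    The lookup happens at declaration time, not inside a result object.
    """
    failing = sorted(name for name, failed in task_failing.items() if failed)
    healthy = len(task_failing) - len(failing)

    # The summary check never fails; it only reports the healthy count.
    checks = [Check("tasks", "ok", f"{healthy} tasks are good")]

    # Each failing task becomes its own check so it can be overridden alone.
    checks += [
        Check(f"task:{name}", "error", f"task {name} is broken")
        for name in failing
    ]
    return checks
```

With 97 healthy tasks and three failing ones (foo, bar, blah), this declares four checks: the green summary and three failing checks, each of which could then carry its own override.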

brendanheywood commented 7 months ago

One more thing: if we mark a failing check as muted for a month, and after that month the check is actually resolved (either the check is no longer declared, or it is declared and passing), then I think we should explicitly mark the override as having been resolved. If the check is still failing, then it is shown as overdue and it keeps alerting, and someone will probably extend it again and / or resolve it properly. We want a full audit trail of who added overrides if the same check fails intermittently over time.
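The lifecycle described here could be sketched roughly as below, again in Python and again with hypothetical names (`override_state`, the state strings, and the use of `None` for "check no longer declared" are assumptions, not anything in heartbeat). The sketch also assumes that nothing changes before the expiry date, which the comment above does not specify.

```python
from __future__ import annotations

from datetime import datetime
from typing import Optional


def override_state(
    expires_at: datetime,
    now: datetime,
    check_status: Optional[str],
) -> str:
    """Classify an override as "muted", "resolved", or "overdue".

    check_status is "ok" or "error" for a declared check, or None when
    the check is no longer declared at all.
    """
    # Assumption: until the override expires, the check simply stays muted.
    if now < expires_at:
        return "muted"

    # After expiry, a missing or passing check means the issue went away,
    # so the override is explicitly marked as resolved.
    if check_status is None or check_status == "ok":
        return "resolved"

    # Still failing after expiry: surface it again and keep alerting.
    return "overdue"
```

For the audit trail, each override would presumably be kept as its own record (who added it, when, and its final state) rather than being overwritten when the same check fails again later.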

owenherbert commented 7 months ago

I've created a consolidated README.md file showing what these changes would look like. Let's discuss it on Monday.