Open dominiqueclarke opened 2 years ago
From the above potential solutions:
Additional suggestions:
timeout for browser monitors: Having a timeout shorter than the scheduled monitor interval can prevent gaps in the timeline. @andrewvc to investigate in https://github.com/elastic/beats/issues/29454
heartbeat/summary document before resolving the alert. @dominiqueclarke to investigate in https://github.com/elastic/kibana/issues/121330
Problem
Browser monitors differ enough from lightweight checks that they have exposed unintended bugs in the alerting framework.
History
We have received a handful of SDHs for browser monitor alerts. Some issues include:
Fixes
Fixes have gone in to improve the experience, including:
Outstanding Issues
1. Gaps in monitor.timespan leading to flapping alerts
Browser monitors are more likely to run longer than the scheduled interval. When this happens, it can create gaps in the monitor.timespan value for individual checks. Gaps in the timeline can cause alerts to flap unintentionally between the triggered and resolved states.
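To make the gap concrete, here is a minimal Python sketch. It assumes a simplified model that is not taken from Heartbeat itself: each check's timespan begins at the check's start time and lasts exactly one schedule interval, and a slow check delays the start of the next one. The function name find_gaps and the sample times are hypothetical.

```python
from datetime import datetime, timedelta


def find_gaps(timespans):
    """Return (gap_start, gap_end) pairs that no timespan covers.

    `timespans` is a list of (gte, lt) datetime pairs, one per check.
    """
    if not timespans:
        return []
    ordered = sorted(timespans)
    gaps = []
    covered_until = ordered[0][1]
    for gte, lt in ordered[1:]:
        if gte > covered_until:
            # Nothing covers the window between the previous span and this one.
            gaps.append((covered_until, gte))
        covered_until = max(covered_until, lt)
    return gaps


# Assumed scenario: 3-minute schedule, but the first check takes 5 minutes,
# so the second check cannot start until 12:05.
interval = timedelta(minutes=3)
starts = [
    datetime(2021, 12, 1, 12, 0),
    datetime(2021, 12, 1, 12, 5),
    datetime(2021, 12, 1, 12, 8),
]
timespans = [(s, s + interval) for s in starts]
gaps = find_gaps(timespans)
for gap_start, gap_end in gaps:
    print(f"uncovered: {gap_start:%H:%M} to {gap_end:%H:%M}")
```

Under these assumptions, an alert rule evaluated between 12:03 and 12:05 finds no check whose timespan covers the current time, so a monitor that is still down could be treated as recovered until the next check reports in.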
Potential solutions:
monitor.timespan to factor in the length of the synthetic check: https://github.com/elastic/beats/issues/29102 Our current alerting rules rely on monitor.timespan by looking back in history to see if there is a down check within the monitor.timespan range. By increasing monitor.timespan to represent the greater of the time it takes to run the check or the time until the next scheduled check, we prevent gaps in the monitor timeline and can continue using the existing architecture to resolve this issue.
monitor.timespan would be a cleaner, less complex option for resolution. It's also important to note that we've had discussions about completely overhauling alerting in the past, which has contributed to the desire to avoid adding complexity to the existing design if possible. Example PR for this solution: https://github.com/elastic/kibana/pull/100339
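The "greater of the two" rule proposed for monitor.timespan can be sketched in a few lines of Python. This only illustrates the idea described above and is not the Heartbeat implementation; the function name adjusted_timespan and its parameters are hypothetical.

```python
from datetime import datetime, timedelta


def adjusted_timespan(check_start, check_duration, schedule_interval):
    """Return (gte, lt) for a check under the proposed rule.

    The span lasts for whichever is longer: the time the check took to run,
    or the time until the next scheduled check.
    """
    span = max(check_duration, schedule_interval)
    return check_start, check_start + span


start = datetime(2021, 12, 1, 12, 0)
interval = timedelta(minutes=3)

# A slow browser check (5 minutes) stretches the timespan past the schedule.
slow = adjusted_timespan(start, timedelta(minutes=5), interval)

# A fast check (1 minute) keeps the usual one-interval timespan.
fast = adjusted_timespan(start, timedelta(minutes=1), interval)

print(slow[1] - slow[0])
print(fast[1] - fast[0])
```

With the span stretched to cover the full run, the next check's timespan can begin where the previous one ends, so a lookback for a down check within the timespan range has no uncovered window to fall into.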