[Discussion] Improve alerting for browser monitors

Problem

Browser monitors differ enough from lightweight checks that they expose unintended bugs in the alerting framework.

History

We have received a handful of SDHs for browser monitor alerts. Some issues include:

Browser alert rules triggering before the specified number of down monitors is reached: https://github.com/elastic/kibana/issues/115928
Browser alerts flapping between triggered and resolved state: https://github.com/elastic/support-known-issues/issues/980

Fixes

Fixes have gone in to improve the experience, including:

https://github.com/elastic/support-known-issues/issues/980

Outstanding Issues

1. Gaps in `monitor.timespan` leading to flapping alerts

Browser monitors are more likely to run longer than the scheduled interval. When this happens, it can create gaps in the monitor.timespan value for individual checks. Gaps in the timeline can cause unintentional flapping of triggered and resolved alert state.
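To make the gap concrete, here is a minimal sketch rather than actual Heartbeat or Kibana code. It assumes each check's monitor.timespan runs from the check's start time to the start time plus the schedule interval. The `Check` shape and `findTimespanGaps` are hypothetical names used only for illustration.

```typescript
// Hypothetical shape: only the fields needed for this illustration.
interface Check {
  startedAt: number;   // epoch ms when the check began (assumed lower bound of monitor.timespan)
  timespanEnd: number; // epoch ms, assumed upper bound of monitor.timespan
}

// Returns the windows between consecutive checks that no timespan covers.
function findTimespanGaps(checks: Check[]): Array<{ from: number; to: number }> {
  const sorted = [...checks].sort((a, b) => a.startedAt - b.startedAt);
  const gaps: Array<{ from: number; to: number }> = [];
  for (let i = 1; i < sorted.length; i++) {
    const prevEnd = sorted[i - 1].timespanEnd;
    if (sorted[i].startedAt > prevEnd) {
      gaps.push({ from: prevEnd, to: sorted[i].startedAt });
    }
  }
  return gaps;
}

// A 1-minute schedule where the first check takes 3 minutes to run,
// so the second check cannot start until minute 3.
const MINUTE = 60 * 1000;
const gaps = findTimespanGaps([
  { startedAt: 0, timespanEnd: 1 * MINUTE },
  { startedAt: 3 * MINUTE, timespanEnd: 4 * MINUTE },
]);
console.log(gaps); // one gap, from minute 1 to minute 3
```

An alert rule evaluated during that uncovered window finds no check whose timespan contains "now", which is how a down monitor can briefly appear resolved.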

Potential solutions:

Solution 1: Update the monitor.timespan to factor in the length of the synthetic check (https://github.com/elastic/beats/issues/29102). Our current alerting rules rely on monitor.timespan, looking back in history to see whether a down check falls within the monitor.timespan range. By extending monitor.timespan to cover the greater of the time it takes to run the check or the time until the next scheduled check, we prevent gaps in the monitor timeline and can keep using the existing architecture.
Solution 2: Explicitly look for up monitors. This has come up a few times as the most accurate way of telling whether a monitor is resolved. However, this logic adds significant complexity to the existing alerting architecture. A PR was constructed to achieve this goal, but it was later determined that querying by monitor.timespan would be a cleaner, less complex option for resolution. It's also important to note that we've had discussions about completely overhauling alerting in the past, which has contributed to the desire to avoid adding complexity to the existing design if possible. Example PR for this solution: https://github.com/elastic/kibana/pull/100339
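The timespan adjustment described in the first potential solution can be sketched as follows. This is an illustration of the "greater of the two" rule only, not the Heartbeat implementation; `extendedTimespanEnd` and its parameters are hypothetical names.

```typescript
// Upper bound of monitor.timespan under the proposed rule: the check start
// plus the greater of the schedule interval or the observed run duration.
function extendedTimespanEnd(
  startedAt: number,          // epoch ms when the check began
  scheduleIntervalMs: number, // configured schedule interval
  runDurationMs: number       // how long the check actually took
): number {
  return startedAt + Math.max(scheduleIntervalMs, runDurationMs);
}

// 1-minute schedule, 3-minute run: the timespan now extends to minute 3.
console.log(extendedTimespanEnd(0, 60 * 1000, 180 * 1000)); // 180000
// A 20-second run keeps the usual 1-minute timespan.
console.log(extendedTimespanEnd(0, 60 * 1000, 20 * 1000)); // 60000
```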

From the above potential solutions:

Solution 1: Not viable. Gaps in the timeline can still occur for suite monitors, whose journeys run serially, so a long-running journey can significantly delay the journeys that follow it.
Solution 2: @dominiqueclarke to investigate this solution as part of https://github.com/elastic/kibana/issues/121330

Additional suggestions:

Support timeout for browser monitors: A timeout shorter than the scheduled monitor interval can prevent gaps in the timeline. @andrewvc to investigate in https://github.com/elastic/beats/issues/29454
Check for incomplete monitor status by searching for the absence of a heartbeat/summary document before resolving the alert. @dominiqueclarke to investigate in https://github.com/elastic/kibana/issues/121330
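One way to picture the incomplete-status check is the sketch below. It is a deliberate simplification that ignores the "number of down monitors" threshold and is not how the Kibana rule executor is written. `SummaryStatus` and `nextAlertActive` are hypothetical names, and `undefined` stands for "no summary document found for the latest run".

```typescript
type SummaryStatus = "up" | "down";

// Decide whether the alert should be active after this evaluation.
// If no summary document exists yet, the run is treated as incomplete
// and the previous alert state is kept rather than resolving.
function nextAlertActive(
  currentlyActive: boolean,
  latestSummary: SummaryStatus | undefined
): boolean {
  if (latestSummary === undefined) {
    return currentlyActive; // incomplete run: neither trigger nor resolve
  }
  return latestSummary === "down";
}

console.log(nextAlertActive(true, undefined)); // true: stays triggered
console.log(nextAlertActive(true, "up"));      // false: resolved
console.log(nextAlertActive(false, "down"));   // true: triggered
```

Holding the previous state while a run is incomplete is what prevents a long-running browser check from flipping the alert to resolved and back.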
