OpenSLO / slogen

tool to create and manage content for reliability tracking from logs/event data.
Apache License 2.0

Burnrate alerts aren't working correctly #47

Open lswith opened 2 years ago

lswith commented 2 years ago

I have an SLO burn-rate alert with a 30m short window and a 6h long window. I've set the same threshold on both.

The alert triggered quickly (within 5m), but it took 6 hours to resolve after the service went back to normal.

I would have expected it to be resolved quickly according to https://sre.google/workbook/alerting-on-slos/

Looking into this a bit deeper, I think that the threshold values on the monitor take 6 hours to evaluate, and it might not be possible to do "Multiwindow, Multi-Burn-Rate Alerts" using Sumo Logic's monitors.

lswith commented 2 years ago

Also, I have 3 alerts associated with an SLO: 10m-1h, 30m-6h, and 6h-24h. In Prometheus, the alerts aren't duplicated because they're grouped together (as you can see in the query in the article), but in Sumo I got 3 emails per SLO while the system was down.

lswith commented 2 years ago

Looking into this a bit more thoroughly, it looks like the monitor is evaluated over the long window, and if combined_burn exceeds 1 at any time in that window, the alert won't resolve. That means combined_burn has to stay at 1 or lower for the whole long window before the alert clears.

I think we might have to change the monitor to be evaluated over the short period of time, but move the calculations for the combined_burn into a scheduled search so that it can be evaluated over a period of time.
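To illustrate the difference, here is a minimal Python sketch of the behaviour described above. It assumes the monitor keeps an alert open as long as any sample inside its trailing evaluation window breached the threshold. That is my reading of what I observed, not documented Sumo Logic behaviour, and `first_resolved_minute` is a hypothetical helper.

```python
def first_resolved_minute(fired, window):
    """Return the first minute at which the alert would clear.

    fired  -- list of booleans, one per minute, True when the
              combined burn condition was breached in that minute
    window -- evaluation window in minutes (ASSUMPTION: the alert
              stays open while any minute in the trailing window fired)
    """
    last_fired = max(i for i, f in enumerate(fired) if f)
    for t in range(last_fired + 1, len(fired)):
        trailing = fired[max(0, t - window + 1): t + 1]
        if not any(trailing):
            return t
    return None  # never resolved within the simulated timeline


# A 10-minute incident (minutes 0..9) followed by 10 hours of healthy traffic.
timeline = [True] * 10 + [False] * 600

long_window_clear = first_resolved_minute(timeline, window=360)   # 6h window
short_window_clear = first_resolved_minute(timeline, window=30)   # 30m window

print("cleared with 6h window at minute", long_window_clear)
print("cleared with 30m window at minute", short_window_clear)
```

Under that assumption, evaluating over the 6h window keeps the alert open for 360 minutes after the last bad minute, while evaluating over the 30m window clears it after 30.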

lswith commented 2 years ago

It looks like a scheduled search wouldn't do it, but a scheduled view would. You can pre-populate the scheduled view with the current longBurnRate, and then calculate the latestBurnRate in the monitor.

Also, I've noticed that I am using the triggers for "Warning" and "ResolvedWarning", which trip when combined_burn exceeds 1. The "Critical" and "ResolvedCritical" triggers seem to trip when combined_burn exceeds 2, but this will never happen, because combined_burn can be at most 2:

if (longBurnRate > 6 , 1,0) as long_burn_exceeded
| if ( latestBurnRate > 6, 1,0) as short_burn_exceeded
| long_burn_exceeded + short_burn_exceeded as combined_burn
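
The same logic can be checked outside Sumo. This is a rough Python translation of the query fragment above (the function name and the sample burn rates are made up for illustration); it enumerates the possible outcomes to show that combined_burn only ever takes the values 0, 1, or 2.

```python
def combined_burn(long_burn_rate, latest_burn_rate, threshold=6):
    """Mirror the query: each window contributes 1 when it exceeds the threshold."""
    long_burn_exceeded = 1 if long_burn_rate > threshold else 0
    short_burn_exceeded = 1 if latest_burn_rate > threshold else 0
    return long_burn_exceeded + short_burn_exceeded


# Hypothetical burn rates below and above the threshold of 6.
samples = [0, 3, 6, 6.1, 20, 1000]
values = {combined_burn(l, s) for l in samples for s in samples}

print(sorted(values))                 # every value combined_burn can take here
print(any(v > 2 for v in values))     # would a "> 2" Critical trigger ever fire?
```

So a trigger on "combined_burn > 1" fires only when both windows exceed the threshold, and a trigger on "combined_burn > 2" cannot fire at all.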
lswith commented 2 years ago

Also, reading https://sre.google/workbook/alerting-on-slos/ more closely, it seems that they combine alerts based on the notification type.

For example:

expr: (
        job:slo_errors_per_request:ratio_rate1h{job="myjob"} > (14.4*0.001)
      and
        job:slo_errors_per_request:ratio_rate5m{job="myjob"} > (14.4*0.001)
      )
    or
      (
        job:slo_errors_per_request:ratio_rate6h{job="myjob"} > (6*0.001)
      and
        job:slo_errors_per_request:ratio_rate30m{job="myjob"} > (6*0.001)
      )
severity: page

This query combines both burn-rate alerts into one rule. If either pair of windows triggers, it sends the same notification, so there is only one notification when the alert fires and no duplicate alerts.
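
As a sanity check of how the combined expression behaves, here is a small Python sketch of the same boolean logic. The function name and the sample error ratios are hypothetical; the 0.001 error budget and the 14.4 and 6 burn-rate factors are taken from the rule quoted above.

```python
ERROR_BUDGET = 0.001  # error budget ratio used in the quoted rule


def should_page(ratio_5m, ratio_1h, ratio_30m, ratio_6h, budget=ERROR_BUDGET):
    """Return True when either (long, short) window pair is burning too fast."""
    fast_burn = ratio_1h > 14.4 * budget and ratio_5m > 14.4 * budget
    slow_burn = ratio_6h > 6 * budget and ratio_30m > 6 * budget
    return fast_burn or slow_burn


# Ongoing outage: every window is well above both thresholds -> one page.
print(should_page(ratio_5m=0.02, ratio_1h=0.02, ratio_30m=0.02, ratio_6h=0.02))

# Recovered: the long windows still remember the outage, but the short
# windows are clean, so the condition stops holding straight away.
print(should_page(ratio_5m=0.0, ratio_1h=0.02, ratio_30m=0.0, ratio_6h=0.02))

# Slow burn only: above 6x the budget but below 14.4x -> still one page.
print(should_page(ratio_5m=0.007, ratio_1h=0.007, ratio_30m=0.007, ratio_6h=0.007))
```

The second case is the behaviour I was expecting from the monitor: once the short window is healthy, the alert condition is no longer true even though the long window is still above its threshold.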

I think it might be worthwhile updating the SLO configuration to the latest OpenSLO Spec. They have added a few objects such as "AlertPolicies" which have 1 or more "Alert Conditions". This would allow the configuration to group all of the "long/short burn rate" conditions into 1 alert.

lswith commented 2 years ago

Ah damn, it looks like OpenSLO oslo doesn't support the latest OpenSLO Spec.

https://github.com/OpenSLO/oslo/issues/63

agaurav commented 2 years ago

Hey @lswith, I will discuss the monitor not resolving with the monitors team and get back to you on it by tomorrow. I recall it was meant to prevent frequent flapping between the alert opening and closing, but waiting for 6h defeats the purpose of a multi-window monitor.

The update to oslo is currently blocked for two reasons: 1) they haven't updated oslo and 2) it doesn't support multi burn rate monitors yet. I will discuss this with the OpenSLO team and will try to expedite it by raising a PR for oslo.

agaurav commented 2 years ago

The monitor team is working on adding a configurable resolution window for monitors. After that, setting the resolve window to the short-burn period will give us the correct behaviour required for these alerts. The ETA for this feature is the end of March.

cc: @tarunk2