OpenSLO / slogen

tool to create and manage content for reliability tracking from logs/event data.
Apache License 2.0
79 stars 6 forks source link

SLO Burn rate monitoring is incorrect #41

Closed lswith closed 2 years ago

lswith commented 2 years ago

Hi team, I've been using your tool extensively and I am loving it!

I have come across an issue with the monitoring of an SLO.

My current alerting configuration is as follows:

apiVersion: openslo/v1alpha
kind: SLO
metadata:
  displayName: xxx
  name: xxx
spec:
  service: xxx
  budgetingMethod: Occurrences
  objectives:
    - ratioMetrics:
        total:
          source: sumologic
          queryType: Logs
          query: |
            xxx
        good: 
          source: sumologic
          queryType: Logs
          query: 'xxx'
        incremental: true
      displayName: xxx
      target: 0.99
alerts:
  burnRate:
    - shortWindow: '10m'
      shortLimit: 14
      longWindow: '1h'
      longLimit: 14
      notifications:
        - connectionType: 'Email'
          recipients:
            - 'xxx@xxx'
          triggerFor:
            - Warning
            - ResolvedWarning
    - shortWindow: '30m'
      shortLimit: 6
      longWindow: '6h'
      longLimit: 6
      notifications:
        - connectionType: 'Email'
          recipients:
            - 'xxx@xxx'
          triggerFor:
            - Warning
            - ResolvedWarning
    - shortWindow: '6h'
      shortLimit: 1
      longWindow: '24h'
      longLimit: 1
      notifications:
        - connectionType: 'Email'
          recipients:
            - 'xxx@xxx'
          triggerFor:
            - Warning
            - ResolvedWarning

When I evaluate the SLO over a 24h period, it is currently at 98.41897 (which is below the 99 to meet the SLO).

I would have expected that I would receive at least 1 email stating that this SLO is not being met, however all the monitors generated aren't being triggered.

I'm wondering if the calculation of one of these items may be incorrect?


Current version: There is no slogen command to output the version, but I'm pointing to the latest of the main branch.

lswith commented 2 years ago

I'm wondering what happens when tmCount is 0 on this line: https://github.com/SumoLogic-Labs/slogen/blob/59d58d9c9ac440a88675755e333d1b800f9bde43/libs/monitor.go#L66

I think it should be guarded against.

lswith commented 2 years ago

Looking at my specific monitoring query it seems that the .Budget isn't being filled with 0.9 but is 0 instead.

_view=xxx
| timeslice 6h 
| sum(sliceGoodCount) as tmGood, sum(sliceTotalCount) as tmCount  group by _timeslice
| fillmissing timeslice(1m)
| tmGood/tmCount as tmSLO 
| (tmCount-tmGood) as tmBad 
| total tmCount as totalCount  
--> HERE --> | totalCount*(1-0) as errorBudget
--> HERE --> | ((tmBad/tmCount)/(1-0)) as sliceBurnRate
| if(queryEndTime() - _timeslice <= 6h,sliceBurnRate, 0  )  as latestBurnRate 
| sum(tmGood) as totalGood, max(totalCount) as totalCount, max(latestBurnRate) as latestBurnRate 
| (1-(totalGood/totalCount))/(1-0) as longBurnRate
| if (longBurnRate > 1 , 1,0) as long_burn_exceeded
| if ( latestBurnRate > 1, 1,0) as short_burn_exceeded
| long_burn_exceeded + short_burn_exceeded as combined_burn

This would actually fix the query and I can see that the the burn rate would exceed the value of 1.

agaurav commented 2 years ago

Hi team, I've been using your tool extensively and I am loving it!

thnx @lswith, its very encouraging for us to know it being useful.

I'm wondering what happens when tmCount is 0 on this line:

https://github.com/SumoLogic-Labs/slogen/blob/59d58d9c9ac440a88675755e333d1b800f9bde43/libs/monitor.go#L66

I think it should be guarded against.

Looking at my specific monitoring query it seems that the .Budget isn't being filled with 0.9 but is 0 instead.

_view=xxx
| timeslice 6h 
| sum(sliceGoodCount) as tmGood, sum(sliceTotalCount) as tmCount  group by _timeslice
| fillmissing timeslice(1m)
| tmGood/tmCount as tmSLO 
| (tmCount-tmGood) as tmBad 
| total tmCount as totalCount  
--> HERE --> | totalCount*(1-0) as errorBudget
--> HERE --> | ((tmBad/tmCount)/(1-0)) as sliceBurnRate
| if(queryEndTime() - _timeslice <= 6h,sliceBurnRate, 0  )  as latestBurnRate 
| sum(tmGood) as totalGood, max(totalCount) as totalCount, max(latestBurnRate) as latestBurnRate 
| (1-(totalGood/totalCount))/(1-0) as longBurnRate
| if (longBurnRate > 1 , 1,0) as long_burn_exceeded
| if ( latestBurnRate > 1, 1,0) as short_burn_exceeded
| long_burn_exceeded + short_burn_exceeded as combined_burn

This would actually fix the query and I can see that the the burn rate would exceed the value of 1.

great catch on both, will fix and create a new release after testing them out by end of the day.

agaurav commented 2 years ago

hey @lswith, made an attempt to fix both .Budget not being set and tmCount being 0 in v0.7.10 and tested it for a few configs.

please let me know if you still face the issue after upgrading to the new version. And mega thnx for reporting this critical bug and the cause along with it :)

lswith commented 2 years ago

Just confirmed. This fixed the issues with alerting! Thanks again