autometrics-dev / am

Autometrics Companion CLI app
Apache License 2.0
16 stars 2 forks source link

Handle custom histogram buckets #126

Open keturiosakys opened 1 year ago

keturiosakys commented 1 year ago

Some of the Autometrics libraries (Rust, Go) already support customizing the histogram buckets and others are likely to follow suit. One implication of this is that if the library produces metrics with different than default histogram buckets, the Autometrics alerting ruleset won't work. To make it easy to handle these cases am could be able to generate and apply these custom rulesets.

Ideally am would be able to infer from the code or at runtime if custom histogram buckets are used and suggest to apply them to the Prometheus config running under the hood. Alternatively, an easier path to implement would be a flag + am.toml field that would allow to set these manually

hatchan commented 1 year ago

This can be picked up together with https://github.com/autometrics-dev/am/issues/22

gagbo commented 1 year ago

Ideally am would be able to infer from the code or at runtime if custom histogram buckets are used and suggest to apply them to the Prometheus config running under the hood.

I think runtime detection is safer, because sometimes you won't be able to infer the buckets from the code (if an environment variable controls the buckets).

Also, technically the alerting ruleset will still work if you only change the buckets of the histogram, it will just be less accurate (meaning the alert might not trigger at exactly 90% or whatever SLO success rate); what will break completely the alerting rules (as "no alert would trigger at all"), is to change the percentile objective to something unsupported (so something outside 90%, 95%, 99%, and 99.9% iirc). It's still something we want to tackle for sure anyway!

IvanMerrill commented 1 year ago

I would strongly suggest that there needs to be a way to enforce a rule for there to be a bucket boundary on the SLO latency target. The loss of accuracy that @gagbo describes could be a big deal. For example:

SLO target is 95% at <= 150ms A bucket exists for >100ms & <200ms All responses take 110ms The 95% percentile would be calculated as 195ms, and an alert would trigger.

Without enforcing the boundaries to match the SLO target it's easy for people to create a situation where their well-behaved code would permanently trigger an alert.