eclipse / microprofile-metrics

microprofile-metrics
Apache License 2.0
100 stars 66 forks source link

MP Config properties for Histogram and Timers to extend and customize output #779

Closed Channyboy closed 10 months ago

Channyboy commented 12 months ago

Fixes #587, #674, #675, #676, #691

Channyboy commented 11 months ago

@tjquinno

After discussion with the liberty team. We're proposing to change the property mp.metrics.distribution.timer.slo to mp.metrics.distribution.timer. This removes the slo as it is unneeded as the timer metric can be used for more than just SLO calculations.

Also the mp.metrics.distribution.timer.maximum-expected-value and mp.metrics.distribution.timer.maximum-expected-value and the equivalent for the histogram property we want to reduce it to just:

Further more, the mp.metrics.distribution.timer.slo/ mp.metrics.distribution.timer and the mp.metrics.distribution.(max/min)-value properties we had defined that it would require a time unit (i.e. ms, s, m, h). Otherwise we would ignore it. We would like to provide a default unit of seconds if no time unit was specified.

tjquinno commented 11 months ago

@Channyboy

Thanks for tagging me on this.

A few thoughts...

  1. Regarding changing to min-value and max-value: seems reasonable--shorter with no real loss in clarity.
  2. Regarding defaulting durations/SLOs unit to seconds...

    There are other places in MP where time units default to something--for example, in fault tolerance @Timeout defaults to ms and in that case ms seems like a very natural choice given what's being timed: method invocations.

    With SLOs we have no idea in advance what a typical time range is going to be for any given timer because we don't know what sort of quantity any given timer measures (method invocations, atomic particle decay times, police and fire unit response times, government agency license application response times).

    I'm OK with--and perhaps in favor of--ignoring with a warning or even rejecting with an error any unit-less SLO setting. Requiring explicit units (and failing start-up in their absence) removes any ambiguity.

  3. For the config key for SLOs... I am inclined to keep the slo suffix. It keeps very explicit what those values correspond to which helps the config keys be more "self-documenting."

    Further, although we might not introduce them now, there could be other timer-related settings we'd want to be able to support via configuration in the future. For example, a Micrometer Timer supports the publishPercentiles method which accepts a list of values, as does slos.

    If, now in MP metrics config, we "up-level" the slo settings up to the timer level, then later we add config for publish percentiles, it could be somewhat ambiguous to users which settings the "top-level" values correspond to and it might seem to arbitrarily ascribe more importance to the SLOs than to the percentiles.

    Because slo is a short string, we're not imposing a lot more typing time to users who edit config files and we keep some clarity now and, potentially, going forward.

Channyboy commented 11 months ago

@tjquinno

  1. Regarding use of milliseconds as the default unit, that would make more sense. There is more control with milliseconds. Seconds was brought up as it was the same unit that will be used/converted-to in the Prometheus output. Regardless of which, I think the discussion brought up that we should have a default unit instead of ignoring values.

  2. We could also try mp.metrics.distribution.timer.buckets for a more generic sounding property?

**Not sure if the publishPercentiles was just used as a quick example. The mp.metrics.distribution.percentiles applies to both Histogram and Timer metric. So for any implementation using micrometer, this config property would affect a timer being built using the publishPercentile(..) method.

@donbourne wdyt?

Emily-Jiang commented 11 months ago

I would suggest to use mp.metrics.distribution.timer as it sounds more natural. Ending .buckets sounds weird to specify the value with a time unit.

donbourne commented 11 months ago

For (2), I'm in favor of having a default (s or ms). Assuming we have a default, the worst case scenario is that the person configuring the value thinks the default unit is different than what it actually is -- but since the metrics we generate show the buckets, that should be a fairly easy thing for the ops person to spot and fix. If we didn't have a default unit then the error handling would be more complex, both for the runtime (do we not start the app if the slo is configured in a file in the app? do we not start the server if it's configured in an env var?), and for the ops person (more ways to get something wrong that you have to correct). @Channyboy , which unit do the timer bucket values in the /metrics output have (our unit default should match, IMO)?

For (3), another argument for sticking with mp.metrics.distribution.timer.slo or mp.metrics.distribution.timer.buckets is that timers can have both percentile values and bucket time values, so it's not intuitive to just ask someone to fill in the values for mp.metrics.distribution.timer. I think I slightly prefer mp.metrics.distribution.timer.buckets over mp.metrics.distribution.timer.slo, since there's no guarantee that you're timing something for the purpose of tracking an SLO, and since it's very semantically related to mp.metrics.distribution.histogram.buckets.

Channyboy commented 11 months ago

The buckets for a Timer in Prometheus output will be converted to seconds.

For any incompatible values (either server or app level), I wouldn't think we would prevent the runtime from starting. We would just ignore it (with a warnning/error).

tjquinno commented 11 months ago

Re: 2. (default units)

I guess I will (reluctantly) step back from urging a start-up failure for unit-less settings!

Then the question is what's a good default unit. I mentioned the FT default not necessarily to advocate for that as the default for this usage but because, in the FT case, everyone knows that it is method invocations being measured and so ms seems like reasonable units for that. If we expect that most uses of timers are for measuring method invocations, then ms makes sense here as well.

Re: 3. (config key suffix for bucket/slos)

The config key(s) must unambiguously identify what is being set.

IIUC (and I could well be wrong), the Micrometer API, as an example, allows setting bucket boundary values and also, via a separate method, SLOs.

The JavaDoc for Timer says (to me at least) that setting SLOs makes sure that those values are among those (perhaps along with others?) in the set of bucket boundaries.

My point is that we need to very clear exactly what we are allowing the user to set via config, and we should use a strong key name to achieve that clarity. If we are going to interpret the settings as bucket boundaries, then buckets is a very good suffix. Although it might sound a bit informal, it's a term widely used in discussing and documenting distribution summaries so that's fine.

On the other hand, if we are going to use the configured values as parameters to the serviceLevelObjectives method then the suffix needs to reflect that.

Or would we expose both?

Channyboy commented 11 months ago

@tjquinno @donbourne

Springboot's management.metrics.distribution.slo property uses ms as a default if no time unit is defined.

Channyboy commented 11 months ago

@tjquinno @donbourne

Will go ahead and update PR with the change to use mp.metrics.distribution.timer.buckets and use default units of ms

donbourne commented 10 months ago

If we expect that most uses of timers are for measuring method invocations, then ms makes sense here as well.

I expect most timings will be for methods or services, and generally ms is the right granularity for those, so I think ms makes sense. While Timers can be used to time other long-running things, I expect that's less common.

On the other hand, if we are going to use the configured values as parameters to the serviceLevelObjectives method then the suffix needs to reflect that.

while Micrometer calls these SLOs, that doesn't surface in the MP Metrics API, so I don't think we need to do things the same as Micrometer here. I'm ok with either buckets or slo. On balance I lean towards buckets because Timers don't have to be timing things for the purpose of tracking SLOs, and to show the semantic consistency with the buckets setting for Histograms.

Channyboy commented 10 months ago

@tjquinno We're aiming to release an RC1 this friday (or sooner!) May you please review.

Channyboy commented 10 months ago

@tjquinno