Closed Channyboy closed 10 months ago
@tjquinno
After discussion with the liberty team.
We're proposing to change the property mp.metrics.distribution.timer.slo
to mp.metrics.distribution.timer
. This removes the slo as it is unneeded as the timer metric can be used for more than just SLO calculations.
Also the mp.metrics.distribution.timer.maximum-expected-value
and mp.metrics.distribution.timer.maximum-expected-value
and the equivalent for the histogram property we want to reduce it to just:
mp.metrics.distribution.timer.max-value
mp.metrics.distribution.timer.min-value
mp.metrics.distribution.histogram.max-value
mp.metrics.distribution.histogram.min-value
^^The reason being the expected
not being needed and other MP specs use max
and min
. Also we shouldn't really care if that original springboot was set that way.Further more, the mp.metrics.distribution.timer.slo
/ mp.metrics.distribution.timer
and the mp.metrics.distribution.(max/min)-value
properties we had defined that it would require a time unit (i.e. ms, s, m, h). Otherwise we would ignore it. We would like to provide a default unit of seconds
if no time unit was specified.
@Channyboy
Thanks for tagging me on this.
A few thoughts...
min-value
and max-value
: seems reasonable--shorter with no real loss in clarity.Regarding defaulting durations/SLOs unit to seconds...
There are other places in MP where time units default to something--for example, in fault tolerance @Timeout
defaults to ms and in that case ms seems like a very natural choice given what's being timed: method invocations.
With SLOs we have no idea in advance what a typical time range is going to be for any given timer because we don't know what sort of quantity any given timer measures (method invocations, atomic particle decay times, police and fire unit response times, government agency license application response times).
I'm OK with--and perhaps in favor of--ignoring with a warning or even rejecting with an error any unit-less SLO setting. Requiring explicit units (and failing start-up in their absence) removes any ambiguity.
For the config key for SLOs... I am inclined to keep the slo
suffix. It keeps very explicit what those values correspond to which helps the config keys be more "self-documenting."
Further, although we might not introduce them now, there could be other timer-related settings we'd want to be able to support via configuration in the future. For example, a Micrometer Timer
supports the publishPercentiles
method which accepts a list of values, as does slos
.
If, now in MP metrics config, we "up-level" the slo settings up to the timer
level, then later we add config for publish percentiles, it could be somewhat ambiguous to users which settings the "top-level" values correspond to and it might seem to arbitrarily ascribe more importance to the SLOs than to the percentiles.
Because slo
is a short string, we're not imposing a lot more typing time to users who edit config files and we keep some clarity now and, potentially, going forward.
@tjquinno
Regarding use of milliseconds as the default unit, that would make more sense. There is more control with milliseconds. Seconds was brought up as it was the same unit that will be used/converted-to in the Prometheus output. Regardless of which, I think the discussion brought up that we should have a default unit instead of ignoring values.
We could also try mp.metrics.distribution.timer.buckets
for a more generic sounding property?
**Not sure if the publishPercentiles was just used as a quick example. The mp.metrics.distribution.percentiles
applies to both Histogram and Timer metric. So for any implementation using micrometer, this config property would affect a timer being built using the publishPercentile(..)
method.
@donbourne wdyt?
I would suggest to use mp.metrics.distribution.timer
as it sounds more natural. Ending .buckets
sounds weird to specify the value with a time unit.
For (2), I'm in favor of having a default (s or ms). Assuming we have a default, the worst case scenario is that the person configuring the value thinks the default unit is different than what it actually is -- but since the metrics we generate show the buckets, that should be a fairly easy thing for the ops person to spot and fix. If we didn't have a default unit then the error handling would be more complex, both for the runtime (do we not start the app if the slo is configured in a file in the app? do we not start the server if it's configured in an env var?), and for the ops person (more ways to get something wrong that you have to correct). @Channyboy , which unit do the timer bucket values in the /metrics output have (our unit default should match, IMO)?
For (3), another argument for sticking with mp.metrics.distribution.timer.slo
or mp.metrics.distribution.timer.buckets
is that timers can have both percentile values and bucket time values, so it's not intuitive to just ask someone to fill in the values for mp.metrics.distribution.timer
. I think I slightly prefer mp.metrics.distribution.timer.buckets
over mp.metrics.distribution.timer.slo
, since there's no guarantee that you're timing something for the purpose of tracking an SLO, and since it's very semantically related to mp.metrics.distribution.histogram.buckets
.
The buckets for a Timer in Prometheus output will be converted to seconds.
For any incompatible values (either server or app level), I wouldn't think we would prevent the runtime from starting. We would just ignore it (with a warnning/error).
Re: 2. (default units)
I guess I will (reluctantly) step back from urging a start-up failure for unit-less settings!
Then the question is what's a good default unit. I mentioned the FT default not necessarily to advocate for that as the default for this usage but because, in the FT case, everyone knows that it is method invocations being measured and so ms seems like reasonable units for that. If we expect that most uses of timers are for measuring method invocations, then ms makes sense here as well.
Re: 3. (config key suffix for bucket/slos)
The config key(s) must unambiguously identify what is being set.
IIUC (and I could well be wrong), the Micrometer API, as an example, allows setting bucket boundary values and also, via a separate method, SLOs.
The JavaDoc for Timer says (to me at least) that setting SLOs makes sure that those values are among those (perhaps along with others?) in the set of bucket boundaries.
My point is that we need to very clear exactly what we are allowing the user to set via config, and we should use a strong key name to achieve that clarity. If we are going to interpret the settings as bucket boundaries, then buckets
is a very good suffix. Although it might sound a bit informal, it's a term widely used in discussing and documenting distribution summaries so that's fine.
On the other hand, if we are going to use the configured values as parameters to the serviceLevelObjectives
method then the suffix needs to reflect that.
Or would we expose both?
@tjquinno @donbourne
Springboot's management.metrics.distribution.slo
property uses ms
as a default if no time unit is defined.
@tjquinno @donbourne
Will go ahead and update PR with the change to use mp.metrics.distribution.timer.buckets
and use default units of ms
If we expect that most uses of timers are for measuring method invocations, then ms makes sense here as well.
I expect most timings will be for methods or services, and generally ms
is the right granularity for those, so I think ms
makes sense. While Timers can be used to time other long-running things, I expect that's less common.
On the other hand, if we are going to use the configured values as parameters to the serviceLevelObjectives method then the suffix needs to reflect that.
while Micrometer calls these SLOs, that doesn't surface in the MP Metrics API, so I don't think we need to do things the same as Micrometer here. I'm ok with either buckets
or slo
. On balance I lean towards buckets
because Timers don't have to be timing things for the purpose of tracking SLOs, and to show the semantic consistency with the buckets
setting for Histograms.
@tjquinno We're aiming to release an RC1 this friday (or sooner!) May you please review.
@tjquinno
Fixes #587, #674, #675, #676, #691