eclipse / microprofile-metrics


Add Timed histogram #587

Closed: theute closed this issue 1 year ago

theute commented 4 years ago

It would be very useful to extend "Timed" (or another annotation) to have histogram data.

We use those metrics to calculate our SLO: any response above 2s does not meet our SLO, and we need to know about it. With buckets and a total count of hits we could calculate this; with the current metrics exposed by "Timed" or "SimplyTimed" we can't.

With Python, the Starlette exporter does this and it happens to be very useful: now we can tell how many requests are above or below a particular threshold. The list of buckets should be configurable.

Here are the metrics I get for an app in Python that I'd love to see for my Java app (the last 2 metrics are the same metrics exposed by "SimplyTimed"):

starlette_requests_processing_time_seconds_bucket{le="0.005",method="GET",path_template="/metrics"} 12705.0
starlette_requests_processing_time_seconds_bucket{le="0.01",method="GET",path_template="/metrics"} 12746.0
starlette_requests_processing_time_seconds_bucket{le="0.025",method="GET",path_template="/metrics"} 12785.0
starlette_requests_processing_time_seconds_bucket{le="0.05",method="GET",path_template="/metrics"} 12785.0
starlette_requests_processing_time_seconds_bucket{le="0.075",method="GET",path_template="/metrics"} 12785.0
starlette_requests_processing_time_seconds_bucket{le="0.1",method="GET",path_template="/metrics"} 12785.0
starlette_requests_processing_time_seconds_bucket{le="0.25",method="GET",path_template="/metrics"} 12785.0
starlette_requests_processing_time_seconds_bucket{le="0.5",method="GET",path_template="/metrics"} 12788.0
starlette_requests_processing_time_seconds_bucket{le="0.75",method="GET",path_template="/metrics"} 12789.0
starlette_requests_processing_time_seconds_bucket{le="1.0",method="GET",path_template="/metrics"} 12789.0
starlette_requests_processing_time_seconds_bucket{le="2.5",method="GET",path_template="/metrics"} 12789.0
starlette_requests_processing_time_seconds_bucket{le="5.0",method="GET",path_template="/metrics"} 12789.0
starlette_requests_processing_time_seconds_bucket{le="7.5",method="GET",path_template="/metrics"} 12789.0
starlette_requests_processing_time_seconds_bucket{le="10.0",method="GET",path_template="/metrics"} 12789.0
starlette_requests_processing_time_seconds_bucket{le="+Inf",method="GET",path_template="/metrics"} 12789.0
starlette_requests_processing_time_seconds_count{method="GET",path_template="/metrics"} 12789.0
starlette_requests_processing_time_seconds_sum{method="GET",path_template="/metrics"} 23.30231829994591
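
Since the le buckets are cumulative, the threshold question can be answered straight from this output; for example, 12788 of the 12789 recorded requests (about 99.99%) completed within 0.5 seconds.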

Thanks!

jmartisk commented 4 years ago

With @Timed you currently get a histogram, except it's somewhat "reversed".

As I understand it, you want a histogram that tells you "X1 requests have finished under Y1 seconds, X2 requests have finished under Y2 seconds, ...", where Y1, Y2, ... are user-defined buckets and X1, X2, ... are computed counts,

while with @Timed you currently get "X1 percent of requests have finished under Y1 seconds, X2 percent of requests have finished under Y2 seconds, ...", where X1, X2, ... are the percentiles pre-defined by the spec and Y1, Y2, ... are computed durations.
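
For context, a minimal sketch of how @Timed is used today (the JAX-RS resource and metric name are just illustrative); the Prometheus export of this timer carries the spec-defined percentiles rather than user-defined buckets:

```java
import javax.ws.rs.GET;
import javax.ws.rs.Path;

import org.eclipse.microprofile.metrics.annotation.Timed;

@Path("/orders")
public class OrderResource {

    // The existing annotation: a Timer is registered for this method and the
    // Prometheus/OpenMetrics export publishes the spec-defined percentiles
    // (quantile labels), not user-defined "le" buckets.
    @GET
    @Timed(name = "ordersTimer", description = "Time spent serving GET /orders")
    public String listOrders() {
        return "[]";
    }
}
```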

If we exported both of these histogram formats at the same time, the output would get very confusing. Perhaps we could have a switch between one format and the other, but that's still not very easy to grasp, and I'm not even sure what we would call them. Then there is the question of how to do that switch:

Adding extra fields to the metadata of just one particular metric type is not something the current MP Metrics API is ready for: every metric type shares the same metadata type, so the field would exist for all metric types, including those where it's not applicable, and the model would become confusing.

I can imagine having an experimental config option in SmallRye that switches ALL timers to the other histogram format, but I don't have any other idea right now for how to do this without getting too crazy.

Allowing you to define the percentiles (X1, X2, ...) in the current format would not be sufficient for you, I suppose?

theute commented 4 years ago

Right, I saw the percentiles with "Timed", but I'm really interested in data based on threshold values and less in percentages.

For instance, with 3Scale I get this chart in Grafana, and it's quite useful: it's easy to spot the exceptions, the distribution, as well as the load at a particular time (with lower traffic the percentiles are not very useful):

(Grafana screenshots attached)

jmartisk commented 4 years ago

So I can imagine adding a SmallRye-specific (at least for now; if it proves useful it will be backported to the spec later) option that switches timers (and potentially raw Histograms as a metric type too) from computing percentiles (default) to computing counts of values which are over some thresholds.

This could be implemented as either

  1. a global configuration property that switches all timers/histograms to this. Another property could specify the thresholds, but then they would probably need to be the same for all timers/histograms, which is probably bad, especially since histograms can also use non-time-based units
  2. a boolean entry in ExtendedMetadata, so programmatically created timers/histograms using ExtendedMetadata would be able to make use of this; we could also add an array-type field to specify the threshold values
  3. we could also introduce our own SmallRye-specific version of the @Timed annotation that enables this (there's no annotation for histograms, so they would need to use ExtendedMetadata directly)
  4. A combination of 2 and 3 - this would work for both timers and histograms, and for annotated metrics as well as programmatically created metrics...

A combination of 2 and 3 sounds best to me right now. WDYT? Or perhaps @donbourne has a better idea of how to integrate this without turning the current MP Metrics API into a mess...

theute commented 4 years ago

I could imagine that some people would want both. Do we need a switch from one to the other? I don't know the cost, so maybe an option to turn some of them off would work (that may have worked instead of @SimplyTimed as well?). I think the buckets for Timers could/should be defined once for the app.

I don't understand the details of your options, I'm not used to SmallRye at all.

Now I can think of another "feature": we time our REST calls (JAX-RS), and separating calls that end up with server errors from successful answers (not just the total) would also be useful; this also applies to percentiles. Calls that end with an HTTP 500 aren't interesting for latency checks.

In the Google SRE Book https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/ :

Latency
    The time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.

I can open a new ticket for this, or maybe that's beyond what SmallRye metrics would want to provide and we should have our own filter anyway...

jmartisk commented 4 years ago

I think we could extend the @Timed annotation with an enum argument named histogramExportFormat that would have the values PERCENTILES and THRESHOLD_VALUES, and also BOTH if someone wants both formats to be exported at the same time. The metric metadata would receive the same field.
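
A rough sketch of what that proposal could look like; neither the enum nor the annotation member exists in the current MP Metrics API, so this is purely illustrative:

```java
// Hypothetical sketch of the proposal; names taken from the discussion above.
public enum HistogramExportFormat {
    PERCENTILES,      // current behaviour: spec-defined quantiles
    THRESHOLD_VALUES, // cumulative counts of samples under configured thresholds
    BOTH              // export both representations at once
}

// Usage would then look roughly like:
// @Timed(name = "ordersTimer",
//        histogramExportFormat = HistogramExportFormat.THRESHOLD_VALUES)
```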

For the buckets, I think defining them once per app can be problematic, because histograms can also use non-time-based units, so it's not clear how the setting would apply to them. We'd probably need to apply the setting only to time-based histograms and require other histograms to define their buckets on a per-metric basis...

Now I can think of another "feature": we time our REST calls (JAX-RS), and separating calls that end up with server errors from successful answers (not just the total) would also be useful; this also applies to percentiles. Calls that end with an HTTP 500 aren't interesting for latency checks.

This should be addressed in Metrics 3.0 through https://github.com/eclipse/microprofile-metrics/pull/585 - I hope this will satisfy your requirements - the statistics about erroneous calls will be separated.

jbndbc commented 4 years ago

I would like to second this request, since the OpenMetrics summary types currently being produced cannot be aggregated by, for example, Prometheus: quantiles cannot be averaged, and can therefore only be used for metrics that make sense at an individual-instance level.

theute commented 4 years ago

For the buckets, I think defining them once per app can be problematic, because histograms can also use non-time-based units, so it's not clear how the setting would apply to them. We'd probably need to apply the setting only to time-based histograms and require other histograms to define their buckets on a per-metric basis...

I don't really know :)

Now I can think of another "feature": we time our REST calls (JAX-RS), and separating calls that end up with server errors from successful answers (not just the total) would also be useful; this also applies to percentiles. Calls that end with an HTTP 500 aren't interesting for latency checks.

This should be addressed in Metrics 3.0 through #585 - I hope this will satisfy your requirements - the statistics about erroneous calls will be separated

I guess so, thanks !

jmartisk commented 4 years ago

I reported https://github.com/smallrye/smallrye-metrics/issues/325 because this will first be implemented as an experimental feature in an implementation before being backported to the spec; let's discuss implementation details there for now.

ebullient commented 4 years ago

Micrometer timers can have distribution summaries or histogram buckets associated with them. Would this satisfy what you're asking for? https://micrometer.io/docs/concepts#_timers + https://micrometer.io/docs/concepts#_histograms_and_percentiles
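
For comparison, a minimal sketch of a Micrometer timer with explicit bucket boundaries, assuming Micrometer 1.5+ where Timer.Builder#serviceLevelObjectives is available; the registry and metric name are placeholders:

```java
import java.time.Duration;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class MicrometerTimerSketch {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // Configure the timer to publish a cumulative histogram and add
        // explicit SLO boundaries; a Prometheus-backed registry would export
        // these as "le" buckets alongside count and sum.
        Timer timer = Timer.builder("requests.processing.time")
                .publishPercentileHistogram()
                .serviceLevelObjectives(Duration.ofMillis(500), Duration.ofSeconds(2))
                .register(registry);

        timer.record(Duration.ofMillis(120)); // record one sample
    }
}
```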