eclipse / microprofile-metrics

microprofile-metrics
Apache License 2.0
100 stars 66 forks source link

Need a @SimpleTimer? #496

Closed donbourne closed 4 years ago

donbourne commented 4 years ago

The current Timer includes a lot of calculated metrics which, while useful for simple consumers that can't calculate things like means and distributions, is not necessary nor recommended for metrics intended to be consumed by Prometheus.

For Prometheus the better approach would be to provide just the count of hits and the total time spent in the method. Prometheus can do the other calculations based on this raw data, with less CPU cost to the instrumented microservice for computing the metric and with much less memory consumption.

We could do this as a new annotation @SimpleTimer or could make this an option on the existing @Timed annotation. (Note the name @SimpleTimer could use work).

jmartisk commented 4 years ago

I'd vote for having a wholly new metric type rather than configuring an existing type, because that would kinda break some assumptions that currently can be made about the metric model. For example it would get problematic due to existence of the @Metric annotation which needs to contain the union of all parameters of other metric annotations, so it would need to have this parameter too, and that parameter would be applicable only when used with simple timers, so the API would get somewhat chaotic.

@SimpleTimer sounds good to me, at least I don't have any better idea at the moment

donbourne commented 4 years ago

Concern I have with @SimpleTimer as a name is that it doesn't follow the pattern of the other names...

This method is @Timed This method is @Counted This method is @Metered This method is @SimpleTimer -- doesn't follow pattern

Note that I think we already broke the pattern above with @ConcurrentGauge.

So perhaps it could be called @SimplyTimed ?? I know that sounds odd.

keilw commented 4 years ago

It really sounds weird, why can't the annotation "simply" get some optional attributes like "calculate" or similar to control whether it should do calculations by itself or leave it to a third party?

There are different cases like processing certain operations on the client instead of server to save cloud bandwidth, so there is certainly use for both a self-calculated and "remote-calculated" approach.

jmartisk commented 4 years ago

Well, this certainly would be possible, but it would break some assumptions that can currently be made about the Metrics model.

If you add a new parameter to the @Timed annotation that other metric annotations don't have, how will you reflect this change in the @Metric annotation? That special annotation basically has the union of all parameters that all regular metric annotations have. This would mean @Metric would need add a new parameter that is valid only in case that the @Metric annotation is used in the context of a Timer. That's why we currently try to keep the sets of parameters the same across all metric types. The only minor exception that we currently have is that the reusable attribute is not applicable to gauges.

And it's not just about the @Metric annotation, the same would need to be added to the Metadata interface. In that case, to keep some sanity in the model, we'd probably have to introduce different metadata interfaces per each metric type, because different metric types will no longer share (almost) the same metadata.

I agree your suggestion makes sense, but I think the model isn't quite prepared for this at the moment without some more fundamental rethinking.

avnerstr commented 4 years ago

Maybe instead of using a new annotation, we can add a new path- like /short-metrics - that will be used for Prometheus or other that need the short data while metrics will continue to display the simple metrics for backward compatibility and simple users WDYT? you are also welcome to think about a new metrics path name, not sure it's the best

avnerstr commented 4 years ago

The current Timer includes a lot of calculated metrics which, while useful for simple consumers that can't calculate things like means and distributions, is not necessary nor recommended for metrics intended to be consumed by Prometheus.

For Prometheus the better approach would be to provide just the count of hits and the total time spent in the method. Prometheus can do the other calculations based on this raw data, with less CPU cost to the instrumented microservice for computing the metric and with much less memory consumption.

We could do this as a new annotation @SimpleTimer or could make this an option on the existing @Timed annotation. (Note the name @SimpleTimer could use work).

I have a question regarding the @SimpleTimer, how you can provide the total time spent in method to Prometheus?

jmartisk commented 4 years ago

Maybe instead of using a new annotation, we can add a new path- like /short-metrics - that will be used for Prometheus or other that need the short data

One of the goals of SimpleTimer AFAIU is to be more lightweight in terms of the CPU/memory cost on the server side. Adding a new endpoint would only increase resource consumption.

Also, these short/long variants would currently be only applicable to timers (maybe meters potentially too, not sure) so it's a question whether it's worth complicating things with two different endpoints to get just one or two metric types exported differently. If we decided we need a similar approach (let the client decide what they want), I'd rather prefer a flag that can be passed by the client using a HTTP header or something.

I have a question regarding the @SimpleTimer, how you can provide the total time spent in method to Prometheus?

I think I don't understand your question, are you asking why want to do this, or how it will be technically achieved?

donbourne commented 4 years ago

Maybe instead of using a new annotation, we can add a new path- like /short-metrics - that will be used for Prometheus or other that need the short data

I agree with the thought behind this -- which I think is that it would be nice if choosing what kind of output you get from Timers was a client-side decision. To expand a little on what @jmartisk said though, the moving averages, percentiles, actually involve ongoing computation as the metrics are recorded, not just at the time the metrics are exported, which is why this doesn't work well as a client-side decision.

Another option could be to make this a deployment time decision by giving a config property people could set to indicate what Timers should track and what fields they would make available for export. Code that interacts directly with Timers would need to deal with the possibility that a Timer would sometimes implement one interface and sometimes another.

That all said, I think it perhaps comes down to whether we foresee SimpleTimers superceding Timers or whether we see each being helpful in different circumstances.

If we think SimpleTimers supercede Timers we perhaps would want to deprecate Timers and remove them in the future (or even remove SimpleTimers in the future and just make Timers always behave like SimpleTimers).

If we think SimpleTimers and Timers are helpful in different circumstances, and we think that it's a deployers choice, then perhaps the deployment-time switch makes sense.

If we think SimpleTimers and Timers are helpful in different circumstances, and we think that is a developers choice, then we'd want two separate timer types and annotations.

Having worked with quite a few metrics now, as well as Prometheus/Grafana my personal opinion is that SimpleTimers supercede Timers as there's little use for a heavy-weight Timer built around exponentially weighted moving averages.

avnerstr commented 4 years ago

I understand your point regarding the endpoint. we can add a configuration option to the user, weather he wants to use the simple/short endpoint or the heavy-weight or both. In that way, in case he choose only the simple, there won't be heavy calculation on the MS and we'll be left only with the existing @Timer annotation. so there will be no need for developers to rewrite their annotations.

regarding : "I have a question regarding the @SimpleTimer, how you can provide the total time spent in method to Prometheus?

I think I don't understand your question, are you asking why want to do this, or how it will be technically achieved?"

I mean technically, I looked in the Dropwizard and there is no SimpleTimer there, so I wonder what is the right implementation for that. If I understand correctly, in order that Prometheus will do the calculations, we need to provide him all the elapsed time of all the methods between each sample. otherwise his calculations won't be accurate. so eventually it can cause us to send more data to Prometheus than on the @Timer. WDYT?

avnerstr commented 4 years ago

I have a question regarding the simpleTimer - it means that if for example I will return count and elapsed time of all the requests together. It means that I'm loosing data. I can only calculate the avg time of the requests. Is this what you mean by doing simpleTimer?

donbourne commented 4 years ago

I have a question regarding the simpleTimer - it means that if for example I will return count and elapsed time of all the requests together. It means that I'm loosing data. I can only calculate the avg time of the requests. Is this what you mean by doing simpleTimer?

Primarily you will be able to get the average time of the requests. With Prometheus you can look at the average since the metric was created, or you can look at the average in the last few minutes or the average offset to look at some time period in the past. You could also derive things like the max / min elapsed time -- though that would be calculated per sampling period, so would tell you max / min for a given time range, not max / min for individual requests.

avnerstr commented 4 years ago

I have a question regarding the simpleTimer - it means that if for example I will return count and elapsed time of all the requests together. It means that I'm loosing data. I can only calculate the avg time of the requests. Is this what you mean by doing simpleTimer?

Primarily you will be able to get the average time of the requests. With Prometheus you can look at the average since the metric was created, or you can look at the average in the last few minutes or the average offset to look at some time period in the past. You could also derive things like the max / min elapsed time -- though that would be calculated per sampling period, so would tell you max / min for a given time range, not max / min for individual requests.

But it this way, we are loosing a lot of important data. the elapsed time is aggregation of all the requests during the sampling period. it can be problematic when I want to monitor application behaviour.

donbourne commented 4 years ago

@avnerstr , I'm not sure we're thinking the same thing. You can have as many separate SimpleTimers as you want. For example, for JAX-RS requests we're aiming to have a separate SimpleTimer for each class/method name combination.

Can you say more about what data you think you need that we may be missing?

avnerstr commented 4 years ago

@avnerstr , I'm not sure we're thinking the same thing. You can have as many separate SimpleTimers as you want. For example, for JAX-RS requests we're aiming to have a separate SimpleTimer for each class/method name combination.

Can you say more about what data you think you need that we may be missing?

let me understand if I get you right: in the @simpleTimer I will get the total of request to the same request path and the sum of all the requests time (elapsed time) for example if I have 2 request to /foo one took 1sec another took 3sec between sampling in the simple timer I will have: count: 2 elapsedTime: 4

if this is the case, how can I calculate in prometheus what is the max time of the request?

donbourne commented 4 years ago

you can't -- fair point. Perhaps we should consider adding recent max/min like we have recently introduced with ConcurrentGauge. I think we should avoid making SimpleTimer expose things that can't be calculated by tools consuming the metrics, but max/min aren't derivable from total hits and elapsed time. That could potentially happen after version 2.3, since it's just adding new stats to the same metric.

avnerstr commented 4 years ago

you can't -- fair point. Perhaps we should consider adding recent max/min like we have recently introduced with ConcurrentGauge. I think we should avoid making SimpleTimer expose things that can't be calculated by tools consuming the metrics, but max/min aren't derivable from total hits and elapsed time. That could potentially happen after version 2.3, since it's just adding new stats to the same metric.

The problem with SimpleTimer as I see it, that the only calculation that can be done giving this data, for every sampling, on external tool is mean. If I will have high variance between requests execution time, giving only elapsed time and count will not give much. I think that maybe the SimpleTimer should be just a short list of metrics than the Timer. for example - only par95, par99 and stddev. otherwise, SimpleTimer can really give you false feeling about your system performance

donbourne commented 4 years ago

@avnerstr I would assert that mostly you would use the average elapsed time per request value (within some time range) and min/max (within some time range) most. With that in mind I'm thinking two things...

  1. we could add min / max to the SimpleTimer (in the style of concurrentGauge)
  2. we could add elapsedTime to Timer (so that if you're paying for the heavier timer at least you can do simple averages too).

I'd also like to propose that we keep SimpleTimer as count and elapsedTime for MP Metrics 2.3 and add the min/max in the next version we put out as we're getting close to the end of time for 2.3. Doing so would not break backward compatibility for apps, so should be fairly easy to add.

Does that make sense?

keilw commented 4 years ago

So the whole Timer / Timed structure is completely duplicated now? A bit like Integer vs. BigInteger in the JDK ;-|

donbourne commented 4 years ago

It is -- but the reason for that, IMO, is that over time I expect SimpleTimer to supercede Timer. That's why it makes more sense to start a new simpler, cheaper one than to edit the existing one.

That said, by adding elapsedTime to Timer in the future, and adding min/max to SimpleTimer, people can choose if they need the bigger function of Timer, which requires much more computation, or the stats that SimpleTimer brings. Timer can still be useful -- but probably in select cases where you care about percentiles in detail.

keilw commented 4 years ago

So what should happen, should it later be renamed? The best way for the API elements like Timer could have been to somehow make them compatible, right now the only thing they have in common is the tagging interface Metric. Of course for annotations that is not so easy. The naming is clearly quite unfortunate, the only example I could think of would be SimpleDateFormat which extends the general DateFormat, otherwise the creation is a bit more simple and intuitive.

donbourne commented 4 years ago

@keilw, I don't think the names are too bad -- going forward we'll have a Timer (for more statistically deep timing needs), and a SimpleTimer (for typical things like timing requests).

I'm not sure API compatibility between Timer and SimpleTimer matters. Consistency, yes... which we have because Timer and SimpleTimer have almost the same Java API.

keilw commented 4 years ago

Well almost more importantly is Timer the only case where calculations are done, or are we likely going to see all kinds of Simple* duplications like SimpleGauge next? For just one extra annotation plus the underlying types it may still be acceptable, but if more gets duplicated it would be better (although a bit late now because they are in the same package) to use some modularity and allow people to use either of these and not even bother with the other one? ;-) I know there's a lot of mess in the JDK that did not allow e.g. a single Date/Time API or framework (instead there are at least 3 in java.base entangled forever) but if you aim for "Micro" wouldn't it be better to have a choice and not carry unnecessary payload with you? Again, if there's never more "simplification" than the Timer, it sounds tolerable, but if more was to come, I'd rather offer modules for those kinds of things.