beam-telemetry / telemetry_metrics

Collect and aggregate Telemetry events over time
https://hexdocs.pm/telemetry_metrics
Apache License 2.0
205 stars 33 forks source link

Remove the sum metric #49

Closed josevalim closed 4 years ago

josevalim commented 5 years ago

Now that we have added summary, it probably makes sense to remove sum. This will also makes us fully aligned with opentelemetry types.

josevalim commented 5 years ago

/cc @binaryseed @arkgil

arkgil commented 5 years ago

I'm not sure why summary replaces the sum? Do you mean that it should be included in summaries provided by reporters?

josevalim commented 5 years ago

Yes, that’s my understanding. OpenTelemetry doesn’t provide sum either.

arkgil commented 5 years ago

From [the spec] it looks like OpenTelemetry gauge is closer to what we call a sum. Maybe we should wait a bit until OpenTelemetry spec is completed?

josevalim commented 5 years ago

Wouldn't gauge be last_value?

arkgil commented 5 years ago

Yes, that's right, but OpenTelemetry also allows increasing and decreasing gauges by given value. So our sum could be OpenTelemetry gauge where we add/subtract, and with last_value we only set it to absolute value.

tsloughter commented 4 years ago

Just saw this and wanted to mention that I think the OpenTelemetry metrics API defined in the specification repo is fairly complete now. And there is an open PR to add the SDK spec https://github.com/open-telemetry/opentelemetry-specification/pull/347

I should have mentioned this earlier since they've been looking for feedback for a while now, I've been too busy completing the tracing stuff that I hadn't even thought about metrics.

There are also Metrics "office hours" every Thursday at 11am PST https://lightstep.zoom.us/j/872020227

I don't think it has views yet, like OpenCensus. We might not have the pain and incompatibility between Telemetry.Metrics and OpenTelemetry like we do with it and OpenCensus if we start considering it now.

Not sure if Telemetry.Metrics would be at the API or SDK level of OpenTelemetry metrics, I guess that is the first thing I should figure out.

We can discuss more at the next Observability WG meeting which I plan to schedule for next week.

arkgil commented 4 years ago

@tsloughter thanks for these links, it's an interesting read. I see that the spec has diverged from OpenCensus histograms and summaries, and provides a way to record arbitrary events. Happy to discuss this at the meeting!

josevalim commented 4 years ago

So the current status in OpenTelemetry is that they are not tracking metrics per-se anymore. They are getting closer to Telemetry itself than Telemetry.Metrics. So I guess this is our call to make. @arkgil do you see any use for sum now that we have summary? Does any of the backends (stats, prometheus, newrelic, etc) have a use for sum? /cc @bryannaegele @deadtrickster @binaryseed

tsloughter commented 4 years ago

@josevalim hm? why do you say that?

josevalim commented 4 years ago

@tsloughter afaik, they removed the "summary" and "distribution" parts in favor of something that just emits the events (they do have something like counter+gauges though). Or have I misunderstood it?

tsloughter commented 4 years ago

@josevalim I'm pretty sure in the SDK it defines summaries https://github.com/open-telemetry/opentelemetry-specification/blob/6f3fbdbbad4a036f817ab6cb76d0b88e25fd9ae2/specification/sdk-metric.md#quantile-definition

The SDK isn't finalized yet, hopefully soon, so I'm not sure.

We should hop on the metrics call on January 2nd or have a call with Joshua who has written up the SDK.

tsloughter commented 4 years ago

That said, having to calculate as little as possible in BEAM would be best. If I was concerned about the amount of traffic increase in not aggregating I'd be all for pushing out all measurements and having an agent/collector do all the aggregation.

josevalim commented 4 years ago

Oh, I was reading the Metrics API bit and completely ignored the SDK ones. Yes, let's address it in the next meeting.

bryannaegele commented 4 years ago

Sum can be handled by Gauge in Prometheus, so it should be OK to remove for that use case.

The Summary metric has sum and count as the base requirement for implementation of that view but it's over a sliding time window. That is all I plan to support to start. I have enough concerns about performance of those calculations. I'll likely add support for doing quantiles but with a blaring warning and telemetry events for gauging impact.

tsloughter commented 4 years ago

You should check out the Observer OTEP as well https://github.com/open-telemetry/oteps/pull/72

Feedback from the people here would be great to help the final product work well for elixir/erlang :)

binaryseed commented 4 years ago

At this point, New Relic "dimensional metrics" support gauge (point-in-time), count (occurrence count of events) and summary (pre-aggregated min, max, avg, sum, etc)

The sum seems like it'd map directly to NewRelic's count and partially to the summary

https://docs.newrelic.com/docs/data-ingest-apis/get-data-new-relic/metric-api/report-metrics-metric-api#supported-metric-types

Side note, From my experience, pre-aggregating is an un-aviodable part of the whole picture. At high throughput, it's nearly impossible to get statistically accurate metrics w/o pre-aggregating because sampling becomes necessary. Also, doing a little math is always faster than a network call :)

arkgil commented 4 years ago

StatsD (including DataDog) both have sum as a part of another metric (DataDog "summary" and vanilla StatsD "timing"), so we should be fine on that end.

One concern I have, though, is that calculating only sum would be probably cheaper than having it computed along with a few other metrics. This might be even more important when we aggregate in process.

tsloughter commented 4 years ago

Not sure why there is need for separate backends? We should really be able to pool the work instead of duplicating it all over (opentelemetry, telemetry.metrics.prometheus/statsd, new relic).

I don't think breaking stuff out of OpenTelemetry if you find it has changed the API or become an abandoned project would be that bad, so I'd suggest taking a look, @bryannaegele, and we should discuss.

Especially since OpenTelemetry is an open project that isn't finished and looking for more input to guide the project forward. Meaning we can push for changes if they are needed.

.net is going an interested direction with this, sort of doing doing similar to telemetry and telemetry.metrics but based on working with OpenTelemetry. I'll try to find more details on that to better show what I mean.

mcrumm commented 4 years ago

Question: In light of the conversation in #68, should this be closed?

// @arkgil

arkgil commented 4 years ago

@mcrumm not sure about that yet, I want to bring this up at the today's Observability WG meeting.

jredville commented 4 years ago

Just FYI, related to #68 and the follow-up conversation with @arkgil in telemetry_metrics_statsd, I've just opened beam-telemetry/telemetry_metrics_statsd#30 which leans on the sum metric. If the previous meeting changes that direction, then I can update the PR accordingly

arkgil commented 4 years ago

Sorry, forgot to post an update. Let's close this issue for now, we won't be able to remove the sum metric until Prometheus reporter implements summary.

Also, it's not yet clear if we actually want to remove the sum completely.

Thank you all for your input here!