beam-telemetry / telemetry_metrics_prometheus_core

Core Prometheus Telemetry.Metrics Reporter package for telemetry_metrics_prometheus
Apache License 2.0
35 stars 30 forks source link

sum of float fails because of ets:update_counter/4 usage #35

Closed tudborg closed 9 months ago

tudborg commented 3 years ago

I have a sum over the amount of request duration from Plug.Telemetry more or less defined like this:

      sum(
        "http.request.duration.seconds.total",
        event_name: "phoenix.endpoint.stop",
        measurement: :duration,
        unit: {:native, :second}
      )

Telemetry.Metrics converts the native time integer to a float. I think the behaviour makes sense. The alternative of just using the output from System.convert_time_unit/3 would be that most of my measurements would be 0s because they are sub-second.

However, this fails because the Sum module's handle_event/4 is using ets:update_counter/4 which only works with integers. See line 61 of from: https://github.com/beam-telemetry/telemetry_metrics_prometheus_core/blob/cd8bea3df0c0b4c6aaad5c2b7d42760cbadf0c07/lib/core/sum.ex#L53-L67

I'm not quite sure how to attack this problem. If the conversion happened when rendering instead of when storing this wouldn't be an issue, but that only solves it for :native vs :second problem. in general I think it should be possible to sum any numeric. Not just integers.

bryannaegele commented 3 years ago

Hey @tbug. In order to support floats there would need to be a change to implement sums in the same manner that the distributions are handled as there is no way to support summing at the time of the call efficiently.

In your particular use case, you can just use the distribution which includes the sum as one of its aggregates. In the example you gave, you'd just look for the http_request_duration_seconds_total_sum. I realize it may not be ideal, but it will get you the value you're looking for and doesn't increase the system load any more than supporting floats in sum.

LostKobrakai commented 2 years ago

I'm wondering if instead of failing the value could be truncated before being used to update ets. That would make the feature usable. I'm not sure how worthwhile the usecase of unit: {:native, :second} actually is if measurements are in the sub-seconds.

narrowtux commented 2 years ago

I'm also having this issue, while I can work around this using distribution(), it exports 3 extra metrics per label which I don't need. In my use-case there would be about 1000 distinct label values so my concern is how this affects disk usage in prometheus.