beam-telemetry / telemetry_metrics_prometheus_core

Core Prometheus Telemetry.Metrics Reporter package for telemetry_metrics_prometheus
Apache License 2.0

No option to delete value related to specific set of tags in ETS table #46

Open FelonEkonom opened 2 years ago

FelonEkonom commented 2 years ago

There is no way to delete an existing entry in the ETS table. For example, if I have a sum metric with some tags, I cannot remove the value associated with a specific set of tag values. As a result, the reports generated during scrapes can only grow, and values that are no longer needed can never be removed from them.

bryannaegele commented 2 years ago

That is expected behavior for Prometheus. If you're running into size issues, that is an indication that your tags have too much cardinality.

FelonEkonom commented 2 years ago

Let's assume I have a system with many jobs running inside it. Every job has its own lifetime, and I want a tool that helps me aggregate metrics about these jobs. In this case, the job id would be the tag I group metrics by. In a system like this, you don't want metrics about obsolete, finished jobs in the reports generated during scrapes. That is why I think an option to delete the metrics of a job that is ending would be a great idea. Also, in this case the tag cardinality does not come from bad system design; it grows naturally over the lifetime of the whole system as jobs start and end.

bryannaegele commented 2 years ago

Prometheus is simply not the right tool for the requirements you're describing. Prometheus creates a time series for every combination of a metric's attribute values, and those are stored in the Prometheus server for whatever retention period the storage is configured with.
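To make the cardinality point concrete, here is a minimal sketch in Python (used only for illustration; the library itself is Elixir). The `series` table, the `observe` helper, and the metric and label names are all hypothetical and are not part of this library or of Prometheus. It only shows that a per-job label produces one series per job.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for a reporter's aggregation table:
# one entry per (metric name, sorted label pairs).
series = defaultdict(float)


def observe(metric, labels, value):
    """Add `value` to the series identified by the metric name and label values."""
    key = (metric, tuple(sorted(labels.items())))
    series[key] += value


# 1000 short-lived jobs, each reporting under its own job_id label value.
for job_id in range(1000):
    observe("job_bytes_processed", {"job_id": str(job_id), "status": "ok"}, 1.0)

# Every distinct job_id created a separate series that stays in the table.
series_count = len(series)
print(series_count)
```

Dropping the `job_id` label in this sketch would collapse everything into a single series, which is the trade-off behind the tracing suggestion below.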

I think for the use case you're describing you would be better served with tracing where cardinality in attributes is not a concern and you can get insights on multiple operations by a common attribute+value, in your case a job id.

https://github.com/open-telemetry/opentelemetry-erlang combined with Lightstep, Honeycomb, Zipkin, Grafana, etc. would better fit your requirements. If you want more help or opinions, you can get a lot of help in the #opentelemetry channel in the Elixir Slack.

hairyhum commented 1 year ago

It's true that Prometheus stores everything, but it still has a retention time in the server configuration, so by default old time series are removed after 15 days. This reporter implementation has no such retention time and keeps reporting old time series on every scrape. This means that old time series, which Prometheus could otherwise have removed already, keep getting updated unnecessarily. Some sort of cleanup on the reporter side would be helpful, whether it's a delete function or a retention time.
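The two cleanup options mentioned above could be sketched roughly as follows. This is a hypothetical Python illustration, not this library's API: the class `ExpiringSeriesStore`, its methods, and the `ttl_seconds` parameter are assumptions made up for the example, and a real implementation would operate on the reporter's ETS table in Elixir.

```python
import time


class ExpiringSeriesStore:
    """Hypothetical aggregation store with explicit delete and time-based expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}   # key -> (value, time of last update)

    @staticmethod
    def _key(metric, labels):
        return (metric, tuple(sorted(labels.items())))

    def add(self, metric, labels, value):
        """Sum-style update: accumulate the value and refresh the timestamp."""
        key = self._key(metric, labels)
        current, _ = self._entries.get(key, (0.0, None))
        self._entries[key] = (current + value, self._clock())

    def delete(self, metric, labels):
        """Option 1: remove the value for one specific set of label values."""
        return self._entries.pop(self._key(metric, labels), None) is not None

    def scrape(self):
        """Option 2: drop series not updated within the TTL, then report the rest."""
        now = self._clock()
        expired = [k for k, (_, ts) in self._entries.items() if now - ts > self._ttl]
        for key in expired:
            del self._entries[key]
        return {k: v for k, (v, _) in self._entries.items()}
```

With a fake clock, a series last updated 70 seconds ago disappears from a scrape under a 60 second TTL, while one updated 40 seconds ago is still reported. Expiring on scrape keeps the sketch simple; a periodic sweep would be another reasonable design.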

Rados13 commented 7 months ago

Hi @bryannaegele, what do you think about a suggestion from @hairyhum?

bryannaegele commented 7 months ago

I'm fine with that if someone wants to submit a PR for an expiration setting, but I am not personally adding features to this library at this time.