OpenObservability / OpenMetrics

Evolving the Prometheus exposition format into a standard.
https://openmetrics.io
Apache License 2.0

Support renaming metrics #189

Open ehashman opened 3 years ago

ehashman commented 3 years ago

Originally reported at https://github.com/prometheus/prometheus/issues/8579

Proposal

In projects like Kubernetes, we have thousands of contributors and metrics, and hence it is common for metrics that end users rely on to be created with inconsistent names. It is very difficult and disruptive to standardize/normalize metrics right now, as there is no means of renaming a metric in the OpenMetrics format. To work around this, we follow a multi-release deprecation cycle so that end users can update all their use cases. This is extremely painful for both the development cycle and end users, as it currently takes 3 full releases just to rename a metric: 2 overlapping releases supporting both metric names, plus 1 additional release for removal. Hence, there is a strong disincentive to fix metric naming issues, even if they would improve the user experience, because the work and technical overhead are so large in proportion to the benefits.
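For illustration, during the two overlapping releases an exporter has to expose the same value under both the old and the new name, roughly like this (hypothetical metric names):

# HELP old_widget_queue_size Number of widgets currently queued.
# TYPE old_widget_queue_size gauge
old_widget_queue_size 7
# HELP widget_queue_size Number of widgets currently queued.
# TYPE widget_queue_size gauge
widget_queue_size 7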

It would be a much better experience if there were a means of renaming metrics, so that clients could update metric names to something more reasonable without breaking the world.

If an exported metric looks like

# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 42

A renamed metric could perhaps look like

# HELP not_go_goroutines Number of goroutines that currently exist.
# TYPE not_go_goroutines gauge
# ALIAS not_go_goroutines go_goroutines
not_go_goroutines 42

(ht @RichiH https://github.com/prometheus/prometheus/issues/8579#issuecomment-796769353 for the ALIAS suggestion)

Clients could continue to query for either go_goroutines or not_go_goroutines interchangeably so long as this metadata flag was present at scrape. We would no longer need to worry about the overhead of duplicating metrics during a deprecation period. (I am not attached to this particular implementation, just throwing something out there for discussion.)

Why not rename the timeseries client-side?

Recording rules must be explicitly configured by clients. Clients must therefore know that the rename happened. This is very manual and labour-intensive. I would like a mechanism at the scrape point or on the server side to indicate a rename.
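As a rough sketch of that client-side workaround (not part of this proposal): once an exporter renames go_goroutines to not_go_goroutines, each user who wants the old name to keep working would have to add a Prometheus recording rule along these lines to their own rule files (the group name is made up):

groups:
  - name: rename-compat          # illustrative group name
    rules:
      # Re-create the old metric name from the new one so existing
      # dashboards and alerts keep working; every user must know about
      # the rename and maintain this rule themselves.
      - record: go_goroutines
        expr: not_go_goroutines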

brian-brazil commented 3 years ago

> Clients could continue to query for either go_goroutines or not_go_goroutines interchangeably so long as this metadata flag was present at scrape. We would no longer need to worry about the overhead of duplicating metrics during a deprecation period

This is not the case; it would have basically the same net effect and overheads as duplicating the metrics, with the additional cost that the observable samples no longer match what's in the format, making life harder for all implementors and downstream users.

I don't think this is something that we should be specifying in OpenMetrics, particularly as we should be looking to standardise existing practices and I'm not aware of any monitoring system out there that does this.

ehashman commented 3 years ago

> Clients could continue to query for either go_goroutines or not_go_goroutines interchangeably so long as this metadata flag was present at scrape. We would no longer need to worry about the overhead of duplicating metrics during a deprecation period

> This is not the case; it would have basically the same net effect and overheads as duplicating the metrics, with the additional cost that the observable samples no longer match what's in the format, making life harder for all implementors and downstream users.

Hi @brian-brazil, given what you've written here, I don't think you understand the use case.

Currently, here is how Kubernetes handles the deprecation of a metric when we want to give it a new name: we add the new name alongside the old one and serve both, duplicated, for two overlapping releases before the old name can finally be removed.

In order to rename metrics, we end up having to duplicate what gets served and scraped, because there is no mechanism to tell clients that the metric is the same but has been renamed. With something like the ALIAS metadata sketched above, we could instead serve only the new name and thus avoid duplicating metrics across two releases.

It's unclear to me why having a rename capability would have the same net effect and overhead as duplicating a metric, as you'd only need to serve/scrape the metric once, as opposed to the duplication described above. The scrape client could then maintain a lookup table for the multiple names, which should have relatively low overhead.
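For comparison, the closest existing mechanism to a rename at the scrape point is a manually written metric_relabel_configs entry in the Prometheus scrape configuration, which rewrites __name__ at ingest time. The sketch below (job name and target are made up) shows that manual version; the proposal would effectively let the scraper build this mapping automatically from # ALIAS metadata instead of each operator maintaining it by hand:

scrape_configs:
  - job_name: example-app              # illustrative job
    static_configs:
      - targets: ['app.example:8080']  # illustrative target
    metric_relabel_configs:
      # Store samples exposed as go_goroutines under the new name
      # not_go_goroutines at ingest time.
      - source_labels: [__name__]
        regex: go_goroutines
        target_label: __name__
        replacement: not_go_goroutines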

brian-brazil commented 3 years ago

> given what you've written here, I don't think you understand the use case.

I understand the use case.

> The scrape client could then maintain a lookup table for the multiple names, which should have relatively low overhead.

This is not the case. In TSDB terms it'd have essentially the same overhead as ingesting a duplicate metric, as it needs to be considered per time series and so needs an entry in the index, with all of those costs. Thus this has no real performance gain compared to a duplicate metric, and it still doesn't help with updating dashboards etc., while breaking the invariant that what's in the exposition is what's in the database, leading to confusion.

If you think this would be useful, I'd suggest mentioning it in the help string while also duplicating the metric, and seeing how that works out in practice for end users.

ehashman commented 3 years ago

> This is not the case. In TSDB terms it'd have essentially the same overhead as ingesting a duplicate metric, as it needs to be considered per time series and so needs an entry in the index, with all of those costs. Thus this has no real performance gain compared to a duplicate metric, and it still doesn't help with updating dashboards etc., while breaking the invariant that what's in the exposition is what's in the database, leading to confusion.

That sounds like an architectural/implementation issue to me, rather than technical infeasibility.

Furthermore, performance is not the only overhead to consider, because the current situation already suffers from the performance overhead of having two time series. We must also consider the human cost of expecting end users to be directly responsible for taking action on every metric rename.

> If you think this would be useful, I'd suggest mentioning it in the help string while also duplicating the metric, and seeing how that works out in practice for end users.

I don't think this is sufficient for the Kubernetes use case, as we have thousands of metrics. Adding it to HELP text would require users to manually look at metrics, and is no different from the status quo, where renaming burdens are pushed directly onto sysadmins. Adding support for automatic relabelling means that users do not have to think about this.

brian-brazil commented 3 years ago

> We must also consider the human cost of expecting end users to be directly responsible for taking action on every metric rename.

Your proposal doesn't change this; end users still need to rename everything by release 4. Accordingly, I think the tooling around this is where you should focus investigation into the utility of this feature.

debuglevel commented 1 year ago

It would really be nice to support something like # ALIAS. Prometheus exporters sometimes suffer from quite inconsistent metric naming. Supporting aliases would help encourage cleaning up metric names.

Version 0.1:

# TYPE app_unicorn counter
# HELP app_unicorn How many unicorns were spotted.
app_unicorn 12

Version 0.2:

# TYPE app_unicorns counter
# HELP app_unicorns How many unicorns were spotted.
# ALIAS app_unicorns app_unicorn 
app_unicorns 12

Version 1.0:

# TYPE app_unicorns_total counter
# HELP app_unicorns_total How many unicorns were spotted.
# ALIAS app_unicorns_total app_unicorns app_unicorn
app_unicorns_total 12

If, e.g., Prometheus recorded app_unicorn for some months from a Version 0.1 exporter which was then updated to Version 1.0, it would check for ALIASes and rename all app_unicorn series in its database to the new app_unicorns_total name. Another idea would be to not rename anything in the database, but simply return the same data for app_unicorn, app_unicorns and app_unicorns_total queries.

That would, of course, only work if you use a Prometheus version (for example) which already supports this feature. The time in between might be pretty annoying and confusing (e.g. there are already exporters using this feature, but you have not updated Prometheus for a year).