akoutmos / prom_ex

An Elixir Prometheus metrics collection library built on top of Telemetry with accompanying Grafana dashboards
MIT License
602 stars 104 forks source link

Metric and label naming #115

Open lucazulian opened 2 years ago

lucazulian commented 2 years ago

Using prom_ex in all application in a company could be raise the problem about metrics and labels. The problem raises, for example, with multiple application with the same metric semantic but with different metric name, i.e.:

This causes an explosion of metrics with the same semantic.

Based on this style guide I propose to have a configuration with which to choose to move to a different way of metric naming. With the previous example this would be:

This would be a breaking change, but this should be managed with an opt-in configuration.

I would like to work on it if this could be interesting.

akoutmos commented 2 years ago

Hey Luca and thanks for reaching out!

Let me address a couple of things in your issue description prior to giving you my thoughts on this feature request (if I follow correctly that is haha).

This causes an explosion of metrics with the same semantic.

Prometheus will create a new time-series for a unique combination of metric names and their unique key-value label values. As a result, regardless of whether the application name is in the label or in the name, the same number of time-series will be created in Prometheus.

Based on this style guide I propose to have a configuration with which to choose to move to a different way of metric naming.

I remember reading this style guide several times over prior to writing a single line of code for PromEx...and my take away from it is perhaps a little different from yours it seems. When I read: "A metric name should have a (single-word) application prefix relevant to the domain the metric belongs to.", my feeling is that the "domain" is your application and not the PromEx plugin.

Another piece of literature that guided my decision (that I think corroborates the previous sentiment) was from the Robust Perception blog who I consider to be leaders in the monitoring space:

A metric named http_requests_total is close to useless. Is that HTTP requests as they enter the HTTP code, at the auth middleware, in the RPC subsystem or when the hit your code? That's key information when interpreting the metric.

http_server_requests_total, auth_http_requests_total, rpc_server_http_requests_total, and businesslogic_requests_total would be some better options. Keep in mind that while these all measure HTTP requests, as they're at different parts of the call stack it'd be incorrect to combine these into one metric.

With that being said, I am hesitant (not against 100% mind you) to support this kind of option in PromEx. One of the reasons that I am hesitant is that the Grafana dashboard support would effectively double as each PromQL query would need two versions (otp_app in label versus otp_app in metrics name). This becomes a maintenance nightmare since updating a Grafana dashboard JSON is for more involved now as opposed to doing 10 global finds and replaces that I currently do. In addition, I am reluctant to adding additional configuration to the plugins and dashboards that have a long reaching impact. It would be possible (and I lean towards this most of the time with other requests such as this), to copy the plugins and dashboards that you need, adjust them to your liking, and then still use mainline PromEx to load your version of the plugin and upload your version of the dashboard.

Sorry to be long winded, but this is a big change and I want to make sure that it is thoroughly discussed prior to me committing to anything. If you do want to create a PoC with this I would be happy to look it over and provide guidance. Off the top of my head, you would need to do the following:

Hope that helps and looking forward to your response :)

lucazulian commented 2 years ago

Hi Alexander, thank you so much for the answer. First of all, I would like to thank you for this library, it reduces a lot of the time used to instrument our applications and simplify monitoring stuff.

I agree with you that regardless of whether the application name is in the label or in the name, the same number of time-series will be created in Prometheus, we consider this as its "internal" optimization.

We proposed this for different reasons:

I agree with you that this would become a maintenance nightmare supporting two "different styles", we adopt as metric naming convention what stated in Metric and label naming:

For metrics specific to an application, the prefix is usually the application name itself

where we consider application the namespace (i.e.: ecto) and not the current app (specified in the mix.exs) that group all those extra_applications.

What inspires us to this way is this example in instrumentation:

For example, rather than http_responses_500_total and http_responses_403_total, create a single metric called http_responses_total with a code label for the HTTP response code. You can then process the entire metric as one in rules and graphs.

I think that before starting with the PoC you can imagine the result, which would be to make sure that it is thoroughly discussed with pros and cons about those two styles and maybe asking help directly to Prometheus developer in order to have some guidance about this issue.

Thank you for your time and support :pray:

akoutmos commented 2 years ago

I haven't forgotten about this issue :P. Still pondering on it and trying to think about how to proceed.

lucazulian commented 2 years ago

No problem, I understand that it's not a simple issue to reason about, if I can help in some ways don't hesitate to contact me. :pray:

lucazulian commented 2 years ago

I found this article about label cardinality, maybe this can help with this discussion

akoutmos commented 2 years ago

I think i have a good way to do this without making it difficult to maintain. I scheduled it for release 1.9. If you want to tackle this, lmk and we can sync up as to what I have in mind.

lucazulian commented 2 years ago

Yep, I would like to tackle this issue, thank you!

akoutmos commented 2 years ago

Awesome! That is much appreciated. When I have a moment in the next couple days, I'll jot down some notes here as that what I think could help move this feature along :).

akoutmos commented 2 years ago

Alrighty, as promised, here are some of my thoughts as to how I would possibly approach this problem (feel free to deviate from these suggestions as they are just suggestions πŸ˜„ ):

Hopefully that helps guide you in the right direction. LMK if you have any questions or suggestions πŸ˜„!

HarshBalyan commented 2 years ago

@lucazulian @akoutmos any update on this one?

lucazulian commented 2 years ago

@lucazulian @akoutmos any update on this one?

Hey, @HarshBalyan, thanks for your interest. I'm so sorry, but I'm still working on this as it would be a not so easy issue.

lairtonlelis commented 2 years ago

If anyone else is interested in renaming their metric prefixes, this is how I did it:

I was also able to use a blank prefix on metrics by adding metric_prefix: [:""], which basically removes the prefixes. However, this doesn't work for the Grafana dashboards, as the generated metrics have a single _ as a prefix if you set them to blank.

tsloughter commented 1 year ago

If you are going to consider changes to the metric names (I know here it is only about the prefix, but just in case) it'd be great to collaborate. I was thinking it'd be fortunate if prom_ex's dashboards worked with OpenTelemetry metrics (https://opentelemetry.io/) by default and was looking at the metric names used as we prepare to define the metrics exported by OpenTelemetry.

See https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/runtime-environment-metrics.md for an example of how these get defined, right now it just has the JVM.

It may just be that we use your existing metric names (minus the app name prefix), in which case we'll be compatible except for the prefix, so thought this issue was a good place to put this.