Metric and label naming

lucazulian commented 2 years ago

Using prom_ex in all application in a company could be raise the problem about metrics and labels. The problem raises, for example, with multiple application with the same metric semantic but with different metric name, i.e.:

for my_cool_app -> my_cool_app_application_dependency_info{modules="69",name="hex",version="0.20.5"} 1
for my_other_cool_app -> my_other_cool_app_application_dependency_info{modules="69",name="hex",version="0.20.5"} 1

This causes an explosion of metrics with the same semantic.

Based on this style guide I propose to have a configuration with which to choose to move to a different way of metric naming. With the previous example this would be:

for my_cool_app -> application_dependency_info{modules="69",name="hex",version="0.20.5",application="my_cool_app"} 1
for my_other_cool_app -> application_dependency_info{modules="69",name="hex",version="0.20.5",application="my_other_cool_app"} 1

This would be a breaking change, but this should be managed with an opt-in configuration.

I would like to work on it if this could be interesting.

akoutmos commented 2 years ago

Hey Luca and thanks for reaching out!

Let me address a couple of things in your issue description prior to giving you my thoughts on this feature request (if I follow correctly that is haha).

This causes an explosion of metrics with the same semantic.

Prometheus will create a new time-series for a unique combination of metric names and their unique key-value label values. As a result, regardless of whether the application name is in the label or in the name, the same number of time-series will be created in Prometheus.

Based on this style guide I propose to have a configuration with which to choose to move to a different way of metric naming.

I remember reading this style guide several times over prior to writing a single line of code for PromEx...and my take away from it is perhaps a little different from yours it seems. When I read: "A metric name should have a (single-word) application prefix relevant to the domain the metric belongs to.", my feeling is that the "domain" is your application and not the PromEx plugin.

Another piece of literature that guided my decision (that I think corroborates the previous sentiment) was from the Robust Perception blog who I consider to be leaders in the monitoring space:

A metric named http_requests_total is close to useless. Is that HTTP requests as they enter the HTTP code, at the auth middleware, in the RPC subsystem or when the hit your code? That's key information when interpreting the metric.

http_server_requests_total, auth_http_requests_total, rpc_server_http_requests_total, and businesslogic_requests_total would be some better options. Keep in mind that while these all measure HTTP requests, as they're at different parts of the call stack it'd be incorrect to combine these into one metric.

With that being said, I am hesitant (not against 100% mind you) to support this kind of option in PromEx. One of the reasons that I am hesitant is that the Grafana dashboard support would effectively double as each PromQL query would need two versions (otp_app in label versus otp_app in metrics name). This becomes a maintenance nightmare since updating a Grafana dashboard JSON is for more involved now as opposed to doing 10 global finds and replaces that I currently do. In addition, I am reluctant to adding additional configuration to the plugins and dashboards that have a long reaching impact. It would be possible (and I lean towards this most of the time with other requests such as this), to copy the plugins and dashboards that you need, adjust them to your liking, and then still use mainline PromEx to load your version of the plugin and upload your version of the dashboard.

Sorry to be long winded, but this is a big change and I want to make sure that it is thoroughly discussed prior to me committing to anything. If you do want to create a PoC with this I would be happy to look it over and provide guidance. Off the top of my head, you would need to do the following:

Update each existing plugin to have the otp_app as a label versus part of the metric name
Update all of the dashboards so that they can query the appropriate time-series depending on configuration
Add a global dashboard assign to toggle between the dashboard render mode

Hope that helps and looking forward to your response :)

lucazulian commented 2 years ago

Hi Alexander, thank you so much for the answer. First of all, I would like to thank you for this library, it reduces a lot of the time used to instrument our applications and simplify monitoring stuff.

I agree with you that regardless of whether the application name is in the label or in the name, the same number of time-series will be created in Prometheus, we consider this as its "internal" optimization.

We proposed this for different reasons:

reduce the number of metrics with a high number of applications sharing the same metrics (we give a high value on metrics semantic);
share the same label between different software (grafana - tempo - prometheus).

I agree with you that this would become a maintenance nightmare supporting two "different styles", we adopt as metric naming convention what stated in Metric and label naming:

For metrics specific to an application, the prefix is usually the application name itself

where we consider application the namespace (i.e.: ecto) and not the current app (specified in the mix.exs) that group all those extra_applications.

What inspires us to this way is this example in instrumentation:

For example, rather than http_responses_500_total and http_responses_403_total, create a single metric called http_responses_total with a code label for the HTTP response code. You can then process the entire metric as one in rules and graphs.

I think that before starting with the PoC you can imagine the result, which would be to make sure that it is thoroughly discussed with pros and cons about those two styles and maybe asking help directly to Prometheus developer in order to have some guidance about this issue.

Thank you for your time and support :pray:

akoutmos commented 2 years ago

I haven't forgotten about this issue :P. Still pondering on it and trying to think about how to proceed.

lucazulian commented 2 years ago

No problem, I understand that it's not a simple issue to reason about, if I can help in some ways don't hesitate to contact me. :pray:

lucazulian commented 2 years ago

I found this article about label cardinality, maybe this can help with this discussion

akoutmos commented 2 years ago

I think i have a good way to do this without making it difficult to maintain. I scheduled it for release 1.9. If you want to tackle this, lmk and we can sync up as to what I have in mind.

lucazulian commented 2 years ago

Yep, I would like to tackle this issue, thank you!

akoutmos commented 2 years ago

Awesome! That is much appreciated. When I have a moment in the next couple days, I'll jot down some notes here as that what I think could help move this feature along :).

akoutmos commented 2 years ago

Alrighty, as promised, here are some of my thoughts as to how I would possibly approach this problem (feel free to deviate from these suggestions as they are just suggestions 😄 ):

Add an additional default dashboard assign that specifies where the otp_app is in the time-series datapoint. Perhaps a config like otp_app_location: :name (default) and otp_app_location: {:label, :label_name} if you want the otp_app as a label. Perhaps a value of otp_app_location: :label can default to having :application be the default label name.
Add an additional option when users use PromEx to support the previous bullet point.
When the :otp_app_location option is set to {:label, :label_name}, go through all of the plugins and edit the Telemetry Metrics structs so that the :tag_values anonymous functions are wrapped in another anonymous function that takes the first result and appends the otp_app value (you would also need to edit the :tags list to include the :label_name value). Take a look at the docs for more info if you need it.
When the plugins are all being initialized, each plugin should receive a list of options that has a :metric_prefix of the name of the plugin as opposed to the otp_app + promex + plugin. This may warrant introducing an init/1 callback for the PromEx.Plugin behaviour so that this doesn't get computed with all of the metric type callbacks.
The Grafana dashboards should have their PromQL definitions extracted from the *.json.eex templates and instead put in separate modules where the functions in those modules return the correct PromQL query depending on the assigns that are provided (those assigns containing the :otp_app_location value so that you know which PromQL query to return).

Hopefully that helps guide you in the right direction. LMK if you have any questions or suggestions 😄!

HarshBalyan commented 2 years ago

@lucazulian @akoutmos any update on this one?

lucazulian commented 2 years ago

@lucazulian @akoutmos any update on this one?

Hey, @HarshBalyan, thanks for your interest. I'm so sorry, but I'm still working on this as it would be a not so easy issue.

lairtonlelis commented 2 years ago

If anyone else is interested in renaming their metric prefixes, this is how I did it:

I had to add a metric_prefix option to each plugin defined on my prom_ex.ex file's plugins function. So if you have the Phoenix plugin, for example, it would be something like this:
```
{Plugins.Phoenix, metric_prefix: [:my, :custom, :prefix, :here]},
```
I also had to add a custom prefix to my dashboard_assigns function. For the Phoenix example, it would be something like this:
```
phoenix_metric_prefix: ["my", "custom", "prefix", "here"],
```
This option is listed on all default plugins' docs: https://hexdocs.pm/prom_ex/PromEx.Plugins.Phoenix.html#module-plugin-options

I was also able to use a blank prefix on metrics by adding metric_prefix: [:""], which basically removes the prefixes. However, this doesn't work for the Grafana dashboards, as the generated metrics have a single _ as a prefix if you set them to blank.

tsloughter commented 1 year ago

If you are going to consider changes to the metric names (I know here it is only about the prefix, but just in case) it'd be great to collaborate. I was thinking it'd be fortunate if prom_ex's dashboards worked with OpenTelemetry metrics (https://opentelemetry.io/) by default and was looking at the metric names used as we prepare to define the metrics exported by OpenTelemetry.

See https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/runtime-environment-metrics.md for an example of how these get defined, right now it just has the JVM.

It may just be that we use your existing metric names (minus the app name prefix), in which case we'll be compatible except for the prefix, so thought this issue was a good place to put this.

akoutmos / prom_ex

Metric and label naming #115