argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/
Apache License 2.0

Controller metrics: Add additional labels to time series from custom resources #3598

Open goredar opened 5 months ago

goredar commented 5 months ago

Summary

Hey! Firstly, thank you for the amazing project and all your efforts!

Currently, the only distinguishing label on most metrics is name (see the example below). This makes it hard to build aggregation queries in PromQL. It is also impossible to see how resources relate to each other (e.g. which Analysis Template an Analysis Run was built from).

# HELP analysis_run_info Information about analysis run.
# TYPE analysis_run_info gauge
analysis_run_info{name="service-6c3ba6ad58-1",namespace="argocd",phase="Successful"} 1

One possible solution would be to allow configurable label pass-through from the Kubernetes Custom Resources to the exposed metrics, for example:

analysis_run_info{app="app1",rollout="web-service",name="service-6c3ba6ad58-1",namespace="argocd",phase="Successful",template="app-error-rate"} 1

Use Cases

Imagine an application with several Rollouts: we'd like an overall deployment success rate metric across multiple Rollouts and Analysis Runs.

Thanks in advance for your cooperation!

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

chetan-rns commented 4 months ago

@goredar @zachaller I would recommend a separate metric for exposing labels that could be joined with the existing metrics. This approach would be similar to what we have in argocd. WDYT?

https://argo-cd.readthedocs.io/en/latest/operator-manual/metrics/#exposing-application-labels-as-prometheus-metrics

goredar commented 4 months ago

It's a good idea, actually. Thanks @chetan-rns!

nadavbuc commented 3 months ago

+1. I have this query for Deployments:

max(kube_deployment_labels{cluster="eks-dev",namespace="default",label_slack_contact!=""}) by (deployment, label_slack_contact) * on (deployment) group_left max (kube_deployment_status_condition{cluster="eks-dev",namespace="default",status="false"} > 0) by (deployment)

Having a similar metric for Rollout objects would be extremely helpful.