actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

Removes high-cardinality labels from histogram metrics #3556

Open thomassandslyst opened 6 months ago

thomassandslyst commented 6 months ago

This is to solve https://github.com/actions/actions-runner-controller/issues/3153

This removes the runner_id, runner_name, and job_workflow_ref labels from the job_startup_duration_seconds and job_execution_duration_seconds metrics. Dropping them reduces cardinality and allows usable histograms to be produced from these metrics, with the idea that startup and execution data will be stored in "per repo + workflow" buckets.

I'm unsure whether removing labelKeyJobWorkflowRef from jobLabels is suitable, or whether this should be reworked further to come up with more suitable label lists.

thomassandslyst commented 4 months ago

Any nudge on this? Is there anything you'd like me to do to get this sorted?

wwalters12 commented 4 months ago

I would love to see this change; it would make tracking workflow execution times in Grafana much easier 🙏

mikespharss commented 1 month ago

Plus one, please review and accept this PR. Emitting high-cardinality metrics like this is explicitly discouraged by the prometheus/client_golang maintainers. Because every new label value creates another series that stays in memory, this also effectively leads to unbounded memory growth unless pods are restarted.

https://github.com/prometheus/client_golang/issues/748
https://github.com/prometheus/client_golang/discussions/920