actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

ghalistener high cardinality metrics #3670

Open christophermichaeljohnston opened 4 months ago

christophermichaeljohnston commented 4 months ago

Checks

Controller Version

0.9.2

Deployment Method

Helm

To Reproduce

Every action scheduled by ghalistener uses a new runner, which creates a new metric for every single action. This is because the metrics include the runner_id and runner_name labels, which are distinct for every run. For example:

gha_completed_jobs_total{<snip>,runner_id="71363",runner_name="self-hosted-linux-x64-zfhfn-runner-k752n"} 1
gha_completed_jobs_total{<snip>,runner_id="71369",runner_name="self-hosted-linux-x64-zfhfn-runner-pr56c"} 1
gha_completed_jobs_total{<snip>,runner_id="71376",runner_name="self-hosted-linux-x64-zfhfn-runner-qns9x"} 1

The <snip> labels above are identical for the same workflow, but there is a new metric for each action due to runner_id and runner_name being unique.

This also causes memory and CPU usage to creep up continually, because the listener must keep track of all these metrics even though, due to the unique labels, it will never update them.
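To illustrate the growth described above, here is a minimal sketch in plain Python (not the listener's actual code). It models a counter family as a mapping from label set to value. The `job_name` label is a hypothetical stand-in for the `<snip>` labels, and the runner names are made up for the simulation.

```python
from collections import Counter


def record_jobs(num_jobs, include_runner_labels):
    """Simulate a counter family: one entry per distinct label set."""
    series = Counter()
    for i in range(num_jobs):
        # Hypothetical stand-in for the identical <snip> labels.
        labels = {"job_name": "build"}
        if include_runner_labels:
            # Each job gets a fresh runner, so these values never repeat.
            labels["runner_id"] = str(71363 + i)
            labels["runner_name"] = f"runner-{i}"
        series[tuple(sorted(labels.items()))] += 1
    return series


with_runner = record_jobs(1000, include_runner_labels=True)
without_runner = record_jobs(1000, include_runner_labels=False)

# Number of series held in memory, and the largest counter value.
print(len(with_runner), max(with_runner.values()))        # 1000 1
print(len(without_runner), max(without_runner.values()))  # 1 1000
```

With the runner labels, the number of entries grows by one per job and no entry ever exceeds 1. Without them, a single entry carries the real count.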

Describe the bug

^ see above

This was fixed in githubrunnerscalesetlistener in #3003 and the fix needs to be included in ghalistener.

Describe the expected behavior

Metrics should not include labels that are unique per run, as this causes high cardinality and renders the counters, which will only ever have a value of 1, unusable.
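As a rough sketch of what the expected output would look like, the snippet below takes sample lines shaped like the ones in this report, drops runner_id and runner_name, and sums what remains. It is only an illustration in plain Python: the `aggregate` helper is hypothetical, and `job_name="build"` is an assumed placeholder for the `<snip>` labels.

```python
import re
from collections import defaultdict

LABEL_RE = re.compile(r'(\w+)="([^"]*)"')
DROP = {"runner_id", "runner_name"}  # labels that are unique per run


def aggregate(lines, drop=DROP):
    """Sum sample values after removing the per-run labels."""
    totals = defaultdict(float)
    for line in lines:
        head, value = line.rsplit(" ", 1)
        name, _, label_text = head.partition("{")
        kept = tuple(sorted(
            (k, v) for k, v in LABEL_RE.findall(label_text) if k not in drop
        ))
        totals[(name, kept)] += float(value)
    return dict(totals)


# job_name="build" is an assumed placeholder for the <snip> labels.
samples = [
    'gha_completed_jobs_total{job_name="build",runner_id="71363",runner_name="self-hosted-linux-x64-zfhfn-runner-k752n"} 1',
    'gha_completed_jobs_total{job_name="build",runner_id="71369",runner_name="self-hosted-linux-x64-zfhfn-runner-pr56c"} 1',
    'gha_completed_jobs_total{job_name="build",runner_id="71376",runner_name="self-hosted-linux-x64-zfhfn-runner-qns9x"} 1',
]

result = aggregate(samples)
for (name, labels), total in result.items():
    print(name, dict(labels), total)
```

The three series collapse into one with a value of 3. On the query side, `sum without (runner_id, runner_name) (gha_completed_jobs_total)` in PromQL gives a similar view, but it does not reduce what the listener has to hold in memory.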

Additional Context

n/a

Controller Logs

n/a

Runner Pod Logs

n/a

github-actions[bot] commented 4 months ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

christophermichaeljohnston commented 4 months ago

[Image: Screenshot 2024-07-19 at 9 14 02 AM]

^^ ghalistener memory and CPU usage increase caused by tracking high-cardinality metrics

iwaffles commented 2 months ago

Related to #3153

christophermichaeljohnston commented 4 weeks ago

The attached PR also removes job_workflow_ref, which likewise causes horribly high cardinality.