Closed: thomassandslyst closed this issue 6 months ago.
https://github.com/actions/actions-runner-controller/issues/3153 relates to this issue, but that issue approaches the problem from a performance perspective, while this one approaches it from a usability perspective.
Hey @thomassandslyst,
As you pointed out, this is the same issue. The issue you submitted provides more context for why we need to change it, but it does not represent a different problem.
Closing this one.
Checks
Controller Version
0.9.2
Deployment Method
ArgoCD
Checks
To Reproduce
Describe the bug
The labels on both the gha_job_execution_duration_seconds and gha_job_startup_duration_seconds metrics are unique to each run, so a new set of buckets is created for every run of every job. As a result, every bucket only ever contains a 0 or a 1, and no meaningful information can be extracted from these metrics.
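The bucket behavior described above can be reproduced with a small stdlib-only simulation of a labelled, cumulative histogram. This is a sketch, not ARC's code: the bucket bounds, run data, and label names (taken from the wording of this issue) are hypothetical, and in this example it is runner_id and runner_name that make every run's label set unique.

```python
# Hypothetical bucket upper bounds in seconds; not ARC's actual configuration.
BUCKETS = [1, 5, 10, 30, 60, 120, float("inf")]

# The first label set mirrors the per-run labels named in this issue;
# the second is the reduced set proposed under the expected behavior.
CURRENT_LABELS = ("organisation", "repository", "job_name",
                  "runner_id", "runner_name", "job_workflow_ref")
PROPOSED_LABELS = ("organisation", "repository", "job_name")

# Four hypothetical runs of the same job, each on a fresh ephemeral runner.
RUNS = [
    {"organisation": "acme", "repository": "app", "job_name": "build",
     "runner_id": str(100 + i), "runner_name": f"runner-{i}",
     "job_workflow_ref": "acme/app/.github/workflows/ci.yml@refs/heads/main",
     "duration": d}
    for i, d in enumerate([3.0, 8.0, 25.0, 90.0])
]


def build(runs, label_names):
    """Return {label tuple: cumulative bucket counts}, one entry per series."""
    series = {}
    for run in runs:
        key = tuple((name, run[name]) for name in label_names)
        counts = series.setdefault(key, [0] * len(BUCKETS))
        for i, upper in enumerate(BUCKETS):
            # Cumulative buckets, like Prometheus "le" buckets.
            if run["duration"] <= upper:
                counts[i] += 1
    return series


for labels in (CURRENT_LABELS, PROPOSED_LABELS):
    series = build(RUNS, labels)
    print(len(series), "series:", list(series.values()))
```

With the current label set the four runs produce four series whose buckets hold only 0 or 1; with the reduced set they fall into a single series with cumulative counts [0, 1, 2, 3, 3, 4, 4].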
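Prometheus is unable to aggregate metrics before applying rate() to them, so rate() has to be evaluated on each per-run series individually. Because each of those series is created already holding its final count, rate() generally sees no increase within it. With the current layout of these metrics it is therefore impossible to produce a histogram of startup or execution durations.
Describe the expected behavior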
The gha_job_execution_duration_seconds and gha_job_startup_duration_seconds metrics should have fewer labels so as to reduce cardinality.
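Why fewer labels fixes the rate() problem can be sketched with a simplified model of Prometheus's increase(): the last sample minus the first sample inside the window, ignoring counter resets and the extrapolation Prometheus actually applies. The scrape interval, completion times, and prior count below are hypothetical, and the sketch assumes the aggregated series already exists from earlier runs (a brand-new series would miss its first observation in the same way).

```python
SCRAPE_TIMES = list(range(0, 301, 15))  # hypothetical 15 s scrape interval
COMPLETIONS = [50, 130, 210]            # hypothetical job completion times (s)
PRIOR_COUNT = 10                        # assumed count from earlier runs


def increase(samples, start, end):
    """Simplified increase(): last minus first sample in [start, end].

    Series with fewer than two samples in the window contribute 0.
    """
    window = [v for t, v in samples if start <= t <= end]
    if len(window) < 2:
        return 0
    return window[-1] - window[0]


def per_run_series():
    """One series per run, born already holding its final count of 1."""
    return [[(t, 1) for t in SCRAPE_TIMES if t >= done] for done in COMPLETIONS]


def aggregated_series():
    """One long-lived series whose count grows as runs complete."""
    return [(t, PRIOR_COUNT + sum(1 for done in COMPLETIONS if done <= t))
            for t in SCRAPE_TIMES]


per_run_total = sum(increase(s, 0, 300) for s in per_run_series())
aggregated_total = increase(aggregated_series(), 0, 300)
print(per_run_total, aggregated_total)  # 0 3
```

Summing the per-run increases reports no new observations, while the single long-lived series reports all three runs that completed inside the window.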
Information should be put into buckets based on job_name, organisation, and repository only. Highly unique labels such as runner_id, runner_name, and job_workflow_ref should be removed.
Additional Context
Controller Logs
Runner Pod Logs