actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

gha_job_execution_duration_seconds Prometheus metrics cannot be used for histograms #3555

Closed: thomassandslyst closed 4 months ago

thomassandslyst commented 4 months ago

Checks

Controller Version

0.9.2

Deployment Method

ArgoCD

To Reproduce

1. Install the gha-runner-scale-set controller and a listener and successfully run jobs.
2. Enable controller and listener metrics.
3. Try to create a histogram of startup or execution time using the standard Prometheus functions, such as histogram_quantile() over rate().

Describe the bug

The labels on both the gha_job_execution_duration_seconds and gha_job_startup_duration_seconds metrics mean that a new set of buckets is created for every run of every job, so every bucket will only ever contain a 0 or a 1. You cannot get meaningful information out of these metrics.

Prometheus needs rate() to be applied to each series before the series are aggregated, and a series that only ever records one observation never increases. With the current layout of these metrics, it is therefore impossible to produce a histogram of startup or execution durations.
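The effect can be illustrated with a small, self-contained Python sketch. It does not use the controller's code or a Prometheus client library. The bucket bounds, the sample runs, and the `observe` helper are all hypothetical and only mimic how a cumulative histogram keeps one counter per label set and bucket.

```python
from collections import defaultdict

# Hypothetical bucket upper bounds in seconds; the real metric's bounds may differ.
BUCKETS = (30.0, 60.0, 120.0, float("inf"))


def observe(series, labels, seconds):
    """Record one observation in the cumulative buckets of the series named by `labels`."""
    key = tuple(sorted(labels.items()))
    for le in BUCKETS:
        if seconds <= le:
            series[key][le] += 1


# Made-up job runs: each run lands on a fresh runner, so runner_name is unique per run.
runs = [
    {"job_name": "build", "runner_name": f"runner-{i}", "seconds": s}
    for i, s in enumerate([25.0, 45.0, 50.0, 90.0, 110.0])
]

per_run = defaultdict(lambda: defaultdict(int))
for run in runs:
    labels = {"job_name": run["job_name"], "runner_name": run["runner_name"]}
    observe(per_run, labels, run["seconds"])

series_count = len(per_run)
max_bucket_value = max(v for buckets in per_run.values() for v in buckets.values())
print(series_count, max_bucket_value)  # prints "5 1"
```

Five runs produce five separate series, and no bucket counter ever goes above 1, which matches the behaviour described above.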

Describe the expected behavior

The gha_job_execution_duration_seconds and gha_job_startup_duration_seconds metrics should have fewer labels to reduce cardinality.

Information should be put into buckets based on job_name, organisation, and repository only. Highly unique labels such as runner_id, runner_name, and job_workflow_ref should be removed.
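As a rough sketch of what the reduced label set would allow, the following Python snippet keys the buckets on job_name, organisation, and repository only. The bucket bounds, the sample values ("acme", "widgets", the durations), and the `observe` and `estimate_quantile` helpers are hypothetical. The quantile estimate uses linear interpolation within a bucket, similar in spirit to Prometheus's histogram_quantile(), and is not the controller's implementation.

```python
from collections import defaultdict

# Hypothetical bucket upper bounds in seconds; the real metric's bounds may differ.
BUCKETS = (30.0, 60.0, 120.0, float("inf"))
# Labels proposed in this issue; everything else is dropped.
KEEP = ("job_name", "organisation", "repository")


def observe(series, labels, seconds):
    """Record one observation, keyed only by the low-cardinality labels."""
    key = tuple(labels[name] for name in KEEP)
    for le in BUCKETS:
        if seconds <= le:
            series[key][le] += 1


def estimate_quantile(q, buckets):
    """Estimate quantile q by linear interpolation inside the matching bucket."""
    rank = q * buckets[float("inf")]
    lower_bound, lower_count = 0.0, 0
    for le in BUCKETS:
        count = buckets[le]
        if count >= rank:
            if le == float("inf") or count == lower_count:
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (le - lower_bound) * fraction
        lower_bound, lower_count = le, count
    return lower_bound


# Made-up runs of one job; runner_name differs per run but is not used as a key.
series = defaultdict(lambda: defaultdict(int))
for i, seconds in enumerate([25.0, 45.0, 50.0, 90.0, 110.0]):
    labels = {
        "job_name": "build",
        "organisation": "acme",
        "repository": "widgets",
        "runner_name": f"runner-{i}",
    }
    observe(series, labels, seconds)

build = series[("build", "acme", "widgets")]
print(len(series), build[30.0], build[60.0], build[120.0])  # prints "1 1 3 5"
print(estimate_quantile(0.5, build))  # prints "52.5"
```

All five runs now accumulate in a single series, so the bucket counters grow over time and a median can be estimated from them.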

Additional Context

N/A

Controller Logs

It says never to omit but this issue doesn't relate to controller logs.
N/A

Runner Pod Logs

N/A

thomassandslyst commented 4 months ago

https://github.com/actions/actions-runner-controller/issues/3153 relates to this, but that issue is from a performance perspective and this one is from a usability perspective.

nikola-jokic commented 4 months ago

Hey @thomassandslyst,

As you pointed out, this is the same issue. The issue you submitted provides more context for why we need to change it, but it does not represent a different problem.

Closing this one.