actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0
4.63k stars 1.1k forks source link

gha_job_queue_duration_seconds metric missing #3423

Closed laban-eilers-simplisafe closed 5 months ago

laban-eilers-simplisafe commented 5 months ago

Checks

Controller Version

0.9.0

Deployment Method

Helm

Checks

To Reproduce

1. Enable metrics in the gha-runner-scale-set-controller helm chart and install
2. Create a test pod with curl installed in the same cluster as arc, and shell into it with kubectl exec
3. Get the IP of any listener pod
4. Run `curl http://<pod-ip>:8080/metrics | grep gha_job_startup_duration_seconds`: validate that you see metrics emitted
5. Run `curl http://<pod-ip>:8080/metrics | grep gha_job_queue_duration_seconds`: no results

Describe the bug

The histogram metric gha_job_queue_duration_seconds, which is documented in the official Github ARC docs, is not being emitted by listener pods.

I've validated that the other documented metrics in the same family are, in fact emitted, including gha_job_startup_duration_seconds and gha_job_execution_duration_seconds.

I can see the metric name exists in the code, but its implementation is commented out: https://github.com/actions/actions-runner-controller/blob/d3a8a34bb22ed8c9d1c91b6047854366bedad1bf/cmd/githubrunnerscalesetlistener/metrics.go#L166

There's no specific explanation of why it's commented in the code or in the original PR, so I wasn't sure if a PR to un-comment it would be welcome.

Describe the expected behavior

I would expect the metric gha_job_queue_duration_seconds to be emitted from listener pods, per the documentation.

If the metric has been disabled for a reason, I'd expect that reason to be commented in the source code, so I would have some context to submit a PR to fix it.

Additional Context

Values file:

replicaCount: 1
flags:
  watchSingleNamespace: actions-runners
metrics:
  controllerManagerAddr: ":8080"
  listenerAddr: ":8080"
  listenerEndpoint: "/metrics"

### Controller Logs

```shell
NOTE: I don't think these logs are relevant, but here's a snippet:

2024-04-09T16:33:46Z    INFO    EphemeralRunnerSet  Scaling comparison  {"ephemeralrunnerset": {"name":"firmware-xl7pg","namespace":"actions-runners"}, "current": 3, "desired": 3}
2024-04-09T16:33:46Z    INFO    EphemeralRunnerSet  Updating status with current runners count  {"ephemeralrunnerset": {"name":"firmware-xl7pg","namespace":"actions-runners"}, "count": 3}
2024-04-09T16:33:46Z    INFO    EphemeralRunner Ephemeral runner container is still running {"ephemeralrunner": {"name":"firmware-xl7pg-runner-hd44j","namespace":"actions-runners"}}
2024-04-09T16:33:46Z    INFO    EphemeralRunnerSet  Ephemeral runner counts {"ephemeralrunnerset": {"name":"firmware-xl7pg","namespace":"actions-runners"}, "pending": 1, "running": 2, "finished": 0, "failed": 0, "deleting": 0}
2024-04-09T16:33:46Z    INFO    EphemeralRunnerSet  Scaling comparison  {"ephemeralrunnerset": {"name":"firmware-xl7pg","namespace":"actions-runners"}, "current": 3, "desired": 3}
2024-04-09T16:33:46Z    INFO    AutoscalingRunnerSet    Find existing ephemeral runner set  {"autoscalingrunnerset": {"name":"firmware","namespace":"actions-runners"}, "name": "firmware-xl7pg", "specHash": "77ff47d775"}
2024-04-09T16:33:54Z    INFO    EphemeralRunner Ephemeral runner container is still running {"ephemeralrunner": {"name":"firmware-xl7pg-runner-qsls5","namespace":"actions-runners"}}
2024-04-09T16:33:55Z    INFO    EphemeralRunner Ephemeral runner container is still running {"ephemeralrunner": {"name":"firmware-xl7pg-runner-qsls5","namespace":"actions-runners"}}
2024-04-09T16:33:55Z    INFO    EphemeralRunner Updating ephemeral runner status with pod phase {"ephemeralrunner": {"name":"firmware-xl7pg-runner-qsls5","namespace":"actions-runners"}, "statusPhase": "Running", "statusReason": "", "statusMessage": ""}
2024-04-09T16:33:55Z    INFO    EphemeralRunner Updated ephemeral runner status with pod phase  {"ephemeralrunner": {"name":"firmware-xl7pg-runner-qsls5","namespace":"actions-runners"}}
2024-04-09T16:33:55Z    INFO    EphemeralRunnerSet  Ephemeral runner counts {"ephemeralrunnerset": {"name":"firmware-xl7pg","namespace":"actions-runners"}, "pending": 0, "running": 3, "finished": 0, "failed": 0, "deleting": 0}
2024-04-09T16:33:55Z    INFO    EphemeralRunnerSet  Scaling comparison  {"ephemeralrunnerset": {"name":"firmware-xl7pg","namespace":"actions-runners"}, "current": 3, "desired": 3}
2024-04-09T16:33:55Z    INFO    EphemeralRunnerSet  Updating status with current runners count  {"ephemeralrunnerset": {"name":"firmware-xl7pg","namespace":"actions-runners"}, "count": 3}
2024-04-09T16:33:55Z    INFO    EphemeralRunner Ephemeral runner container is still running {"ephemeralrunner": {"name":"firmware-xl7pg-runner-qsls5","namespace":"actions-runners"}}
2024-04-09T16:33:55Z    INFO    EphemeralRunnerSet  Ephemeral runner counts {"ephemeralrunnerset": {"name":"firmware-xl7pg","namespace":"actions-runners"}, "pending": 0, "running": 3, "finished": 0, "failed": 0, "deleting": 0}
2024-04-09T16:33:55Z    INFO    EphemeralRunnerSet  Scaling comparison  {"ephemeralrunnerset": {"name":"firmware-xl7pg","namespace":"actions-runners"}, "current": 3, "desired": 3}
2024-04-09T16:33:55Z    INFO    AutoscalingRunnerSet    Find existing ephemeral runner set  {"autoscalingrunnerset": {"name":"firmware","namespace":"actions-runners"}, "name": "firmware-xl7pg", "specHash": "77ff47d775"}

### Runner Pod Logs

```shell
Also not relevant, but here's a snippet:

[RUNNER 2024-04-09 16:33:47Z INFO RunnerServer] Establish connection with 100 seconds timeout.
[RUNNER 2024-04-09 16:33:47Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[RUNNER 2024-04-09 16:33:47Z INFO RunnerServer] EstablishVssConnection
[RUNNER 2024-04-09 16:33:47Z INFO RunnerServer] Establish connection with 60 seconds timeout.
[RUNNER 2024-04-09 16:33:47Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[RUNNER 2024-04-09 16:33:47Z INFO RunnerServer] EstablishVssConnection
[RUNNER 2024-04-09 16:33:47Z INFO RunnerServer] Establish connection with 60 seconds timeout.
[RUNNER 2024-04-09 16:33:47Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[RUNNER 2024-04-09 16:33:48Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[RUNNER 2024-04-09 16:33:48Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[RUNNER 2024-04-09 16:33:48Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[RUNNER 2024-04-09 16:33:48Z INFO MessageListener] VssConnection created
[RUNNER 2024-04-09 16:33:48Z INFO Terminal] WRITE LINE:

√ Connected to GitHub
[RUNNER 2024-04-09 16:33:48Z INFO Terminal] WRITE LINE:

[RUNNER 2024-04-09 16:33:49Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2024-04-09 16:33:49Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2024-04-09 16:33:49Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[RUNNER 2024-04-09 16:33:49Z INFO MessageListener] Session created.
[RUNNER 2024-04-09 16:33:49Z INFO Terminal] WRITE LINE: Current runner version: '2.315.0'
Current runner version: '2.315.0'
[RUNNER 2024-04-09 16:33:49Z INFO Terminal] WRITE LINE: 2024-04-09 16:33:49Z: Listening for Jobs
2024-04-09 16:33:49Z: Listening for Jobs
[RUNNER 2024-04-09 16:33:49Z INFO JobDispatcher] Set runner/worker IPC timeout to 30 seconds.
github-actions[bot] commented 5 months ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

wwalters12 commented 5 months ago

@laban-eilers-simplisafe For some additional context, here is a comment on the PR that introduced the commented out gha_job_queue_duration_seconds metric: https://github.com/actions/actions-runner-controller/pull/2559#issuecomment-1792155499

nikola-jokic commented 5 months ago

Thank you @wwalters12 for linking my comment!

I will close this one for now. This metric should be added in the future, but I cannot say for sure when.