actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

Invalid values for the metrics gha_registered_runners and gha_idle_runners in ghalistener #3546

Open verdel opened 1 month ago

verdel commented 1 month ago

Checks

Controller Version

0.9.2

Deployment Method

Helm

To Reproduce

-

Describe the bug

We have a GitHub Actions workflow that runs once a day, with a dedicated type of runner allocated specifically for it. While the workflow runs, the listener receives batches of job messages. In the last batch of the run, statistics.totalIdleRunners and statistics.totalRegisteredRunners contain non-zero values.

These values are published by the controller as Prometheus metrics. After this last message, the metric values stay unchanged until the workflow runs again the following day.
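To make the staleness concrete, here is a minimal model of the behavior described above, written in Python for brevity. It is not ARC's code (ARC is written in Go): the `Gauges` class and `replay` function are hypothetical stand-ins, and the sketch assumes the gauges are only written when a message carrying statistics arrives. The two payloads are copied from the listener logs in this issue.

```python
import json

# Statistics payloads copied from the two "New runner scale set statistics."
# log lines in this issue (messages 1090 and 1091).
MESSAGES = [
    '{"totalAvailableJobs":0,"totalAcquiredJobs":3,"totalAssignedJobs":3,'
    '"totalRunningJobs":3,"totalRegisteredRunners":4,"totalBusyRunners":3,'
    '"totalIdleRunners":0}',
    '{"totalAvailableJobs":0,"totalAcquiredJobs":0,"totalAssignedJobs":0,'
    '"totalRunningJobs":0,"totalRegisteredRunners":2,"totalBusyRunners":0,'
    '"totalIdleRunners":1}',
]


class Gauges:
    """Hypothetical stand-in for gha_registered_runners and gha_idle_runners."""

    def __init__(self):
        self.registered = 0
        self.idle = 0

    def on_message(self, raw):
        # Assumption: gauges change only when a message with statistics arrives.
        stats = json.loads(raw)
        self.registered = stats["totalRegisteredRunners"]
        self.idle = stats["totalIdleRunners"]


def replay(messages):
    """Feed every message to a fresh set of gauges and return the final state."""
    gauges = Gauges()
    for raw in messages:
        gauges.on_message(raw)
    return gauges


gauges = replay(MESSAGES)
# No further message arrives after 1091, so these values are what a scrape
# would keep seeing until the next run.
print(gauges.registered, gauges.idle)
```

Under that assumption, the gauges keep the values from message 1091 (two registered runners, one idle) even after the runners are gone.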

Is it possible to fix this behavior, or does it require changes on the GitHub side?

Describe the expected behavior

The Prometheus metrics exposed by ghalistener should reflect the actual state of the runners.

Additional Context

-

Controller Logs

2024-05-26T07:24:29Z    INFO    listener-app.listener   Getting next message    {"lastMessageID": 1089}
2024-05-26T07:24:37Z    INFO    listener-app.listener   Processing message  {"messageId": 1090, "messageType": "RunnerScaleSetJobMessages"}
2024-05-26T07:24:37Z    INFO    listener-app.listener   New runner scale set statistics.    {"statistics": {"totalAvailableJobs":0,"totalAcquiredJobs":3,"totalAssignedJobs":3,"totalRunningJobs":3,"totalRegisteredRunners":4,"totalBusyRunners":3,"totalIdleRunners":0}}
2024-05-26T07:24:37Z    INFO    listener-app.listener   Job completed message received. {"RequestId": 669571, "Result": "succeeded", "RunnerId": 83622, "RunnerName": "terraform-drift-checker-hxppm-runner-9q2hx"}
2024-05-26T07:24:37Z    INFO    listener-app.listener   Deleting last message   {"lastMessageID": 1090}
2024-05-26T07:24:38Z    INFO    listener-app.worker.kubernetesworker    Calculated target runner count  {"assigned job": 3, "decision": 3, "min": 0, "max": 30, "currentRunnerCount": 3, "jobsCompleted": 1}
2024-05-26T07:24:38Z    INFO    listener-app.worker.kubernetesworker    Compare {"original": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":-1,\"patchID\":-1,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}", "patch": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":3,\"patchID\":6917,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}"}
2024-05-26T07:24:38Z    INFO    listener-app.worker.kubernetesworker    Preparing EphemeralRunnerSet update {"json": "{\"spec\":{\"patchID\":6917,\"replicas\":3}}"}
2024-05-26T07:24:38Z    INFO    listener-app.worker.kubernetesworker    Ephemeral runner set scaled.    {"namespace": "github-actions-runner", "name": "terraform-drift-checker-hxppm", "replicas": 3}
2024-05-26T07:24:38Z    INFO    listener-app.listener   Getting next message    {"lastMessageID": 1090}
2024-05-26T07:24:50Z    INFO    listener-app.listener   Processing message  {"messageId": 1091, "messageType": "RunnerScaleSetJobMessages"}
2024-05-26T07:24:50Z    INFO    listener-app.listener   New runner scale set statistics.    {"statistics": {"totalAvailableJobs":0,"totalAcquiredJobs":0,"totalAssignedJobs":0,"totalRunningJobs":0,"totalRegisteredRunners":2,"totalBusyRunners":0,"totalIdleRunners":1}}
2024-05-26T07:24:50Z    INFO    listener-app.listener   Job completed message received. {"RequestId": 669572, "Result": "succeeded", "RunnerId": 83625, "RunnerName": "terraform-drift-checker-hxppm-runner-6dmvl"}
2024-05-26T07:24:50Z    INFO    listener-app.listener   Job completed message received. {"RequestId": 669573, "Result": "succeeded", "RunnerId": 83623, "RunnerName": "terraform-drift-checker-hxppm-runner-lcc6k"}
2024-05-26T07:24:50Z    INFO    listener-app.listener   Job completed message received. {"RequestId": 669574, "Result": "succeeded", "RunnerId": 83624, "RunnerName": "terraform-drift-checker-hxppm-runner-d2bv7"}
2024-05-26T07:24:50Z    INFO    listener-app.listener   Deleting last message   {"lastMessageID": 1091}

Runner Pod Logs

-

github-actions[bot] commented 1 month ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

nikola-jokic commented 1 month ago

Hey @verdel,

You are right: we receive an empty batch when no activity is needed, so the metrics would be incorrect once the cluster becomes idle. Ideally, the fix should be made on the API side so that the metrics reflect the correct values. However, we can optimistically set these metrics to the desired count when the cluster becomes idle. Let me discuss it with the team, and I'll get back to you with more information :relaxed:
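One possible reading of the optimistic update, sketched in Python for brevity. This is not ARC's actual code or a committed design: `RunnerGauges`, `apply_batch`, and the `desired_runners` parameter are hypothetical names, and the sketch assumes "desired count" means the number of runners the scale set should hold while idle (0 here, matching `"min": 0` in the logs above).

```python
from dataclasses import dataclass


@dataclass
class RunnerGauges:
    """Hypothetical stand-in for the two gauges discussed in this issue."""
    registered: int = 0
    idle: int = 0


def apply_batch(gauges, statistics, desired_runners):
    """Update the gauges from one poll result.

    statistics is a dict of scale set statistics, or None when the poll
    returned an empty batch (nothing to do).
    """
    if statistics is not None:
        gauges.registered = statistics["totalRegisteredRunners"]
        gauges.idle = statistics["totalIdleRunners"]
    else:
        # Assumption: with no activity, every runner that should exist is
        # idle, so both gauges are set to the desired runner count.
        gauges.registered = desired_runners
        gauges.idle = desired_runners
    return gauges


gauges = RunnerGauges()

# Last message of the run (values from message 1091 in the logs).
apply_batch(gauges, {"totalRegisteredRunners": 2, "totalIdleRunners": 1}, 0)
after_last_message = (gauges.registered, gauges.idle)

# Next poll returns an empty batch: fall back to the desired count.
apply_batch(gauges, None, 0)
after_idle = (gauges.registered, gauges.idle)

print(after_last_message, after_idle)
```

The trade-off is that the idle values are inferred rather than reported, so they can briefly disagree with what the API would say if it sent statistics at that moment.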