actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

Failed EphemeralRunners block launching new pods #3685

Open igaskin opened 1 month ago

igaskin commented 1 month ago

Checks

Controller Version

0.8.3

Deployment Method

Helm

To Reproduce

1. Trigger a `FailedScheduling` event.
2. Wait for 5 failures in pod scheduling.
3. Recover the cluster.
4. New ephemeral runner pods will not be scheduled to meet capacity.

Describe the bug

When EphemeralRunners enter the Failed state, they stay stuck in that state, which prevents new pods from being launched. This issue has previously been noted in the discussions linked below.

status:
  currentRunners: 17
  failedEphemeralRunners: 16
  pendingEphemeralRunners: 0
  runningEphemeralRunners: 1 

https://github.com/actions/actions-runner-controller/discussions/3300
https://github.com/actions/actions-runner-controller/discussions/3610
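
The counters in the status above can be read as simple arithmetic. The sketch below only illustrates that reading and is not the controller's actual logic: it assumes `currentRunners` is the sum of the failed, pending, and running counts (which holds for the values reported here). The `desired` argument and the `usable_runners` and `shortfall` helpers are hypothetical names introduced for illustration.

```python
# Status counters copied from the report above.
status = {
    "currentRunners": 17,
    "failedEphemeralRunners": 16,
    "pendingEphemeralRunners": 0,
    "runningEphemeralRunners": 1,
}


def usable_runners(status: dict) -> int:
    """Runners that can still pick up a job (pending or running)."""
    return status["pendingEphemeralRunners"] + status["runningEphemeralRunners"]


def shortfall(status: dict, desired: int) -> int:
    """Runners missing from the target if failed ones are not counted.

    `desired` is a hypothetical target replica count, not a field from the report.
    """
    return max(desired - usable_runners(status), 0)


if __name__ == "__main__":
    accounted = (
        status["failedEphemeralRunners"]
        + status["pendingEphemeralRunners"]
        + status["runningEphemeralRunners"]
    )
    print("counters add up:", accounted == status["currentRunners"])
    print("usable runners:", usable_runners(status))
    print("shortfall for a target of 17:", shortfall(status, 17))
```

Read this way, 16 of the 17 slots are occupied by runners that can never take a job, which matches the symptom described in the report.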

Describe the expected behavior

Failed EphemeralRunners should be cleared so that scheduling can be retried.
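
Until failed runners are cleared automatically, one possible manual step is to find them first. The sketch below covers only that listing step, working on JSON shaped like a Kubernetes list response (an object with an `items` array). It assumes each EphemeralRunner exposes its state under `status.phase` with the value `Failed`; that field path, the `failed_runner_names` helper, and the sample objects are assumptions for illustration, so check them against the CRD installed in your cluster.

```python
import json


def failed_runner_names(listing: dict) -> list[tuple[str, str]]:
    """Return (namespace, name) for every item whose status.phase is "Failed"."""
    failed = []
    for item in listing.get("items", []):
        # Objects without a status block are skipped rather than treated as failed.
        phase = (item.get("status") or {}).get("phase")
        if phase == "Failed":
            meta = item.get("metadata", {})
            failed.append((meta.get("namespace", ""), meta.get("name", "")))
    return failed


# Hypothetical sample data; the runner names are made up.
sample = json.loads("""
{
  "items": [
    {"metadata": {"namespace": "my-scaleset-ns", "name": "runner-a"},
     "status": {"phase": "Failed"}},
    {"metadata": {"namespace": "my-scaleset-ns", "name": "runner-b"},
     "status": {"phase": "Running"}},
    {"metadata": {"namespace": "my-scaleset-ns", "name": "runner-c"}}
  ]
}
""")

if __name__ == "__main__":
    for namespace, name in failed_runner_names(sample):
        print(f"{namespace}/{name}")
```

The resulting names could then be reviewed and deleted by hand. The function deliberately does not delete anything itself, so it is safe to run against a live listing.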

Additional Context

https://github.com/actions/actions-runner-controller/discussions/3610
https://github.com/actions/actions-runner-controller/discussions/3300

Controller Logs

2024-06-20T19:18:03Z    INFO    listener-app.worker.kubernetesworker    Ephemeral runner set scaled.    {"namespace": "my-scaleset-ns", "name": "my-runner-6pzbd", "replicas": 3}
2024-06-20T19:18:03Z    INFO    listener-app.listener   Getting next message    {"lastMessageID": 11}
2024-06-20T19:18:11Z    INFO    listener-app.listener   Getting next message    {"lastMessageID": 14}
2024-06-20T19:18:53Z    INFO    listener-app.listener   Getting next message    {"lastMessageID": 11}
2024-06-20T19:19:01Z    INFO    listener-app.listener   Getting next message    {"lastMessageID": 14}
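
When triaging logs like the ones above, it can help to pull out the structured payload at the end of each line. The sketch below assumes every line has the shape `timestamp LEVEL logger message {json}`, with the JSON object as the final field, as in the lines quoted here. The `parse_line` helper is hypothetical and not part of the controller.

```python
import json

# The first three log lines from the report above.
LINES = [
    '2024-06-20T19:18:03Z    INFO    listener-app.worker.kubernetesworker    '
    'Ephemeral runner set scaled.    '
    '{"namespace": "my-scaleset-ns", "name": "my-runner-6pzbd", "replicas": 3}',
    '2024-06-20T19:18:03Z    INFO    listener-app.listener   '
    'Getting next message    {"lastMessageID": 11}',
    '2024-06-20T19:18:11Z    INFO    listener-app.listener   '
    'Getting next message    {"lastMessageID": 14}',
]


def parse_line(line: str) -> dict:
    """Split one log line into timestamp, level, logger, message, and fields.

    Assumes at least three whitespace-separated tokens before the message and
    that the message itself contains no "{" character.
    """
    head, brace, tail = line.partition("{")
    fields = json.loads(brace + tail) if brace else {}
    timestamp, level, logger, *message = head.split()
    return {
        "timestamp": timestamp,
        "level": level,
        "logger": logger,
        "message": " ".join(message),
        "fields": fields,
    }


if __name__ == "__main__":
    for line in LINES:
        entry = parse_line(line)
        print(entry["timestamp"], entry["message"], entry["fields"])
```

Collecting `fields["lastMessageID"]` across lines makes it easier to spot patterns such as the alternation between 11 and 14 visible in the log excerpt.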

Runner Pod Logs

2024-06-21T16:22:44Z    INFO    listener-app.worker.kubernetesworker    Ephemeral runner set scaled.    {"namespace": "my-scaleset", "name": "my-runner-rpvp2", "replicas": 10}

github-actions[bot] commented 1 month ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

singlewind commented 1 month ago

This happened to me recently as well, after I upgraded to 0.9.3 using a GitHub App. In my case, all the ephemeral runners were stuck in the Terminating state.