actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

Add Customizable Failure Threshold for Ephemeral Runner Retries #3700

Open ali-kafel opened 3 months ago

ali-kafel commented 3 months ago

Checks

Controller Version

0.9.3

Deployment Method

Helm

To Reproduce

Related to this line of code: https://github.com/actions/actions-runner-controller/blob/master/controllers/actions.github.com/ephemeralrunner_controller.go#L202

If an ephemeral runner fails to start up more than 5 times, it is marked as failed. If multiple runners fail to start up, they count against the max runner limit and block new runners from starting.

1. Create a runner set with any maximum number of runners
2. Make the runners fail so that they are marked as failed, until the number of failed runners approaches the runner maximum
3. Try spinning up new runners: the failed runners still occupy slots, which blocks new runners from starting or caps how many new runners can be spun up
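
The blocking effect in the steps above can be illustrated with a little arithmetic. This is only a simplified model of the behavior we observe, not the controller's actual code; `availableSlots` and its parameters are hypothetical names used for illustration.

```go
package main

// availableSlots is a hypothetical helper modelling the observed behavior:
// failed runners keep counting against the runner set's maximum, so they
// reduce the number of new runners that can be started.
func availableSlots(maxRunners, running, failed int) int {
	free := maxRunners - running - failed
	if free < 0 {
		return 0
	}
	return free
}
```

For example, with a maximum of 10 runners, 2 running and 8 marked as failed, this model leaves no room for new runners, even though only 2 runners are doing any work.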

Describe the bug

Related to this issue: https://github.com/actions/actions-runner-controller/discussions/3300

Related to this line of code: https://github.com/actions/actions-runner-controller/blob/master/controllers/actions.github.com/ephemeralrunner_controller.go#L202

If an ephemeral runner fails to start up more than 5 times, it is marked as failed. If multiple runners fail to start up, they count against the max runner limit and block new runners from starting. We need this threshold to be configurable, and we also need some way to clean up failed runners after some time.

Describe the expected behavior

We would like to be able to set the failure threshold ourselves, so that we can buy more time to catch these failing ephemeral runners before they are marked as failed. Something like this would be great:

case len(ephemeralRunner.Status.Failures) > failedRetryLimit:

We should be able to set it in the Helm chart for the actions-runner-controller. It would also be great if the controller automatically cleaned up failed runners, perhaps once a day.

Additional Context

N/A

Controller Logs

N/A

Runner Pod Logs

N/A