[X] I am using charts that are officially provided
Controller Version
0.9.3
Deployment Method
Helm
Checks
[X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
[X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
Related to this line of code: https://github.com/actions/actions-runner-controller/blob/master/controllers/actions.github.com/ephemeralrunner_controller.go#L202
If an ephemeral runner fails to start more than 5 times, it is marked as failed. If multiple runners fail to start, the failed runners count against the max runner limit and block new runners from starting.
1. Create a runner set with any maximum number of runners
2. Make the runners fail to start and let them be marked as failed until the count approaches the runner maximum
3. Try spinning up new runners; the failed runners still take up space, which either blocks new runners from starting or caps how many new runners can be spun up
We need the failure limit to be configurable, and failed runners should also be cleaned up automatically after some time.
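To make the arithmetic concrete, here is a minimal sketch of the behavior as I understand it from the above. It is a simplified model, not the controller's actual code, and `availableSlots` and its parameters are hypothetical names used only for illustration.

```go
package main

import "fmt"

// availableSlots models the behavior described in this report: runners that
// were marked as failed still count against the configured maximum.
// The function and parameter names are hypothetical, not taken from the controller.
func availableSlots(maxRunners, running, failed int) int {
	free := maxRunners - running - failed
	if free < 0 {
		return 0
	}
	return free
}

func main() {
	// With a maximum of 10 runners, 3 running and 7 marked as failed,
	// no new runner can start even though only 3 are doing any work.
	fmt.Println(availableSlots(10, 3, 7))
}
```

Once the failed runners are removed, the same set would have 7 free slots again.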
Describe the expected behavior
We would like to be able to set the failure threshold ourselves, so that we have more time to catch these failed ephemeral runners. Something like this would be great:
case len(ephemeralRunner.Status.Failures) > failedRetryLimit:
We should be able to set it in the Helm chart for the actions runner controller. It would also be great if the controller automatically cleaned up the failed runners, perhaps once a day.
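A rough sketch of what I have in mind, in case it helps the discussion. Everything here is an assumption for illustration: `Config`, `FailedRetryLimit`, `FailedRunnerTTL`, `Runner`, and both functions are hypothetical names, not existing chart values or controller types.

```go
package main

import (
	"fmt"
	"time"
)

// Config holds the two settings requested in this issue. Both fields are
// hypothetical; they are not existing chart values or controller flags.
type Config struct {
	FailedRetryLimit int           // failures tolerated before a runner is marked as failed
	FailedRunnerTTL  time.Duration // how long a failed runner is kept before cleanup
}

// Runner is a simplified stand-in for an ephemeral runner resource.
type Runner struct {
	Name     string
	Failures int
	FailedAt time.Time // zero value means the runner was never marked as failed
}

// shouldMarkFailed mirrors the proposed check: the limit comes from
// configuration instead of a fixed value.
func shouldMarkFailed(r Runner, cfg Config) bool {
	return r.Failures > cfg.FailedRetryLimit
}

// expiredFailedRunners returns the names of runners that were marked as
// failed at least FailedRunnerTTL ago and are therefore due for cleanup.
func expiredFailedRunners(runners []Runner, cfg Config, now time.Time) []string {
	var expired []string
	for _, r := range runners {
		if !r.FailedAt.IsZero() && now.Sub(r.FailedAt) >= cfg.FailedRunnerTTL {
			expired = append(expired, r.Name)
		}
	}
	return expired
}

func main() {
	cfg := Config{FailedRetryLimit: 10, FailedRunnerTTL: 24 * time.Hour}
	now := time.Now()
	runners := []Runner{
		{Name: "runner-a", Failures: 11, FailedAt: now.Add(-25 * time.Hour)},
		{Name: "runner-b", Failures: 11, FailedAt: now.Add(-1 * time.Hour)},
		{Name: "runner-c", Failures: 2},
	}
	// runner-c is under the limit, so it would not be marked as failed.
	fmt.Println(shouldMarkFailed(runners[2], cfg))
	// Only runner-a has been failed for longer than the TTL.
	fmt.Println(expiredFailedRunners(runners, cfg, now))
}
```

A periodic pass like `expiredFailedRunners` would cover the "once a day" cleanup, while the configurable limit covers the threshold request.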
Describe the bug
Related to this issue: https://github.com/actions/actions-runner-controller/discussions/3300
Additional Context
Controller Logs
Runner Pod Logs