Closed dillon-cullinan closed 5 months ago
Closing in favor of https://github.com/actions/actions-runner-controller/issues/3450
Checks
Controller Version
0.9.1
Deployment Method
Helm
To Reproduce
Describe the bug
The runner controller kills a pod after it has spent 1 minute unable to obtain a node to run on. The workflow never starts and is left in a pending state, and the controller makes no attempt to retry.
Shortly after the pod is killed, the provisioned node becomes available, and cancelling and rerunning the workflow allows it to run properly.
It consistently happens at 1 minute, so I'm guessing it's internal to the controller and is some kind of timeout. For the record, there is a similar bug related to runner registration: if the Docker image you are pulling takes too long, the controller revokes the registration, causing the pod to die after the pull finishes. I can create another ticket for this if needed, but it appears to be very similar timeout behavior.
Describe the expected behavior
The controller should be more patient with nodes and Docker pulls, or these timeouts should be configurable. This issue does not exist in 0.9.0. The workflow should also not be left in a pending state: if the controller gives up on obtaining a pod, then the workflow should be cancelled or the controller should retry.
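A minimal sketch of the behavior requested above, not the controller's actual code. `SchedulingPolicy`, `node_wait_timeout_s`, `max_attempts`, and `run_with_retries` are hypothetical names invented for illustration and do not correspond to any existing controller setting. The idea is only that the wait for a node is configurable, the pod is retried after a timeout, and the job ends in an explicit terminal state instead of staying pending.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SchedulingPolicy:
    # Both fields are hypothetical settings, not real controller options.
    node_wait_timeout_s: float = 60.0  # how long one attempt waits for a node
    max_attempts: int = 3              # how many times the pod is (re)created


def run_with_retries(
    pod_scheduled: Callable[[], bool],
    policy: SchedulingPolicy,
    now: Callable[[], float],
    sleep: Callable[[float], None],
    poll_interval_s: float = 5.0,
) -> str:
    """Return "running" if the pod gets a node, otherwise "cancelled"."""
    for _attempt in range(policy.max_attempts):
        deadline = now() + policy.node_wait_timeout_s
        while now() < deadline:
            if pod_scheduled():
                return "running"
            sleep(poll_interval_s)
        # Timed out: the pod would be deleted here and recreated on the next attempt.
    # Every attempt timed out: report a terminal state rather than leaving the job pending.
    return "cancelled"
```

With real time this could be called by passing `time.monotonic` and `time.sleep` as `now` and `sleep`. Injecting the clock keeps the sketch testable without waiting for an actual node.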
Additional Context
Exact values.yaml used for runner scale set.
The only requirements to reproduce both described issues are a large image and a node that takes longer than 1 minute to spin up. Other values are meaningless.
Controller Logs
https://gist.github.com/dillon-cullinan/db470ee50ab1b411589142d907764e9c
Runner Pod Logs
Describe Logs
https://gist.github.com/dillon-cullinan/8fafe89e61e325c6f82db977e7d52e7c
Pod Logs
None, the pod is unable to obtain a node