Closed sofiegonzalez closed 1 month ago
Hello! Thank you for filing an issue.
The maintainers will triage your issue shortly.
In the meantime, please take a look at the troubleshooting guide for bug reports.
If this is a feature request, please review our contribution guidelines.
I was able to spin up a workflow pod by adding the service account to the runner spec:
spec:
  serviceAccount: gha-runner-scale-set-gha-rs-kube-mode
I got the solution from this comment. I don't understand why this fixed my issue, as the pod already has this service account definition in the pod spec on the cluster:
...
securityContext: {}
serviceAccount: gha-runner-scale-set-gha-rs-kube-mode
serviceAccountName: gha-runner-scale-set-gha-rs-kube-mode
...
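For reference, the override being discussed lives in the runner pod template of the scale set's values.yaml. A minimal sketch of what that might look like in kubernetes mode (field values here are assumptions for illustration, not copied from the gists in this thread):

```yaml
# values.yaml sketch for the gha-runner-scale-set chart (kubernetes mode).
# Storage class and image are hypothetical placeholders.
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "my-storage-class"   # hypothetical
    resources:
      requests:
        storage: 1Gi
template:
  spec:
    # The workaround from the comment above: pin the chart-created
    # kube-mode service account explicitly on the runner pod.
    serviceAccount: gha-runner-scale-set-gha-rs-kube-mode
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
```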
Hey @sofiegonzalez,
Can you please show the AutoscalingRunnerSet yaml definition when you don't specify the service account? I applied a spec similar to yours, and I was able to run the workflow pod.
Hey @nikola-jokic, sorry for the late response.
Yes, so I removed the serviceAccount: gha-runner-scale-set-gha-rs-kube-mode
field from the values.yaml and re-applied it to the cluster. The AutoscalingRunnerSet yaml looks like this for the runners in kubernetes mode: https://gist.github.com/sofiegonzalez/36108f31678e4113f6911d489e1a780d
This is what the AutoscalingRunnerSet looked like previously, with the service account set: https://gist.github.com/sofiegonzalez/a9a8e447924294d060533ea472f6557e
No worries @sofiegonzalez!
I'm glad that you resolved the problem, but I don't understand why adding the serviceAccount
field fixes the issue. The field has been deprecated, so my best guess is that either the old service account is being used during the upgrade, or there is a problem with the older Kubernetes service.
Can you please try installing a new scale set without the serviceAccount field? A fresh install, not an upgrade. If it works, then I might know what the problem is. I cannot reproduce this issue, so I'm trying my best to understand it from the description.
What do you mean by the old service account and older Kubernetes service? The service account I am referencing is the one created by the gha-runner-scale-set Helm chart. We are on Kubernetes v1.27.
I will try a fresh install without the serviceAccount field and update here, but I'm not going to do a fresh install of the gha-runner-scale-set-controller chart unless you think I need to.
Hey @nikola-jokic, just did a fresh install. Here is the values.yaml I used: https://gist.github.com/sofiegonzalez/bc12dd21217bdbba392c481b644527eb and an example workflow I created to run a personal container image in a job: https://gist.github.com/sofiegonzalez/16ae560f6ff3072c754b0eabc1c2850f
This time the workflow pod was able to initialize and run my personal container. I really don't understand what changed; before this, I had done both upgrades and fresh installs while trying to get the workflow pod to start up.
I think I have an idea what the problem was. When doing upgrades, removing additional resources can sometimes take a long time. This problem is fixed by this PR. When you did the upgrade, the resource was probably not completely removed, so after the upgrade, the role associated with that service account was likely in a bad state, causing no token to be mounted on the pod and therefore leaving it without permissions.
That is the reason I asked you to do a fresh install :relaxed:. This should be fixed now, since we merged the PR I linked above.
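For anyone who hits a similar state: the chart-created kube-mode service account is expected to be bound to RBAC that lets the runner's container hook manage workflow pods. A rough sketch of what that Role might contain (rule contents are assumptions based on what a pod-creating hook generally needs, not copied from the chart):

```yaml
# Sketch of the RBAC the kube-mode service account needs (names and
# rules assumed). If the Role or its RoleBinding is stale or missing
# after an upgrade, the workflow pod lacks permissions and fails.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gha-runner-scale-set-gha-rs-kube-mode
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/exec", "pods/log"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "create", "delete"]
```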
That makes sense, thanks for the clarification!
No worries! Let's close this issue now; we can re-open it if something else turns out to be a problem, especially since it works with the fresh install and the PR I linked is already merged. Thank you for providing this information! The details written here and in the container hook issue helped me better understand the problem.
Checks
Controller Version
latest
Deployment Method
Helm
Describe the bug
Hi, my main issue is that CI fails when I try to start a container job in
containerMode: kubernetes
with the error Error: HttpError: HTTP request failed. This is blocking us from making progress. I have followed the GitHub Actions scale sets video on YouTube and tried to recreate the same configuration. The main difference is that I am using a PVC I created through a manifest and am applying with Terraform. I am also using a Docker image we built from a public Docker repo; it is pullable without authentication.
Right as the container job starts, the pod dies and fails to initialize. I can see the PVC was bound correctly. I am not sure what the
Error: HttpError: HTTP request failed
error means or what it is referring to.
Describe the expected behavior
The container job should start up and create a workflow pod to run the container.
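A minimal workflow that exercises this path might look like the following sketch (the runner label and image name are assumptions; they must match the scale set's installation name and an image pullable from your cluster):

```yaml
# .github/workflows/container-job.yml (sketch; names are placeholders)
name: container-job
on: push
jobs:
  build:
    # Must equal the Helm release / installation name of the scale set
    runs-on: gha-runner-scale-set
    # In containerMode: kubernetes, this job container runs in a
    # separate workflow pod created by the runner's container hook.
    container:
      image: ghcr.io/my-org/my-image:latest   # hypothetical image
    steps:
      - run: echo "running inside the job container"
```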
Additional Context
Controller Logs
Runner Pod Logs