knkarthik opened this issue 3 months ago
I also tried the same config with GKE standard cluster and I'm running into https://github.com/actions/actions-runner-controller/issues/3132.
Hey @knkarthik,
I'm not sure that you are using the right service account. You should not use the controller's service account, but rather the service account that has the permissions you posted.
Thanks for the reply, and sorry for the confusion @nikola-jokic. I'm indeed using `gke-autopilot-gha-rs-kube-mode` as the service account, which has the necessary permissions, afaik.
The following is actually commented out in my values file, but in my post it was not. I've removed it from my original post now to make that clear.
```yaml
controllerServiceAccount:
  namespace: actions
  name: gha-runner-scale-set-controller-gha-rs-controller
```
Can you please monitor the cluster and run kubectl describe when the workflow pod is created?
@nikola-jokic I did some digging and unfortunately the pod only exists for < 1s, so I'm not able to describe it. However, when I run `kubectl get events`, I get an `OutOfcpu` warning for the `-workflow` pod.
pod. So this seems to be the same issue as https://github.com/actions/actions-runner-controller/discussions/2527 and https://github.com/kubernetes/kubernetes/issues/115325.
```
> kubectl get events -n actions
LAST SEEN  TYPE     REASON                  OBJECT                                                   MESSAGE
9m4s       Normal   WaitForPodScheduled     persistentvolumeclaim/gke-autopilot-c4pk8-runner-hqz89-work  waiting for pod gke-autopilot-c4pk8-runner-hqz89 to be scheduled
9m3s       Normal   WaitForFirstConsumer    persistentvolumeclaim/gke-autopilot-c4pk8-runner-hqz89-work  waiting for first consumer to be created before binding
9m4s       Warning  FailedScheduling        pod/gke-autopilot-c4pk8-runner-hqz89                     0/2 nodes are available: waiting for ephemeral volume controller to create the persistentvolumeclaim "gke-autopilot-c4pk8-runner-hqz89-work". preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling..
12m        Normal   WaitForPodScheduled     persistentvolumeclaim/gke-autopilot-c4pk8-runner-lxzqj-work  waiting for pod gke-autopilot-c4pk8-runner-lxzqj to be scheduled
11m        Normal   ExternalProvisioning    persistentvolumeclaim/gke-autopilot-c4pk8-runner-lxzqj-work  waiting for a volume to be created, either by external provisioner "pd.csi.storage.gke.io" or manually created by system administrator
12m        Normal   Provisioning            persistentvolumeclaim/gke-autopilot-c4pk8-runner-lxzqj-work  External provisioner is provisioning volume for claim "actions/gke-autopilot-c4pk8-runner-lxzqj-work"
11m        Normal   Provisioning            persistentvolumeclaim/gke-autopilot-c4pk8-runner-lxzqj-work  External provisioner is provisioning volume for claim "actions/gke-autopilot-c4pk8-runner-lxzqj-work"
11m        Normal   Provisioning            persistentvolumeclaim/gke-autopilot-c4pk8-runner-lxzqj-work  External provisioner is provisioning volume for claim "actions/gke-autopilot-c4pk8-runner-lxzqj-work"
11m        Normal   Provisioning            persistentvolumeclaim/gke-autopilot-c4pk8-runner-lxzqj-work  External provisioner is provisioning volume for claim "actions/gke-autopilot-c4pk8-runner-lxzqj-work"
11m        Normal   ProvisioningSucceeded   persistentvolumeclaim/gke-autopilot-c4pk8-runner-lxzqj-work  Successfully provisioned volume pvc-91216e22-4299-422f-977b-51f3fcb219e1
9m15s      Warning  OutOfcpu                pod/gke-autopilot-c4pk8-runner-lxzqj-workflow            Node didn't have enough resource: cpu, requested: 4000, used: 1849, capacity: 1930
11m        Normal   Scheduled               pod/gke-autopilot-c4pk8-runner-lxzqj                     Successfully assigned actions/gke-autopilot-c4pk8-runner-lxzqj to gk3-autopilot-pov-pool-2-3bb9a724-7q2p
10m        Warning  FailedMount             pod/gke-autopilot-c4pk8-runner-lxzqj                     MountVolume.SetUp failed for volume "pod-templates" : configmap "pod-templates" not found
11m        Normal   SuccessfulAttachVolume  pod/gke-autopilot-c4pk8-runner-lxzqj                     AttachVolume.Attach succeeded for volume "pvc-91216e22-4299-422f-977b-51f3fcb219e1"
10m        Normal   Pulling                 pod/gke-autopilot-c4pk8-runner-lxzqj                     Pulling image "ghcr.io/actions/actions-runner:latest"
10m        Normal   Pulled                  pod/gke-autopilot-c4pk8-runner-lxzqj                     Successfully pulled image "ghcr.io/actions/actions-runner:latest" in 238.11642ms (238.134258ms including waiting)
10m        Normal   Created                 pod/gke-autopilot-c4pk8-runner-lxzqj                     Created container runner
10m        Normal   Started                 pod/gke-autopilot-c4pk8-runner-lxzqj                     Started container runner
```
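The `OutOfcpu` warning shows the workflow pod requesting 4000m CPU on a node with only 1930m capacity. On Autopilot, nodes are provisioned based on the pod's declared requests, so one thing worth trying is making the requests explicit in the pod template so the autoscaler can size a node before the pod is bound. A sketch of what that might look like (the CPU/memory values here are placeholders, not taken from this issue):

```yaml
# Hypothetical pod-templates ConfigMap with explicit requests for the
# $job container, so GKE Autopilot provisions a node large enough
# before the workflow pod is scheduled.
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-templates
data:
  default.yaml: |
    apiVersion: v1
    kind: PodTemplate
    spec:
      containers:
        - name: $job
          resources:
            requests:
              cpu: "4"       # placeholder; match what the job actually needs
              memory: "8Gi"  # placeholder
            limits:
              cpu: "4"
              memory: "8Gi"
```

On Autopilot, limits without matching requests (or vice versa) can also be mutated by the platform, so keeping them equal avoids surprises.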
@knkarthik, not sure if it is just that, but I managed to pass in resources for a GPU job with a ConfigMap very similar to yours, just removing the comments on the `$job` name line. I don't know if you added those just here, but it might be worth trying without them.
Mine looks like this:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-templates
data:
  default.yaml: |
    ---
    apiVersion: v1
    kind: PodTemplate
    metadata:
      annotations:
        annotated-by: "extension"
      labels:
        labeled-by: "extension"
    spec:
      containers:
        - name: $job
          resources:
            limits:
              nvidia.com/gpu: "1"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
```
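One other thing worth checking, given the `FailedMount ... configmap "pod-templates" not found` event above: the kubernetes container hook only picks up a template if the runner pod mounts the ConfigMap and points `ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE` at the mounted file, and the ConfigMap has to exist in the same namespace as the runners. A sketch of that wiring in the scale-set values (the mount path here is an assumption, adjust to your setup):

```yaml
# Hypothetical runner scale-set values fragment: mount the pod-templates
# ConfigMap into the runner pod and tell the kubernetes hook where to
# find the template file.
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/pod-templates/default.yaml  # assumed path
        volumeMounts:
          - name: pod-templates
            mountPath: /home/runner/pod-templates
            readOnly: true
    volumes:
      - name: pod-templates
        configMap:
          name: pod-templates  # must exist in the runners' namespace
```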
Checks
Controller Version
0.8.3
Deployment Method
Helm
To Reproduce
runner-scale-set-values.yaml
pod-template.yaml
rbac.yaml
Describe the bug
I can see that a runner pod is created, but it fails to create the job pod with the message:
`Error: pod failed to come online with error: Error: Pod gke-autopilot-4vvrh-runner-74czb-workflow is unhealthy with phase status Failed`
Describe the expected behavior
I expected it to create a job pod.
Additional Context
It works if I don't try to customize the job pod, i.e. if I use a config like the one below. But I want to give more resources to the actual pod that runs the job, so I need to use pod-templates to customize it.
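For context, a minimal kubernetes-mode values file without any pod template customization looks roughly like this (the repo URL, secret name, and storage class are placeholders, not taken from this issue):

```yaml
# Hypothetical minimal gha-runner-scale-set values for kubernetes container
# mode, with no pod-templates customization.
githubConfigUrl: https://github.com/myorg/myrepo  # placeholder
githubConfigSecret: gh-app-secret                 # placeholder
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "standard-rwo"  # placeholder; a GKE default class
    resources:
      requests:
        storage: 1Gi
```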
Controller Logs
Runner Pod Logs