actions/runner-container-hooks: Runner Container Hooks for GitHub Actions (MIT License)

Job pod failed to start on GKE Autopilot with container hooks (kubernetes mode) #152

Open knkarthik opened 3 months ago

knkarthik commented 3 months ago

Checks

Controller Version

0.8.3

Deployment Method

Helm

Checks

To Reproduce

runner-scale-set-values.yaml

githubConfigUrl: "https://github.com/my/repo"
githubConfigSecret: github-token
runnerScaleSetName: "gke-autopilot"
maxRunners: 2
minRunners: 0
template:
  spec:
    securityContext:
      fsGroup: 1001
    serviceAccountName: gke-autopilot-gha-rs-kube-mode
    volumes:
      - name: work
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 4Gi
      - name: pod-templates
        configMap:
          name: pod-templates
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command:
          - /home/runner/run.sh
        env:
          - name: ACTIONS_RUNNER_CONTAINER_HOOKS
            value: /home/runner/k8s/index.js
          - name: ACTIONS_RUNNER_POD_NAME
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.name
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/pod-templates/default.yaml
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "true"
          - name: GITHUB_ACTIONS_RUNNER_EXTRA_USER_AGENT
            value: actions-runner-controller/0.8.3
        resources:
          requests:
            cpu: 250m
            memory: 1Gi
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: pod-templates
            mountPath: /home/runner/pod-templates
            readOnly: true

pod-template.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-templates
data:
  default.yaml: |
    ---
    apiVersion: v1
    kind: PodTemplate
    metadata:
      annotations:
        annotated-by: "extension"
      labels:
        labeled-by: "extension"
    spec:
      serviceAccountName: gke-autopilot-gha-rs-kube-mode
      securityContext:
        fsGroup: 1001
      containers:
        - name: $job # overwrites job container
          resources:
            requests:
              cpu: "3800m"
              memory: "4500"
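One detail worth flagging in the template above: Kubernetes parses an unsuffixed quantity such as "4500" as 4500 bytes, so the memory request as written is almost certainly smaller than intended. A sketch of the same block with an explicit unit, assuming 4500 MiB was the goal:

```yaml
# Hypothetical correction: add a unit suffix to the memory quantity.
# Without a suffix, Kubernetes reads memory: "4500" as 4500 bytes.
resources:
  requests:
    cpu: "3800m"     # 3.8 cores
    memory: "4500Mi" # assumed intent: 4500 MiB
```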

rbac.yaml

---
# Source: gha-runner-scale-set/templates/kube_mode_serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gke-autopilot-gha-rs-kube-mode
  namespace: actions
---
# Source: gha-runner-scale-set/templates/kube_mode_role.yaml
# default permission for runner pod service account in kubernetes mode (container hook)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gke-autopilot-gha-rs-kube-mode
  namespace: actions
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["get", "create"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "create", "delete"]
---
# Source: gha-runner-scale-set/templates/kube_mode_role_binding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gke-autopilot-gha-rs-kube-mode
  namespace: actions
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: gke-autopilot-gha-rs-kube-mode
subjects:
  - kind: ServiceAccount
    name: gke-autopilot-gha-rs-kube-mode
    namespace: actions
---

Describe the bug

I can see that a runner pod is created, but it fails to create the job pod with the message `Error: pod failed to come online with error: Error: Pod gke-autopilot-4vvrh-runner-74czb-workflow is unhealthy with phase status Failed`.

Describe the expected behavior

I expected it to create a job pod.

Additional Context

It works if I don't try to customize the job pod, i.e. if I use a config like the one below. But I want to give more resources to the pod that actually runs the job, so I need to use pod templates to customize it.


githubConfigUrl: "https://github.com/my/org"
githubConfigSecret: github-token
runnerScaleSetName: "gke-autopilot"
maxRunners: 2
minRunners: 0
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    resources:
      requests:
        storage: 4Gi
template:
  spec:
    securityContext:
      fsGroup: 1001
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]

controllerServiceAccount:
  namespace: actions
  name: gha-runner-scale-set-controller-gha-rs-controller

Controller Logs

No errors, just regular logs. I can provide it if required.

Runner Pod Logs

[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] Publish step telemetry for current step {
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]   "action": "Pre Job Hook",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]   "type": "runner",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]   "stage": "Pre",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]   "stepId": "06f9adc3-e79d-405b-91eb-a7f72f1e56c4",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]   "stepContextName": "06f9adc3e79d405b91eba7f72f1e56c4",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]   "result": "failed",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]   "errorMessages": [
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]     "Error: pod failed to come online with error: Error: Pod gke-autopilot-4vvrh-runner-74czb-workflow is unhealthy with phase status Failed",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]     "Process completed with exit code 1.",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]     "Executing the custom container implementation failed. Please contact your self hosted runner administrator."
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]   ],
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]   "executionTimeInSeconds": 42,
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]   "startTime": "2024-03-27T15:18:57.1056563Z",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]   "finishTime": "2024-03-27T15:19:38.206926Z",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]   "containerHookData": "{\"hookScriptPath\":\"/home/runner/k8s/index.js\"}"
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] }.
[WORKER 2024-03-27 15:19:38Z INFO StepsRunner] Update job result with current step result 'Failed'.
github-actions[bot] commented 3 months ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

knkarthik commented 3 months ago

I also tried the same config with GKE standard cluster and I'm running into https://github.com/actions/actions-runner-controller/issues/3132.

nikola-jokic commented 3 months ago

Hey @knkarthik,

I'm not sure that you are using the right service account. You should not use the controller's service account, but rather the service account with the permissions you posted.

knkarthik commented 3 months ago

Thanks for the reply, and sorry for the confusion @nikola-jokic.

I'm indeed using gke-autopilot-gha-rs-kube-mode, which has the necessary permissions, as the service account, AFAIK.

The following was actually commented out in my values file, but not in my post. I've removed it from my original post now to make that clear.

controllerServiceAccount:
  namespace: actions
  name: gha-runner-scale-set-controller-gha-rs-controller
nikola-jokic commented 3 months ago

Can you please monitor the cluster and run `kubectl describe` on the workflow pod when it is created?

knkarthik commented 3 months ago

@nikola-jokic I did some digging, and unfortunately the pod appears for less than a second, so I'm not able to describe it. However, when I run `kubectl get events`, I get an OutOfcpu warning for the -workflow pod. So this seems to be the same issue as https://github.com/actions/actions-runner-controller/discussions/2527 and https://github.com/kubernetes/kubernetes/issues/115325.

> kubectl get events -n actions

LAST SEEN   TYPE      REASON                   OBJECT                                                        MESSAGE
9m4s        Normal    WaitForPodScheduled      persistentvolumeclaim/gke-autopilot-c4pk8-runner-hqz89-work   waiting for pod gke-autopilot-c4pk8-runner-hqz89 to be scheduled
9m3s        Normal    WaitForFirstConsumer     persistentvolumeclaim/gke-autopilot-c4pk8-runner-hqz89-work   waiting for first consumer to be created before binding
9m4s        Warning   FailedScheduling         pod/gke-autopilot-c4pk8-runner-hqz89                          0/2 nodes are available: waiting for ephemeral volume controller to create the persistentvolumeclaim "gke-autopilot-c4pk8-runner-hqz89-work". preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling..
12m         Normal    WaitForPodScheduled      persistentvolumeclaim/gke-autopilot-c4pk8-runner-lxzqj-work   waiting for pod gke-autopilot-c4pk8-runner-lxzqj to be scheduled
11m         Normal    ExternalProvisioning     persistentvolumeclaim/gke-autopilot-c4pk8-runner-lxzqj-work   waiting for a volume to be created, either by external provisioner "pd.csi.storage.gke.io" or manually created by system administrator
12m         Normal    Provisioning             persistentvolumeclaim/gke-autopilot-c4pk8-runner-lxzqj-work   External provisioner is provisioning volume for claim "actions/gke-autopilot-c4pk8-runner-lxzqj-work"
11m         Normal    Provisioning             persistentvolumeclaim/gke-autopilot-c4pk8-runner-lxzqj-work   External provisioner is provisioning volume for claim "actions/gke-autopilot-c4pk8-runner-lxzqj-work"
11m         Normal    Provisioning             persistentvolumeclaim/gke-autopilot-c4pk8-runner-lxzqj-work   External provisioner is provisioning volume for claim "actions/gke-autopilot-c4pk8-runner-lxzqj-work"
11m         Normal    Provisioning             persistentvolumeclaim/gke-autopilot-c4pk8-runner-lxzqj-work   External provisioner is provisioning volume for claim "actions/gke-autopilot-c4pk8-runner-lxzqj-work"
11m         Normal    ProvisioningSucceeded    persistentvolumeclaim/gke-autopilot-c4pk8-runner-lxzqj-work   Successfully provisioned volume pvc-91216e22-4299-422f-977b-51f3fcb219e1
9m15s       Warning   OutOfcpu                 pod/gke-autopilot-c4pk8-runner-lxzqj-workflow                 Node didn't have enough resource: cpu, requested: 4000, used: 1849, capacity: 1930
11m         Normal    Scheduled                pod/gke-autopilot-c4pk8-runner-lxzqj                          Successfully assigned actions/gke-autopilot-c4pk8-runner-lxzqj to gk3-autopilot-pov-pool-2-3bb9a724-7q2p
10m         Warning   FailedMount              pod/gke-autopilot-c4pk8-runner-lxzqj                          MountVolume.SetUp failed for volume "pod-templates" : configmap "pod-templates" not found
11m         Normal    SuccessfulAttachVolume   pod/gke-autopilot-c4pk8-runner-lxzqj                          AttachVolume.Attach succeeded for volume "pvc-91216e22-4299-422f-977b-51f3fcb219e1"
10m         Normal    Pulling                  pod/gke-autopilot-c4pk8-runner-lxzqj                          Pulling image "ghcr.io/actions/actions-runner:latest"
10m         Normal    Pulled                   pod/gke-autopilot-c4pk8-runner-lxzqj                          Successfully pulled image "ghcr.io/actions/actions-runner:latest" in 238.11642ms (238.134258ms including waiting)
10m         Normal    Created                  pod/gke-autopilot-c4pk8-runner-lxzqj                          Created container runner
10m         Normal    Started                  pod/gke-autopilot-c4pk8-runner-lxzqj                          Started container runner
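The OutOfcpu event above shows the workflow pod requesting 4000m CPU on a node with only 1930m capacity, 1849m of which is already in use, so the kubelet rejects the pod outright. One way forward is to keep the $job container's request within what a node can actually offer; the numbers below are illustrative, not from this thread:

```yaml
# Hypothetical pod template tweak: request less CPU than the target
# node's remaining capacity. Compare the figures against
# `kubectl describe node` (Allocatable and Allocated resources)
# before settling on values.
spec:
  containers:
    - name: $job
      resources:
        requests:
          cpu: "500m"   # illustrative; must fit node's free capacity
          memory: "2Gi"
```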
c-fteixeira commented 2 months ago

@knkarthik, not sure if it is just that, but I managed to pass resources for a GPU job with a ConfigMap very similar to yours, just without the comment on the $job name line. I don't know if you added that comment just here, but it might be worth trying without it.

Mine looks like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-templates
data:
  default.yaml: |
    ---
    apiVersion: v1
    kind: PodTemplate
    metadata:
      annotations:
        annotated-by: "extension"
      labels:
        labeled-by: "extension"
    spec:
      containers:
        - name: $job
          resources:
            limits:
              nvidia.com/gpu: "1"
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4