workflow runner pods fail instantly if pods are unschedulable

jonathan-fileread commented 1 week ago

Checks

[X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
[X] I am using charts that are officially provided

Controller Version

0.9.2

Deployment Method

Helm

Checks

[X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
[X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

Install the ARC controller and runner scale set, version 0.9.2.
Define ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE so it points at the pod template, and set containerMode type to "kubernetes".
Define a pod template like this:

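apiVersion: v1
kind: ConfigMap
metadata:
  name: runner-pod-template  # matches the configMap name referenced by the pod-templates volume
data: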
  default.yml: |
    "apiVersion": "v1"
    "kind": "PodTemplate"
    "metadata":
      "name": "runner-pod-template"
    "spec":
      "containers":
      - "name": "$job"
        "resources":
          "limits":
            "cpu": "3000m"
          "requests":
            "cpu": "3000m"

Describe the bug

GHA jobs fail instantly if the workflow pod is unschedulable while it waits for a node to become available (for example, when the CPU/memory request is high and the node autoscaler still has to add a node).

[Screenshot attached: 2024-07-05, 5:13:03 PM]
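To confirm that a workflow pod is in this state, one option is to look at its `PodScheduled` condition. The sketch below is only an illustration: `is_unschedulable` is a hypothetical helper, and the sample status (including its message text) is made up to match the shape of `kubectl get pod <name> -o json` output, not copied from my cluster.

```python
from typing import Any, Mapping


def is_unschedulable(pod_status: Mapping[str, Any]) -> bool:
    """Return True if the scheduler has reported the pod as unschedulable.

    Kubernetes records this as a PodScheduled condition with
    status "False" and reason "Unschedulable".
    """
    for cond in pod_status.get("conditions") or []:
        if (
            cond.get("type") == "PodScheduled"
            and cond.get("status") == "False"
            and cond.get("reason") == "Unschedulable"
        ):
            return True
    return False


# Illustrative status only; the message text is invented for this example.
pending_status = {
    "phase": "Pending",
    "conditions": [
        {
            "type": "PodScheduled",
            "status": "False",
            "reason": "Unschedulable",
            "message": "0/3 nodes are available: 3 Insufficient cpu.",
        }
    ],
}

if __name__ == "__main__":
    print(is_unschedulable(pending_status))
```

A pod in this state is normally still `Pending` and can become schedulable once the autoscaler adds a node, which is why failing the job immediately is a problem.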

Describe the expected behavior

There should be a timeout field, either in the runner scale set or in the container hooks pod template, that lets the workflow pod wait up to x minutes to be scheduled once another node becomes available.
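Roughly, the behavior I have in mind is a bounded wait instead of an immediate failure. This is a minimal sketch of the idea, not how the container hooks are actually implemented: `wait_until_running`, `get_phase`, and the timeout/poll parameters are all hypothetical names, and `get_phase` stands in for whatever call returns the pod's current phase.

```python
import time
from typing import Callable


def wait_until_running(
    get_phase: Callable[[], str],
    timeout_seconds: float,
    poll_interval_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll the pod phase until it is Running or the timeout expires.

    Returns True once the pod is Running, False if it is still Pending
    when the deadline passes. Any other phase raises RuntimeError.
    """
    deadline = clock() + timeout_seconds
    while True:
        phase = get_phase()
        if phase == "Running":
            return True
        if phase != "Pending":
            # Failed, Succeeded, Unknown: waiting longer will not help.
            raise RuntimeError(f"pod entered unexpected phase: {phase}")
        if clock() >= deadline:
            return False
        # Still Pending (for example unschedulable): give the autoscaler time.
        sleep(poll_interval_seconds)


if __name__ == "__main__":
    # Simulated pod that becomes Running on the third poll.
    phases = iter(["Pending", "Pending", "Running"])
    print(wait_until_running(lambda: next(phases), timeout_seconds=600,
                             poll_interval_seconds=0.01))
```

With something like this, the x minutes from the description would map to `timeout_seconds`, and the job would only fail if the pod is still Pending after that period.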

Additional Context

template:
  spec:
    initContainers:
      - name: kube-init
        image: ghcr.io/actions/actions-runner:latest
        command: ["/bin/sh", "-c"]
        args:
          - |
            sudo chown -R 1001:123 /home/runner/_work
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
    securityContext:
      fsGroup: 123 ## needed to resolve permission issues with mounted volume. https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors#error-access-to-the-path-homerunner_work_tool-is-denied
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
        - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
          value: /home/runner/pod-templates/default.yml
        - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
          value: "false"  ## To allow jobs without a job container to run, set ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER to false on your runner container. This instructs the runner to disable this check.
        volumeMounts:
        - name: pod-templates
          mountPath: /home/runner/pod-templates
          readOnly: true
    volumes:
      - name: work
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes: ["ReadWriteOnce"]
              storageClassName: "managed-csi"
              resources:
                requests:
                  storage: ${local.volume_claim_size}
      - name: pod-templates
        configMap:
          name: "runner-pod-template"

containerMode:
  type: "kubernetes"  ## type can be set to dind or kubernetes
  ## the following is required when containerMode.type=kubernetes
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    # For local testing, use https://github.com/openebs/dynamic-localpv-provisioner/blob/develop/docs/quickstart.md to provide dynamic provision volume with storageClassName: openebs-hostpath
    storageClassName: "managed-csi"
    resources:
      requests:
        storage: 50Gi

Pod Template YAML:
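apiVersion: v1
kind: ConfigMap
metadata:
  name: runner-pod-template  # matches the configMap name referenced by the pod-templates volume
data: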
  default.yml: |
    "apiVersion": "v1"
    "kind": "PodTemplate"
    "metadata":
      "name": "runner-pod-template"
    "spec":
      "containers":
      - "name": "$job"
        "resources":
          "limits":
            "cpu": "3000m"
          "requests":
            "cpu": "3000m"

Controller Logs

https://gist.github.com/jonathan-fileread/602f6d5fd948bf505a2fa7f5dbd78069

Runner Pod Logs

https://gist.githubusercontent.com/jonathan-fileread/96db9941abc5faba985aae78ef6b3760/raw/196644c97c7698e51bf6ae9b50dbf769dd4f1825/gistfile1.txt

github-actions[bot] commented 1 week ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

ropelli commented 4 days ago

actions / actions-runner-controller