Docker Container Action Jobs failing to schedule on autoscaled cluster

rteeling-evernorth commented 7 months ago

Checks

[X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
[X] I am using charts that are officially provided

Controller Version

0.7.0, 0.8.2

Deployment Method

ArgoCD

Checks

[X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
[X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Use k8s Cluster Autoscaler
2. Create ScaleSet using Kubernetes mode
3. Run a docker container-based action
4. The cluster must not have the capacity to schedule the action's job pod

Describe the bug

When the k8s job pod tries to run, Kubernetes cannot find a node to schedule it on and throws the following event/error: Node didn't have enough resource: cpu, requested: 2000, used: 13920, capacity: 15890

The K8S Job has the following error on it: Job has reached the specified backoff limit

This causes the Actions job to fail

Describe the expected behavior

The job pod should wait for a new node to come online (average: 45 seconds) and then be scheduled on it

Additional Context

gha-runner-scale-set:

  githubConfigUrl: changeme

  githubConfigSecret: github-arc-secret

  minRunners: 0
  runnerGroup: "changeme"

  runnerScaleSetName: "changeme"

  githubServerTLS:
    certificateFrom:
      configMapKeyRef:
        name: my-cacert
        key: ca.crt
    runnerMountPath: /usr/local/share/ca-certificates/

  containerMode:
    type: "kubernetes"  ## type can be set to dind or kubernetes
    kubernetesModeWorkVolumeClaim:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "gp2-encrypted"
      resources:
        requests:
          storage: 5Gi
  template:
    ### CUSTOM ###
    spec:
      nodeSelector:
        github: "true"
      tolerations:
      - effect: NoSchedule
        key: dedicated
        operator: Equal
        value: github
      priorityClassName: github
      ### END CUSTOM ###
      securityContext:
        fsGroup: 123
      containers:
      - name: runner
        # image: ghcr.io/actions/actions-runner:latest
        image: ACTIONS-RUNNER-IMAGE-MIRROR/actions-runner:2.314.0
        command: ["/home/runner/run.sh"]
        resources:
          limits:
            cpu: "200m"
            memory: "512Mi"

        env:
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "true"
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/pod-templates/default.yaml
        volumeMounts:
          - name: pod-templates
            mountPath: /home/runner/pod-templates
            readOnly: true
      volumes:
      - name: pod-templates
        configMap:
          name: pod-templates

  controllerServiceAccount:

    namespace: arc-system
    name: github-actions-controller-gha-rs-controller

Controller Logs

My employer's open source contribution policy prohibits me from posting this information in public; however, I can post relevant redacted portions upon request.

Runner Pod Logs

My employer's open source contribution policy prohibits me from posting this information in public; however, I can post relevant redacted portions upon request.

nikola-jokic commented 7 months ago

Hey @rteeling-evernorth,

This issue is related to the hook :relaxed:. Are you using the default hook implementation in your container mirror? If so, the job pod is scheduled on the same node as the runner pod, which means the problem is the capacity of that node, not the scheduler. By default, the hook bypasses the scheduler so that the job pod can use the volume mount from the runner pod. This can be avoided if you use ReadWriteMany volumes, but that requires you to configure the environment variables appropriately.
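
For anyone reading along, here is a rough sketch of what that change could look like in the values file from the report. It is not a tested configuration. The storage class name `my-rwx-storage-class` is hypothetical and stands in for any class that supports ReadWriteMany (EBS-backed classes such as gp2 generally support only ReadWriteOnce). The environment variable name `ACTIONS_RUNNER_USE_KUBE_SCHEDULER` is an assumption about the hook's option for handing the job pod to the scheduler, so please verify it against the runner-container-hooks documentation for your version.

```yaml
gha-runner-scale-set:
  containerMode:
    type: "kubernetes"
    kubernetesModeWorkVolumeClaim:
      # ReadWriteMany lets the job pod mount the work volume from a different node
      accessModes: ["ReadWriteMany"]
      # Hypothetical name; replace with a storage class that supports ReadWriteMany
      storageClassName: "my-rwx-storage-class"
      resources:
        requests:
          storage: 5Gi
  template:
    spec:
      containers:
      - name: runner
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "true"
          # Assumed variable name; check the hook docs before relying on it
          - name: ACTIONS_RUNNER_USE_KUBE_SCHEDULER
            value: "true"
```

If this works as described, the job pod would go through the normal scheduler and could stay Pending until the Cluster Autoscaler adds a node, instead of being rejected by a runner node that is already full.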

rteeling-evernorth commented 7 months ago

Ah! That would explain it. Everything in my mirror is off-the-shelf for 0.8.2. I was using the default volume mount from the values file, which is ReadWriteOnce. That would cause the behavior I am seeing. Thank you so much for the info!

nikola-jokic commented 7 months ago

You are welcome!
