actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

Controller deletes Pods before available Nodes are provisioned, leaves workflows in pending state #3468

Closed dillon-cullinan closed 5 months ago

dillon-cullinan commented 5 months ago

Controller Version

0.9.1

Deployment Method

Helm

To Reproduce

1. Installed the runner controller following the docs: https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/quickstart-for-actions-runner-controller
2. Deployed a runner scale set following the docs, with a slightly customized values.yaml. minRunners must be 0 for a minimal reproduction.

Describe the bug

The runner controller kills a pod after it has spent 1 minute unable to obtain a node to run on. The workflow never starts and is left in a pending state, and the controller makes no attempt to retry.

The provisioned node becomes available shortly after the pod is killed, and cancelling and rerunning the workflow then allows it to run properly.

It consistently happens at the 1 minute mark, so I'm guessing it's internal to the controller and is some kind of timeout. For the record, there is a similar bug related to runner registration: if the Docker image being pulled takes too long, the controller revokes the registration, which causes the pod to die once the pull finishes. I can create another ticket for this if needed, but it looks like very similar timeout behavior.

Describe the expected behavior

The controller should be more patient with node provisioning and Docker image pulls, or these timeouts should be configurable. This issue does not exist in 0.9.0. The workflow should also not be left in a pending state. If the controller gives up on obtaining a pod, then the workflow should be cancelled or the controller should retry.
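
To make the request concrete, here is a minimal Python sketch of the behavior being asked for. It is not how ARC is implemented: `SchedulingPolicy`, `pending_timeout_s`, `max_attempts`, and `simulate` are hypothetical names made up for illustration, and the 60 second default only mirrors the 1 minute cutoff observed in this report.

```python
from dataclasses import dataclass


@dataclass
class SchedulingPolicy:
    """Hypothetical knobs; ARC does not necessarily expose anything like these."""
    pending_timeout_s: float = 60.0  # how long a pod may wait for a node
    max_attempts: int = 1            # 1 means give up after the first timeout


def simulate(node_ready_after_s: float, policy: SchedulingPolicy) -> str:
    """Return "running" if a pod ends up scheduled, otherwise "pending".

    The node keeps provisioning while pods are deleted and recreated, which
    matches the report: the node appears shortly after the pod is killed.
    """
    elapsed = 0.0
    for _ in range(policy.max_attempts):
        elapsed += policy.pending_timeout_s
        if node_ready_after_s <= elapsed:
            return "running"
        # Timeout reached: this pod is deleted; a retry would create a new one.
    return "pending"
```

Assuming a node that needs 90 seconds to come up, `simulate(90, SchedulingPolicy())` returns `"pending"`, which is the stuck workflow described above. Either `SchedulingPolicy(pending_timeout_s=300)` or `SchedulingPolicy(max_attempts=3)` makes the same call return `"running"`.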

Additional Context

This is the exact values.yaml used for the runner scale set. The only requirements to reproduce both issues described above are a large image and a node that takes longer than 1 minute to spin up. The other values are irrelevant.

---
runnerScaleSetName: <redacted>
githubConfigUrl: <redacted>
githubConfigSecret: <redacted>
maxRunners: 16
minRunners: 0
metadata:
  name: <redacted>
  namespace: gha-runner-scale-set-controller
template:
  metadata:
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  spec:
    dockerdWithinRunnerContainer: true
    nodeSelector:
      cloud.google.com/gke-nodepool: build-heavy-compute
      kubernetes.io/arch: amd64
      kubernetes.io/os: linux
    containers:
    - name: runner
      image: <redacted> # This is a large image ~15GB, ~6 min to pull uncached
      command:
        - bash
        - -c
        - "mkdir -p /home/runner/.docker/docker /home/runner/.local/share && ln -s /home/runner/.docker/docker /home/runner/.local/share/docker && /bin/bash /usr/bin/entrypoint-dind-rootless.sh && /bin/bash /runner/run.sh"
      securityContext:
        privileged: true
      volumeMounts:
      - mountPath: /tmp
        name: tmpdir

    tolerations:
    - key: "<redacted>.com/workload"
      operator: "Equal"
      value: "build-heavy-compute"
      effect: "NoSchedule"
    volumes:
    - name: tmpdir
      emptyDir: {}
    resources:
      requests:
        cpu: "12000m"
        memory: "28Gi"
        ephemeral-storage: "48Gi"
      limits:
        cpu: "12000m"
        memory: "30Gi"

Controller Logs

https://gist.github.com/dillon-cullinan/db470ee50ab1b411589142d907764e9c

Runner Pod Logs

Describe Logs

https://gist.github.com/dillon-cullinan/8fafe89e61e325c6f82db977e7d52e7c

Pod Logs

None, the pod is unable to obtain a node

nikola-jokic commented 5 months ago

Closing in favor of https://github.com/actions/actions-runner-controller/issues/3450