actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

Scale Set Controller stuck: thinks runner pods are failed, but they have been deleted #3592

Closed garymm closed 5 months ago

garymm commented 5 months ago

Checks

Controller Version

0.9.2

Deployment Method

Helm

To Reproduce

Not sure, but my guess:
1. Deploy a scale set controller
1. Deploy a scale set
1. Trigger some jobs
1. The jobs fail
1. ??? Not sure ???

Describe the bug

The controller treats the runners as permanently failed and therefore no longer scales the scale set up. In reality, the runner pods have been deleted and no longer exist. The controller log references, for example: "name":"berkeley-gpu-runners-pltxc-runner-s4b7j","namespace":"gh-actions-runner-scale-sets"

But:

❯ kubectl get pod berkeley-gpu-runners-pltxc-runner-s4b7j -n gh-actions-runner-scale-sets
Error from server (NotFound): pods "berkeley-gpu-runners-pltxc-runner-s4b7j" not found

Describe the expected behavior

The controller pod notices that the runner pods no longer exist and reacts appropriately.

Additional Context

Controller values are all defaults.

Scale set values:

githubConfigSecret: pre-defined-secret
githubConfigUrl: https://github.com/Astera-org
controllerServiceAccount:
  namespace: gha-runner-scale-set-controller
  name: gha-runner-scale-set-controller-gha-rs-controller
minRunners: 1
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        resources:
          limits:
            nvidia.com/gpu: 1

Controller Logs

https://gist.github.com/garymm/22ad6e8e9f6a38576e138a72793dc67a

Runner Pod Logs

There are no pods running.

github-actions[bot] commented 5 months ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

garymm commented 5 months ago

Oh, I did find a resource of type ephemeralrunner.actions.github.com, and it shows that the issue is "Pod has failed to start more than 5 times". So I guess I'll close this and vote for https://github.com/actions/actions-runner-controller/issues/2721
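
For anyone hitting the same symptom, here is a minimal sketch of how that resource can be inspected. The function names are hypothetical helpers introduced only for illustration; the namespace in the usage note is the one from this issue, and the runner name is whatever the list command returns in your cluster. Only plain `kubectl get` against the `ephemeralrunners.actions.github.com` resource is assumed.

```shell
# Hypothetical helper names; only the kubectl subcommands and the
# resource name (ephemeralrunners.actions.github.com) are relied upon.

# List all EphemeralRunner resources in the given namespace.
# $1: namespace
list_ephemeral_runners() {
  kubectl get ephemeralrunners.actions.github.com -n "$1"
}

# Print the full object, including its status block, for one EphemeralRunner.
# $1: namespace, $2: EphemeralRunner name
show_ephemeral_runner() {
  kubectl get ephemeralrunners.actions.github.com "$2" -n "$1" -o yaml
}
```

Usage would look like `list_ephemeral_runners gh-actions-runner-scale-sets`, followed by `show_ephemeral_runner gh-actions-runner-scale-sets <runner-name>`. The status section of the YAML output is where a message such as the one quoted above would be expected to show up, even when the corresponding pod no longer exists.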