actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

New jobs are not picked up whilst there are runners sitting in Evicted status #3428

Closed treffynnon closed 2 months ago

treffynnon commented 2 months ago

Checks

Controller Version

0.9.0

Deployment Method

Helm

To Reproduce

1. Set up a runner definition with `minRunners` set to 2.
2. Run GitHub Actions jobs until there are 3 Evicted pods.
   They accumulate over time, typically over 24 hours or so.
3. The Evicted pods hang around for hours (seemingly they'll never disappear).

Describe the bug

When there are three pods with a status of Evicted, the controller ceases to provision new runner pods for jobs. This appears to be related to the values.yaml having `minRunners` set to 2, although that is not confirmed. The jobs begin to queue up and new pods are never created. As soon as one of the Evicted pods is deleted using kubectl, the jobs start running again and new pods are provisioned.

We're running our cluster on Azure AKS with both Linux and Windows nodes. This issue has only affected the Linux nodes and pods so far.

Describe the expected behavior

It should not consider Evicted pods to be the same as a running pod. It should provision new pods despite the presence of Evicted pods.

In an ideal world, Evicted pods would also be deleted after a period of time so that they are cleaned up.
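Until the controller handles this itself, a cleanup along those lines can be sketched with kubectl alone. This is only a rough workaround, not something the controller or chart provides: `delete_failed_pods` and the `KUBECTL` override are hypothetical names introduced here, and the namespace is the `gha-arc-runners` one used elsewhere in this report. Evicted pods are reported with phase `Failed`, so the selector below also removes pods that failed for other reasons.

```shell
# Delete every pod in phase Failed (which includes Evicted pods) in a namespace.
# KUBECTL is a hypothetical override so the command can be previewed or stubbed;
# it defaults to the real kubectl binary.
delete_failed_pods() {
  ns="$1"
  "${KUBECTL:-kubectl}" delete pods --namespace "$ns" --field-selector status.phase=Failed
}

# Example (assumed namespace from this report):
# delete_failed_pods gha-arc-runners
```

Running something like this on a schedule would approximate the "deleted after a period of time" behaviour, at the cost of also discarding failed pods you might have wanted to inspect.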

Additional Context

# Based on https://github.com/actions/actions-runner-controller/blob/master/charts/gha-runner-scale-set/values.yaml

## githubConfigUrl is the GitHub url for where you want to configure runners
## ex: https://github.com/myorg/myrepo or https://github.com/myorg
githubConfigUrl: https://github.com/MyOrg

## githubConfigSecret is the k8s secrets to use when auth with GitHub API.
## You can choose to use GitHub App or a PAT token
## If you have a pre-define Kubernetes secret in the same namespace the gha-runner-scale-set is going to deploy,
## you can also reference it via `githubConfigSecret: pre-defined-secret`.
githubConfigSecret: gha-runner-secret

## minRunners is the min number of idle runners. The target number of runners created will be
## calculated as a sum of minRunners and the number of jobs assigned to the scale set.
minRunners: 2

## The runner group to put this runner set into (new groups are created in the GitHub Organization UI)
runnerGroup: myorg-github-action-runners

## name of the runner scale set to create.  Defaults to the helm release name
runnerScaleSetName: myorg-github-action-runners-linux

## Container mode is an object that provides out-of-box configuration
## for dind and kubernetes mode. Template will be modified as documented under the
## template object.
##
## If any customization is required for dind or kubernetes mode, containerMode should remain
## empty, and configuration should be applied to the template.
# containerMode:
#   type: "dind"  ## type can be set to dind or kubernetes
containerMode:
  type: dind

## template is the PodSpec for each runner Pod
## For reference: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#PodSpec
template:
  spec:
    imagePullSecrets:
      - name: gha-runner-image-pull-secret
    containers:
      - name: runner
        image: ghcr.io/myorg/myorg-github-action-runners-linux:latest
        command: ['/home/runner/run.sh']
        securityContext:
          privileged: true

## Optional controller service account that needs to have required Role and RoleBinding
## to operate this gha-runner-scale-set installation.
## The helm chart will try to find the controller deployment and its service account at installation time.
## In case the helm chart can't find the right service account, you can explicitly pass in the following value
## to help it finish RoleBinding with the right service account.
## Note: if your controller is installed to only watch a single namespace, you have to pass these values explicitly.
controllerServiceAccount:
  namespace: gha-arc-controller
  name: gha-arc-controller-gha-rs-controller

This output is from after I had deleted one of the Evicted pods, but you can still see Evicted runners that have been sitting around for hours.

NAME                                                   READY   STATUS    RESTARTS   AGE
myorg-github-action-runners-linux-4bpdt-runner-gplct     0/2     Evicted   0          6h31m
myorg-github-action-runners-linux-4bpdt-runner-jm4gg     2/2     Running   0          3m4s
myorg-github-action-runners-linux-4bpdt-runner-nwvqr     2/2     Running   0          3m4s
myorg-github-action-runners-linux-4bpdt-runner-qkhx4     0/2     Evicted   0          23h
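
To pick out only the Evicted pods from a listing like the one above, the STATUS column can be filtered with awk. This is a minimal sketch: `evicted_pods` is a hypothetical helper name, and it assumes the default `kubectl get pods` column layout (NAME, READY, STATUS, ...).

```shell
# Read `kubectl get pods` output on stdin and print the names of pods
# whose STATUS column (third field) is exactly "Evicted".
evicted_pods() {
  awk 'NR > 1 && $3 == "Evicted" { print $1 }'
}

# Example (assumed namespace from this report):
# kubectl get pods --namespace gha-arc-runners | evicted_pods
```

The resulting names can then be passed to `kubectl delete pod`, which is the manual step that unblocked the queue in this report.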

Controller Logs

https://gist.github.com/treffynnon/679adf4e02b63154ee961589d50d8d5e

You can see where I deleted one of the offending Evicted pods in the logs here:
https://gist.github.com/treffynnon/679adf4e02b63154ee961589d50d8d5e#file-gha-arc-controller-logs-L77
That line is at `2024-04-12T05:12:14Z`; note the time gap between it and the next line.

Runner Pod Logs

These are not applicable to the issue: the affected runners are all in an Evicted state, so
there are no logs to obtain. For example, here is an attempt to get logs from one of them.

$> kubectl logs --namespace gha-arc-runners myorg-github-action-runners-linux-4bpdt-runner-gplct
Defaulted container "runner" out of: runner, dind, init-dind-externals (init)
Error from server (BadRequest): container "runner" in pod "myorg-github-action-runners-linux-4bpdt-runner-gplct" is not available

github-actions[bot] commented 2 months ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

nikola-jokic commented 2 months ago

Hey @treffynnon,

This is working as intended. We include failed runners in the count to signal that there is a problem on the cluster and to avoid creating resources indefinitely.

We are planning to change this in the future, so I will close this issue in favour of https://github.com/actions/actions-runner-controller/issues/2721