actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

scale-set terminates the runner pod when minRunners=0 if the pod was in pending state for a long time. #3816

Open amir-bialek opened 9 hours ago

amir-bialek commented 9 hours ago

Checks

Controller Version

0.9.1

Deployment Method

Helm

To Reproduce

Running with `minRunners=0`:

1. Configure an EKS cluster with the following:
   - GitHub ARC controller.
   - One scale-set with `minRunners=0`.
   - Cluster Autoscaler enabled.

2. Trigger a GitHub Actions workflow that requires a runner.

3. Observe the following sequence of events:
   - The scale-set listener receives the job and deploys a new pod.
   - The pod enters a **pending state** because no nodes are available.

4. Wait for the **Cluster Autoscaler** to respond:
   - The Autoscaler starts a scale-up cycle for the node group.

5. The pod 'disappears'.

6. A new node is deployed.

7. The pod 're-appears'. Observed behavior of the pod:
   - The pod is scheduled on the newly provisioned node.
   - It pulls the necessary image.
   - The init container runs.
   - The main container starts running.
   - The pod does not run the workflow job; instead, it is terminated.

With `minRunners=1`:
Everything is the same, except that at step 5 the pod stays in pending, and at step 7 it runs the job.
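
The two runs differ only in the `minRunners` value. As a sketch, the relevant fragment of the values file for the run that works is shown below; all other settings are the same as listed under Additional Context.

```yaml
# values.yaml fragment: the only setting changed between the two runs
minRunners: 1   # with 0, the pending pod disappears at step 5 and is later terminated
maxRunners: 5
```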

Describe the bug

The controller terminates the runner pod and does not reschedule it. The workflow on GitHub shows "waiting for runner to come back online".

Describe the expected behavior

The runner-pod should run the workflow-job. The controller should not terminate the runner-pod.

Additional Context

The default values.yaml is overridden with:

githubConfigUrl: "my_org_repo"
githubConfigSecret: github-token

minRunners: 0
maxRunners: 5

runnerScaleSetName: "my_scale_Set"

controllerServiceAccount:
  namespace: github-controller
  name: github-runner-controller

runnerGroup: "default"

template:
  spec:
    tolerations:
    - key: "need-gpu"
      operator: "Equal"
      value: "yes"
      effect: "NoSchedule"

    imagePullSecrets:
    - name: registry
    initContainers:
      - name: init-share-repo
        image: alpine/git:v2.45.2
        command: ["/bin/sh", "-c"]
        args:
        - sh "/tmp/data-script/runme.sh"
        env:
          - name: READ_TOKEN
            valueFrom:
              secretKeyRef:
                name: github-read-token
                key: token

        volumeMounts:
          - name: ed-share-folder
            mountPath: /tmp/shared-repos
          - name: data-script
            mountPath: /tmp/data-script

    containers:
      - name: runner
        image: my_custom_image
        command: ["/home/runner/run.sh"]
        imagePullPolicy: IfNotPresent

        resources: 
          requests:
            nvidia.com/gpu: 1
            cpu: "7000m"
            memory: "20Gi"
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
          - name: empty-docker-config
            mountPath: /home/runner/.docker
          - name: docker-login-config
            mountPath: /home/runner/.docker/config.json
            subPath: config.json
            readOnly: true
          - name: ed-share-folder
            mountPath: /some/path
    volumes:
    - name: data-script
      configMap:
        name: data-script-cm
    - name: docker-login-config
      secret:
        secretName: some-secret
        items:
        - key: .dockerconfigjson
          path: config.json
    - name: empty-docker-config
      emptyDir: {}
    - name: ed-share-folder
      emptyDir: {}

containerMode:
  type: "dind"
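
The toleration in the pod template implies that the GPU node group carries a matching taint. The node group definition is not part of this report, so the following is only an assumption about how that taint would appear on a Node object once the Cluster Autoscaler brings a node up:

```yaml
# Assumed taint on the GPU nodes (not taken from the actual cluster);
# it mirrors the key, value, and effect of the toleration in the template above.
spec:
  taints:
    - key: "need-gpu"
      value: "yes"
      effect: "NoSchedule"
```

A runner pod without the toleration would not be scheduled onto such a node, which is why the template sets it explicitly.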

Controller Logs

https://gist.github.com/amir-bialek/9a9bd3ab45847b4dd285b86cf51ea069

Runner Pod Logs

Not relevant: the issue comes from the controller.

amir-bialek commented 5 hours ago

Similar issue here: https://github.com/actions/actions-runner-controller/issues/2850