Runners don't scale down if there are any `num_terminating_busy` replicas

oeuftete commented 1 year ago

Checks

[X] I've already read https://github.com/actions-runner-controller/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.
[X] I'm not using a custom entrypoint in my runner image

Controller Version

0.26.0

Helm Chart Version

0.21.0

CertManager Version

1.8.0

Deployment Method

Helm

cert-manager installation

✅

Checks

[X] This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of the contributors and maintainers if your business is so critical that you need priority support)
[X] I've read the release notes before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
[X] My actions-runner-controller version (v0.x.y) does support the feature
[X] I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
[X] I've migrated to the workflow job webhook event (if you are using webhook-driven scaling)

Resource Definitions

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "11"
    meta.helm.sh/release-name: actions-runner-controller
    meta.helm.sh/release-namespace: actions-runner-system
  labels:
    app.kubernetes.io/instance: actions-runner-controller
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: actions-runner-controller
    app.kubernetes.io/version: 0.26.0
    helm.sh/chart: actions-runner-controller-0.21.0
  name: actions-runner-controller
  namespace: actions-runner-system
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: actions-runner-controller
      app.kubernetes.io/name: actions-runner-controller
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
        kubectl.kubernetes.io/default-logs-container: manager
      labels:
        app.kubernetes.io/instance: actions-runner-controller
        app.kubernetes.io/name: actions-runner-controller
    spec:
      containers:
      - args:
        - --metrics-addr=127.0.0.1:8080
        - --enable-leader-election
        - --port=9443
        - --sync-period=5m
        - --default-scale-down-delay=10m
        - --docker-image=docker:dind
        - --runner-image=summerwind/actions-runner:latest
        command:
        - /manager
        env:
        - name: GITHUB_TOKEN
          valueFrom:
            secretKeyRef:
              key: github_token
              name: controller-manager
              optional: true
        - name: GITHUB_APP_ID
          valueFrom:
            secretKeyRef:
              key: github_app_id
              name: controller-manager
              optional: true
        - name: GITHUB_APP_INSTALLATION_ID
          valueFrom:
            secretKeyRef:
              key: github_app_installation_id
              name: controller-manager
              optional: true
        - name: GITHUB_APP_PRIVATE_KEY
          valueFrom:
            secretKeyRef:
              key: github_app_private_key
              name: controller-manager
              optional: true
        - name: GITHUB_BASICAUTH_PASSWORD
          valueFrom:
            secretKeyRef:
              key: github_basicauth_password
              name: controller-manager
              optional: true
        image: summerwind/actions-runner-controller:v0.26.0
        imagePullPolicy: IfNotPresent
        name: manager
        ports:
        - containerPort: 9443
          name: webhook-server
          protocol: TCP
        resources: {}
        securityContext: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/actions-runner-controller
          name: secret
          readOnly: true
        - mountPath: /tmp
          name: tmp
        - mountPath: /tmp/k8s-webhook-server/serving-certs
          name: cert
          readOnly: true
      - args:
        - --secure-listen-address=0.0.0.0:8443
        - --upstream=http://127.0.0.1:8080/
        - --logtostderr=true
        - --v=10
        image: quay.io/brancz/kube-rbac-proxy:v0.13.0
        imagePullPolicy: IfNotPresent
        name: kube-rbac-proxy
        ports:
        - containerPort: 8443
          name: metrics-port
          protocol: TCP
        resources: {}
        securityContext: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: actions-runner-controller
      serviceAccountName: actions-runner-controller
      terminationGracePeriodSeconds: 10
      volumes:
      - name: secret
        secret:
          defaultMode: 420
          secretName: controller-manager
      - name: cert
        secret:
          defaultMode: 420
          secretName: actions-runner-controller-serving-cert
      - emptyDir: {}
        name: tmp
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"actions.summerwind.dev/v1alpha1","kind":"HorizontalRunnerAutoscaler","metadata":{"annotations":{},"name":"evi-platform-study-dev-runner","namespace":"actions-runner-system"},"spec":{"maxReplicas":5,"metrics":[{"scaleDownAdjustment":1,"scaleDownThreshold":"0.3","scaleUpAdjustment":1,"scaleUpThreshold":"0.75","type":"PercentageRunnersBusy"}],"minReplicas":1,"scaleDownDelaySecondsAfterScaleOut":600,"scaleTargetRef":{"name":"evi-platform-study-dev-runner"}}}
  name: evi-platform-study-dev-runner
  namespace: actions-runner-system
spec:
  maxReplicas: 5
  metrics:
  - scaleDownAdjustment: 1
    scaleDownThreshold: "0.3"
    scaleUpAdjustment: 1
    scaleUpThreshold: "0.75"
    type: PercentageRunnersBusy
  minReplicas: 1
  scaleDownDelaySecondsAfterScaleOut: 600
  scaleTargetRef:
    name: evi-platform-study-dev-runner

To Reproduce

1. Run some sort of ill-behaved job on the runner that will be terminated but end up stuck in `Terminating`.  Sorry, I haven't got to the bottom of what's going on with this part, but I understand from the troubleshooting FAQ that this can happen.
2. Drive jobs such that runner replicas are scaled beyond the minimum.
3. Wait for jobs to complete.
4. Observe that runners don't scale down.

Describe the bug

Although there were no non-terminating busy runners, desired replicas remained at 5. Once the Terminating pods were removed by removing their finalizers, scaledown to the reserved limit occurred in the next cycle.

2022-11-23T01:51:10Z    DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Suggested desired replicas of 5 by PercentageRunnersBusy    {"replicas_desired_before": 5, "replicas_desired": 5, "num_runners": 5, "num_runners_registered": 5, "num_runners_busy": 0, "num_terminating_busy": 2, "namespace": "actions-runner-system", "kind": "runnerdeployment", "name": "evi-platform-study-dev-runner", "horizontal_runner_autoscaler": "evi-platform-study-dev-runner", "enterprise": "evidation-health", "organization": "", "repository": ""}
2022-11-23T01:51:10Z    DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Calculated desired replicas of 5    {"horizontalrunnerautoscaler": "actions-runner-system/evi-platform-study-dev-runner", "suggested": 5, "reserved": 0, "min": 1, "max": 5}
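The numbers in these two log lines are consistent with a busy fraction that counts terminating-busy runners in the numerator: 2 of 5 is 0.4, which sits between the configured `scaleDownThreshold` of 0.3 and `scaleUpThreshold` of 0.75, so the suggestion would stay at 5. The Python sketch below models that reading. It is an assumption inferred from the logged values and the HorizontalRunnerAutoscaler spec above, not a copy of ARC's implementation; `suggest_replicas`, `clamp`, and their parameter names are hypothetical.

```python
def suggest_replicas(desired_before, num_busy, num_terminating_busy,
                     scale_up_threshold=0.75, scale_down_threshold=0.3,
                     scale_up_adjustment=1, scale_down_adjustment=1):
    """Hypothetical model of a PercentageRunnersBusy suggestion.

    Assumption: runners that are busy but terminating still count
    toward the busy fraction. Defaults mirror the HRA spec in this issue.
    """
    if desired_before <= 0:
        # Nothing to divide by; leave the count unchanged.
        return desired_before
    fraction_busy = (num_busy + num_terminating_busy) / desired_before
    if fraction_busy >= scale_up_threshold:
        return desired_before + scale_up_adjustment
    if fraction_busy < scale_down_threshold:
        return desired_before - scale_down_adjustment
    # Between the two thresholds: keep the current count.
    return desired_before


def clamp(suggested, min_replicas=1, max_replicas=5):
    """Bound a suggestion by minReplicas/maxReplicas from the HRA spec."""
    return max(min_replicas, min(max_replicas, suggested))


if __name__ == "__main__":
    # Values taken from the log line above: 5 runners, 0 busy, 2 terminating busy.
    print(clamp(suggest_replicas(5, 0, 2)))
    # Same runners once no terminating-busy runners are reported.
    print(clamp(suggest_replicas(5, 0, 0)))
```

Under this model, a single stuck runner out of three (1/3, about 0.33) is also enough to stay at or above a 0.3 scale-down threshold.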

Describe the expected behavior

Even with stuck Terminating pods, idle runners are scaled down.

Whole Controller Logs

https://gist.github.com/oeuftete/f83e1d0ff1efb2197071712e9c17ce6c

Whole Runner Pod Logs

https://gist.github.com/oeuftete/6b92499085d3712b18d25b6d151138a4

Additional Context

No response

github-actions[bot] commented 1 year ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

mumoshu commented 1 year ago

@oeuftete Hey. Could you share runner pod logs and a few kubectl describe pod outputs from ill-behaved runner pods? Those are crucial to further diagnose this issue, as that might tell you why the runner pods are stuck in Terminating.

oeuftete commented 1 year ago

> @oeuftete Hey. Could you share runner pod logs and a few kubectl describe pod outputs from ill-behaved runner pods? Those are crucial to further diagnose this issue, as that might tell you why the runner pods are stuck in Terminating.

@mumoshu I'll have to wait until it happens again, though I expect it will soon.

I'm not really concerned about why the pods are stuck in the context of this issue, though. The issue I wanted to raise here was that having a stuck pod seems to prevent downscaling of runners that are healthy Running but idle. Once I remove the stuck Terminating pods, the downscaling of the idle Running pods happens more or less immediately.
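The thread does not show the exact command used to remove the stuck pods. One common way to clear finalizers is a `kubectl patch` with a merge patch that sets `metadata.finalizers` to null; the Python sketch below only builds that command and runs it on request. The function name and the example pod name are hypothetical, and clearing finalizers bypasses whatever cleanup the finalizer was guarding, so treat it as a last resort.

```python
import json
import subprocess


def remove_finalizers(pod_name, namespace="actions-runner-system", dry_run=True):
    """Build (and optionally run) a kubectl merge patch that clears a pod's finalizers.

    Returns the argv list so the command can be inspected before it is applied.
    """
    patch = json.dumps({"metadata": {"finalizers": None}})
    cmd = ["kubectl", "patch", "pod", pod_name, "-n", namespace,
           "--type=merge", "-p", patch]
    if not dry_run:
        # Requires kubectl on PATH and access to the cluster.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # "example-runner-pod" is a placeholder, not a pod from this issue.
    print(" ".join(remove_finalizers("example-runner-pod")))
```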

mumoshu commented 1 year ago

@oeuftete Thanks for your confirmation! I guess it was just a coincidence. PercentageRunnersBusy works solely from the responses of a few GitHub Actions API calls, which might be cached and therefore lag the actual state of the runners by roughly 60 seconds. ARC doesn't consider pod statuses when calculating the desired replicas. We may be able to tell whether it's actually a coincidence if you could provide the logs I asked for.

oeuftete commented 1 year ago

@mumoshu I've added new logs (including the pod logs, as exported from Datadog) now in the edited summary. I patched the finalizer on the one pod stuck in Terminating at ~16:24:15 UTC. You can see the suggested desired replicas drop rapidly from 3 to 0 once the single stuck Terminating pod is cleaned up.

Edit: added gist for the single stuck pod's describe: https://gist.github.com/oeuftete/b3ae28123d69330638e04aedd5ef6039

❯ grep Suggested /tmp/arc.log | tail -5
2022-12-08T16:19:19Z    DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Suggested desired replicas of 3 by PercentageRunnersBusy    {"replicas_desired_before": 3, "replicas_desired": 3, "num_runners": 3, "num_runners_registered": 3, "num_runners_busy": 0, "num_terminating_busy": 1, "namespace": "actions-runner-system", "kind": "runnerdeployment", "name": "evi-platform-study-dev-runner", "horizontal_runner_autoscaler": "evi-platform-study-dev-runner", "enterprise": "evidation-health", "organization": "", "repository": ""}
2022-12-08T16:24:07Z    DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Suggested desired replicas of 3 by PercentageRunnersBusy    {"replicas_desired_before": 3, "replicas_desired": 3, "num_runners": 3, "num_runners_registered": 3, "num_runners_busy": 0, "num_terminating_busy": 1, "namespace": "actions-runner-system", "kind": "runnerdeployment", "name": "evi-platform-study-dev-runner", "horizontal_runner_autoscaler": "evi-platform-study-dev-runner", "enterprise": "evidation-health", "organization": "", "repository": ""}
2022-12-08T16:28:56Z    DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Suggested desired replicas of 2 by PercentageRunnersBusy    {"replicas_desired_before": 3, "replicas_desired": 2, "num_runners": 3, "num_runners_registered": 3, "num_runners_busy": 0, "num_terminating_busy": 0, "namespace": "actions-runner-system", "kind": "runnerdeployment", "name": "evi-platform-study-dev-runner", "horizontal_runner_autoscaler": "evi-platform-study-dev-runner", "enterprise": "evidation-health", "organization": "", "repository": ""}
2022-12-08T16:28:56Z    DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Suggested desired replicas of 1 by PercentageRunnersBusy    {"replicas_desired_before": 2, "replicas_desired": 1, "num_runners": 3, "num_runners_registered": 3, "num_runners_busy": 0, "num_terminating_busy": 0, "namespace": "actions-runner-system", "kind": "runnerdeployment", "name": "evi-platform-study-dev-runner", "horizontal_runner_autoscaler": "evi-platform-study-dev-runner", "enterprise": "evidation-health", "organization": "", "repository": ""}
2022-12-08T16:28:56Z    DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Suggested desired replicas of 0 by PercentageRunnersBusy    {"replicas_desired_before": 1, "replicas_desired": 0, "num_runners": 3, "num_runners_registered": 3, "num_runners_busy": 0, "num_terminating_busy": 0, "namespace": "actions-runner-system", "kind": "runnerdeployment", "name": "evi-platform-study-dev-runner", "horizontal_runner_autoscaler": "evi-platform-study-dev-runner", "enterprise": "evidation-health", "organization": "", "repository": ""}
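To make the correlation in these lines easier to see, the structured fields at the end of each "Suggested desired replicas" entry can be pulled out and tabulated. The Python sketch below assumes each entry ends with a single JSON object, as in the lines above; the helper names are hypothetical and the sample lines are abbreviated copies of two of the entries.

```python
import json


def parse_suggestion(line):
    """Return the JSON fields of a 'Suggested desired replicas' line, or None."""
    if "Suggested desired replicas" not in line:
        return None
    start = line.find("{")
    if start == -1:
        return None
    return json.loads(line[start:])


def summarize(lines):
    """Collect (timestamp, desired_before, desired, busy, terminating_busy) rows."""
    rows = []
    for line in lines:
        fields = parse_suggestion(line)
        if fields is None:
            continue
        timestamp = line.split()[0]
        rows.append((timestamp,
                     fields["replicas_desired_before"],
                     fields["replicas_desired"],
                     fields["num_runners_busy"],
                     fields["num_terminating_busy"]))
    return rows


# Abbreviated copies of two log lines from this comment (some fields dropped).
sample = [
    '2022-12-08T16:24:07Z DEBUG actions-runner-controller.horizontalrunnerautoscaler '
    'Suggested desired replicas of 3 by PercentageRunnersBusy '
    '{"replicas_desired_before": 3, "replicas_desired": 3, "num_runners": 3, '
    '"num_runners_busy": 0, "num_terminating_busy": 1}',
    '2022-12-08T16:28:56Z DEBUG actions-runner-controller.horizontalrunnerautoscaler '
    'Suggested desired replicas of 2 by PercentageRunnersBusy '
    '{"replicas_desired_before": 3, "replicas_desired": 2, "num_runners": 3, '
    '"num_runners_busy": 0, "num_terminating_busy": 0}',
]

if __name__ == "__main__":
    for row in summarize(sample):
        print(row)
```

Feeding the full controller log through `summarize` gives one row per suggestion, so the point at which `num_terminating_busy` drops to 0 can be compared directly with the point at which `replicas_desired` starts to fall.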

mar-pan commented 1 year ago

I'm facing this issue also

mumoshu commented 1 year ago

@mar-pan Hey! Are you still using PercentageRunnersBusy? Our recommended autoscaling solution is either a webhook-based one or the new RunnerScaleSet which is currently in the beta testing phase.
