actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners

Webhook ARC Scaling issue #2306

Open uralsemih opened 1 year ago

uralsemih commented 1 year ago

Checks

Controller Version

v0.27.0

Helm Chart Version

0.22.0

CertManager Version

1.5.3

Deployment Method

Helm

cert-manager installation

Yes

Checks

Resource Definitions

$ kubectl get runnerdeployment gha-staging-8c-32gb -n staging-8c-32gb -o yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  annotations:
    meta.helm.sh/release-name: staging-8c-32gb
    meta.helm.sh/release-namespace: staging-8c-32gb
  creationTimestamp: "2022-10-26T17:57:43Z"
  generation: 368232
  labels:
    app.kubernetes.io/managed-by: Helm
  name: gha-staging-8c-32gb
  namespace: staging-8c-32gb
  resourceVersion: "94653290"
  uid: 4c1cd628-8c0a-489a-bb6b-4475cae150db
spec:
  effectiveTime: "2023-02-21T23:07:20Z"
  replicas: 5
  selector: null
  template:
    metadata: {}
    spec:
      dockerVolumeMounts:
      - mountPath: /var/lib/docker
        name: docker-extra
      dockerdContainerResources: {}
      dockerdWithinRunnerContainer: true
      env:
      - name: RUNNER_FEATURE_FLAG_EPHEMERAL
        value: "true"
      - name: DISABLE_RUNNER_UPDATE
        value: "false"
      group: secondary-staging-8c-32gb
      image: prodweuadoacr.azurecr.io/github-runner-full:latest-stable
      imagePullSecrets:
      - name: artifactory-cred
      initContainers:
      - command:
        - sh
        - -c
        - chmod -R 606 /dev/kvm
        image: busybox
        name: kvm-permission
        resources: {}
        volumeMounts:
        - mountPath: /dev/kvm
          name: kvm-device
      labels:
      - staging-8c-32gb
      - weu
      - self-hosted
      - Linux
      - linux
      - X64
      - x64
      nodeSelector:
        agentpool: stgrunneri
      organization: tomtom-internal
      resources:
        limits:
          cpu: "8"
          ephemeral-storage: 250Gi
          memory: 32Gi
        requests:
          cpu: "8"
          ephemeral-storage: 250Gi
          memory: 32Gi
      sidecarContainers:
      - command:
        - bash
        - -c
        - --
        - bash -c 'while true; do tail -f /runner/_diag/*.log; [ $? -ne 0 ] && sleep
          3; done;'
        image: bash:latest
        name: logger
        resources: {}
        volumeMounts:
        - mountPath: /runner/_diag
          name: runnerlogs
      volumeMounts:
      - mountPath: /runner/_diag
        name: runnerlogs
      - mountPath: /dev/kvm
        name: kvm-device
      volumes:
      - hostPath:
          path: /mnt/docker-extra
          type: DirectoryOrCreate
        name: docker-extra
      - hostPath:
          path: /dev/kvm
        name: kvm-device
      - emptyDir: {}
        name: runnerlogs
status:
  availableReplicas: 6
  desiredReplicas: 5
  readyReplicas: 6
  replicas: 6
  updatedReplicas: 6

---
$ kubectl get hra gha-staging-8c-32gb-autoscaler -n staging-8c-32gb -o yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  annotations:
    meta.helm.sh/release-name: staging-8c-32gb
    meta.helm.sh/release-namespace: staging-8c-32gb
  creationTimestamp: "2023-02-02T15:06:09Z"
  generation: 41513
  labels:
    app.kubernetes.io/managed-by: Helm
  name: gha-staging-8c-32gb-autoscaler
  namespace: staging-8c-32gb
  resourceVersion: "94652004"
  uid: 9ac79c7a-135b-4802-a8ca-aa45e6f47f78
spec:
  capacityReservations:
  - effectiveTime: "2023-02-21T23:05:32Z"
    expirationTime: "2023-02-21T23:35:32Z"
    replicas: 1
  - effectiveTime: "2023-02-21T23:05:32Z"
    expirationTime: "2023-02-21T23:35:32Z"
    replicas: 1
  - effectiveTime: "2023-02-21T23:07:20Z"
    expirationTime: "2023-02-21T23:37:20Z"
    replicas: 1
  maxReplicas: 200
  minReplicas: 2
  scaleDownDelaySecondsAfterScaleOut: 300
  scaleTargetRef:
    kind: RunnerDeployment
    name: gha-staging-8c-32gb
  scaleUpTriggers:
  - duration: 30m
    githubEvent:
      workflowJob: {}
status:
  desiredReplicas: 5
  lastSuccessfulScaleOutTime: "2023-02-21T23:07:20Z"

---
$ kubectl get storageclass
NAME                    PROVISIONER          RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
azurefile               file.csi.azure.com   Delete          Immediate              true                   118d
azurefile-csi           file.csi.azure.com   Delete          Immediate              true                   118d
azurefile-csi-premium   file.csi.azure.com   Delete          Immediate              true                   118d
azurefile-premium       file.csi.azure.com   Delete          Immediate              true                   118d
default (default)       disk.csi.azure.com   Delete          WaitForFirstConsumer   true                   118d
managed                 disk.csi.azure.com   Delete          WaitForFirstConsumer   true                   118d
managed-csi             disk.csi.azure.com   Delete          WaitForFirstConsumer   true                   118d
managed-csi-premium     disk.csi.azure.com   Delete          WaitForFirstConsumer   true                   118d
managed-premium         disk.csi.azure.com   Delete          WaitForFirstConsumer   true                   118d

To Reproduce

1

Describe the bug

We are facing long queue times where workflows/jobs are not able to pick up a runner. I am not sure whether the Failed to update runnerreplicaset resource and Runner pod is annotated to wait for completion, and the runner container is not restarting messages give some hints.

(screenshot: workflow jobs queued, waiting for a runner)

When we manually set or increase minReplicas, the workflows are able to pick up runners, as you can see in the picture below, but I don't think we should have to manage this manually.

(screenshot: jobs picked up after increasing minReplicas)

Here are some logs from the controller.

2023-02-21T22:05:01Z    ERROR   runnerdeployment    Failed to update runnerreplicaset resource  {"runnerdeployment": "staging-8c-32gb/gha-staging-8c-32gb", "error": "Operation cannot be fulfilled on runnerreplicasets.actions.summerwind.dev \"gha-staging-8c-32gb-mplvl\": the object has been modified; please apply your changes to the latest version and try again"}
2023-02-21T22:05:01Z    ERROR   Reconciler error    {"controller": "runnerdeployment-controller", "controllerGroup": "actions.summerwind.dev", "controllerKind": "RunnerDeployment", "RunnerDeployment": {"name":"gha-staging-8c-32gb","namespace":"staging-8c-32gb"}, "namespace": "staging-8c-32gb", "name": "gha-staging-8c-32gb", "reconcileID": "8c73a8ea-e026-427a-8d43-300cf18bd83e", "error": "Operation cannot be fulfilled on runnerreplicasets.actions.summerwind.dev \"gha-staging-8c-32gb-mplvl\": the object has been modified; please apply your changes to the latest version and try again"}
2023-02-21T22:18:25Z    ERROR   runnerreplicaset    Failed to patch owner to have actions-runner/unregistration-request-timestamp annotation    {"runnerreplicaset": "staging-05c-2gb/gha-staging-05c-2gb-zr4kl", "lastSyncTime": "2023-02-21T22:07:34Z", "effectiveTime": "2023-02-21 22:07:19 +0000 UTC", "templateHashDesired": "59bd66fdd9", "replicasDesired": 78, "replicasPending": 0, "replicasRunning": 82, "replicasMaybeRunning": 82, "templateHashObserved": ["59bd66fdd9"], "owner": "staging-05c-2gb/gha-staging-05c-2gb-zr4kl-g695w", "error": "Operation cannot be fulfilled on runners.actions.summerwind.dev \"gha-staging-05c-2gb-zr4kl-g695w\": the object has been modified; please apply your changes to the latest version and try again"}
2023-02-21T22:18:25Z    ERROR   Reconciler error    {"controller": "runnerreplicaset-controller", "controllerGroup": "actions.summerwind.dev", "controllerKind": "RunnerReplicaSet", "RunnerReplicaSet": {"name":"gha-staging-05c-2gb-zr4kl","namespace":"staging-05c-2gb"}, "namespace": "staging-05c-2gb", "name": "gha-staging-05c-2gb-zr4kl", "reconcileID": "e950f153-46ab-4ac8-aaff-b0ee11dd5f3f", "error": "Operation cannot be fulfilled on runners.actions.summerwind.dev \"gha-staging-05c-2gb-zr4kl-g695w\": the object has been modified; please apply your changes to the latest version and try again"}
2023-02-21T22:32:15Z    ERROR   runnerreplicaset    Failed to patch owner to have actions-runner/unregistration-request-timestamp annotation    {"runnerreplicaset": "staging-05c-2gb/gha-staging-05c-2gb-zr4kl", "lastSyncTime": "2023-02-21T22:29:28Z", "effectiveTime": "2023-02-21 22:07:19 +0000 UTC", "templateHashDesired": "59bd66fdd9", "replicasDesired": 75, "replicasPending": 0, "replicasRunning": 77, "replicasMaybeRunning": 77, "templateHashObserved": ["59bd66fdd9"], "owner": "staging-05c-2gb/gha-staging-05c-2gb-zr4kl-9jt2c", "error": "Operation cannot be fulfilled on runners.actions.summerwind.dev \"gha-staging-05c-2gb-zr4kl-9jt2c\": the object has been modified; please apply your changes to the latest version and try again"}
2023-02-21T22:32:15Z    ERROR   Reconciler error    {"controller": "runnerreplicaset-controller", "controllerGroup": "actions.summerwind.dev", "controllerKind": "RunnerReplicaSet", "RunnerReplicaSet": {"name":"gha-staging-05c-2gb-zr4kl","namespace":"staging-05c-2gb"}, "namespace": "staging-05c-2gb", "name": "gha-staging-05c-2gb-zr4kl", "reconcileID": "306f0438-ad62-42e0-9681-c979df6766d7", "error": "Operation cannot be fulfilled on runners.actions.summerwind.dev \"gha-staging-05c-2gb-zr4kl-9jt2c\": the object has been modified; please apply your changes to the latest version and try again"}
2023-02-21T22:46:05Z    ERROR   runnerreplicaset    Failed to patch owner to have actions-runner/unregistration-request-timestamp annotation    {"runnerreplicaset": "staging-05c-2gb/gha-staging-05c-2gb-zr4kl", "lastSyncTime": "2023-02-21T22:41:11Z", "effectiveTime": "2023-02-21 22:40:59 +0000 UTC", "templateHashDesired": "59bd66fdd9", "replicasDesired": 72, "replicasPending": 0, "replicasRunning": 74, "replicasMaybeRunning": 74, "templateHashObserved": ["59bd66fdd9"], "owner": "staging-05c-2gb/gha-staging-05c-2gb-zr4kl-7hp5c", "error": "Operation cannot be fulfilled on runners.actions.summerwind.dev \"gha-staging-05c-2gb-zr4kl-7hp5c\": the object has been modified; please apply your changes to the latest version and try again"}
2023-02-21T22:46:05Z    ERROR   Reconciler error    {"controller": "runnerreplicaset-controller", "controllerGroup": "actions.summerwind.dev", "controllerKind": "RunnerReplicaSet", "RunnerReplicaSet": {"name":"gha-staging-05c-2gb-zr4kl","namespace":"staging-05c-2gb"}, "namespace": "staging-05c-2gb", "name": "gha-staging-05c-2gb-zr4kl", "reconcileID": "4a7ff5f9-cb3b-4c1d-b304-162f0aca1623", "error": "Operation cannot be fulfilled on runners.actions.summerwind.dev \"gha-staging-05c-2gb-zr4kl-7hp5c\": the object has been modified; please apply your changes to the latest version and try again"}
2023-02-21T23:05:39Z    INFO    runnerpod   Runner pod is annotated to wait for completion, and the runner container is not restarting  {"runnerpod": "staging-8c-32gb/gha-staging-8c-32gb-mplvl-qsdn9"}
2023-02-21T23:05:39Z    INFO    runnerpod   Runner pod is annotated to wait for completion, and the runner container is not restarting  {"runnerpod": "staging-8c-32gb/gha-staging-8c-32gb-mplvl-qq4ft"}
2023-02-21T23:05:39Z    INFO    runnerpod   Runner pod is annotated to wait for completion, and the runner container is not restarting  {"runnerpod": "staging-8c-32gb/gha-staging-8c-32gb-mplvl-7v4sc"}
2023-02-21T23:05:40Z    INFO    runnerpod   Runner pod is annotated to wait for completion, and the runner container is not restarting  {"runnerpod": "staging-8c-32gb/gha-staging-8c-32gb-mplvl-zd2kh"}
...

Describe the expected behavior

Runners should be scaled up and down properly.

Whole Controller Logs

https://gist.github.com/uralsemih/3eaf224add2fe6db5102aff738b78e2c

Whole Runner Pod Logs

https://gist.github.com/uralsemih/a237fd983236b8e724e8a5ca90bf3a0c

Additional Context

We have a separate RunnerDeployment and HRA per namespace for the different runner groups, as shown below.

$ kubectl get ns
NAME                    STATUS   AGE
actions-runner-system   Active   118d
cert-manager            Active   118d
staging-05c-2gb         Active   118d
staging-1c-4gb          Active   118d
staging-2c-8gb          Active   118d
staging-4c-16gb         Active   118d
staging-8c-32gb         Active   118d
staging-8c-32gb-nav     Active   118d
$ kubectl get runnerdeployment --all-namespaces
NAMESPACE             NAME                      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
staging-05c-2gb       gha-staging-05c-2gb       67        66        66           66          118d
staging-1c-4gb        gha-staging-1c-4gb        1         1         1            1           118d
staging-2c-8gb        gha-staging-2c-8gb        2         3         3            3           118d
staging-4c-16gb       gha-staging-4c-16gb       3         3         3            3           118d
staging-8c-32gb-nav   gha-staging-8c-32gb-nav   4         4         4            4           118d
staging-8c-32gb       gha-staging-8c-32gb       5         6         6            6           118d
$ kubectl get hra --all-namespaces
NAMESPACE             NAME                                 MIN   MAX   DESIRED   SCHEDULE
staging-05c-2gb       gha-staging-05c-2gb-autoscaler       1     200   67
staging-1c-4gb        gha-staging-1c-4gb-autoscaler        1     200   1
staging-2c-8gb        gha-staging-2c-8gb-autoscaler        1     200   2
staging-4c-16gb       gha-staging-4c-16gb-autoscaler       1     200   3
staging-8c-32gb-nav   gha-staging-8c-32gb-nav-autoscaler   1     30    4
staging-8c-32gb       gha-staging-8c-32gb-autoscaler       2     200   5
github-actions[bot] commented 1 year ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

omer2500 commented 1 year ago

@uralsemih we also have issues similar to yours with webhook auto-scaling

My issue is this: webhooks are coming in and everything is green, but ARC doesn't spawn more runners to take the jobs. Sometimes it does after 10-20 minutes, and sometimes I need to re-trigger the job for it to be assigned to a runner. Scaling from 0 seems to be even worse.

Hi @mumoshu, this is the 4th or 5th issue I have seen regarding webhook scaling (no matter whether it scales from 0 or not).

Others that are related: https://github.com/actions/actions-runner-controller/issues/2073#issuecomment-1439016359 https://github.com/actions/actions-runner-controller/issues/2073#issuecomment-1436038160 https://github.com/actions/actions-runner-controller/issues/2254

I think it could be a bug in the latest version. Is there any ETA for checking the issue? Maybe someone else can help here?

Thanks for the hard work!

semihural-tomtom commented 1 year ago

Hello @omer2500, I made several changes that helped a bit.

I increased scaleUpTriggers.duration from 30m to 12h. We have a bunch of workflows that take over 4-5 hours to complete, so 30m was too short a window for ARC's webhook-based scaling.

  scaleUpTriggers:
  - duration: 12h
    githubEvent:
      workflowJob: {}

Please have a look at scheduledOverrides if you are not using it. It's a cool feature where you can define minReplicas for a specific period.

https://github.com/actions/actions-runner-controller/blob/b6515fe25c3d14f7d1dae244ca86de7e76575081/docs/automatically-scaling-runners.md#scheduled-overrides

spec:
  maxReplicas: 100
  minReplicas: 10
  scaleDownDelaySecondsAfterScaleOut: 300
  scaleTargetRef:
    kind: RunnerDeployment
    name: gha-staging-8c-32gb
  scaleUpTriggers:
  - duration: 12h
    githubEvent:
      workflowJob: {}
  scheduledOverrides:
  # each override is a list item, and startTime must come before endTime
  - startTime: "2023-02-24T22:00:00+01:00"
    endTime: "2023-02-25T00:00:00+01:00"
    minReplicas: 50
    recurrenceRule:
      frequency: Daily

I checked the codebase and I think this PR is related to what we are facing. Looking forward to seeing it in the next release. https://github.com/actions/actions-runner-controller/pull/2258

mumoshu commented 1 year ago

@uralsemih @omer2500 Hey! Thank you for reporting. I believe your issue is basically this:

This is a long-standing issue that we are going to fix via #2258, as @uralsemih has kindly mentioned! If you need a workaround today, try setting minReplicas to a large value. If you are concerned about the cost of setting a large value for it, use scheduled overrides. ARC's scheduled overrides allow you to set a large minReplicas only for your working hours, which might give you the right balance between cost and reliability.
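
For illustration, a minimal sketch of that workaround could look like the following; the resource names, times, and replica counts are placeholders, not values from this thread:

```yml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-autoscaler            # placeholder name
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: example-runnerdeployment    # placeholder name
  minReplicas: 2                      # baseline floor outside working hours
  maxReplicas: 200
  scaleUpTriggers:
  - githubEvent:
      workflowJob: {}
    duration: 12h
  scheduledOverrides:
  # Raise the floor every day between 08:00 and 18:00 (example times)
  - startTime: "2023-03-01T08:00:00+01:00"
    endTime: "2023-03-01T18:00:00+01:00"
    minReplicas: 20
    recurrenceRule:
      frequency: Daily
```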

Once #2258 is released, you can drop the workaround. However, keep in mind that ARC relies on webhooks for scaling. There's no guarantee that ARC receives every webhook event sent by GitHub; if ARC misses a status=queued workflow_job event, it misses the corresponding scale-up. So it's still your responsibility to set a correct minReplicas even after #2258. If you absolutely need to keep, say, 10 runners no matter how many webhook events ARC missed, set minReplicas to 10. #2258 should make scale-up more reliable, though, which might allow you to use a lower minReplicas with fewer issues than today.

You'd also need to properly configure the scale trigger duration, regardless of #2258. The duration needs to be considerably longer than the maximum duration of all your workflow jobs. Otherwise the runner replica added for a workflow job might "expire" and get deleted before it is actually used by any job.
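
To make that relationship concrete, here is a hedged sketch (values are illustrative, not taken from this thread) that keeps the trigger duration well above the longest job timeout:

```yml
# HorizontalRunnerAutoscaler excerpt: the capacity added per workflow_job event
# expires after `duration`, so it must outlast your longest-running job.
scaleUpTriggers:
- githubEvent:
    workflowJob: {}
  duration: "12h"          # comfortably longer than any job below
---
# Workflow excerpt: with a 6-hour job timeout, a 12h trigger duration leaves headroom.
jobs:
  build:
    runs-on: [self-hosted, staging-8c-32gb]
    timeout-minutes: 360   # 6 hours
    steps:
    - run: make build      # placeholder step
```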

emmahsax commented 1 year ago

We're still seeing some scaling issues since #2258 was released, particularly with minReplicas set to 1. Basically, my single replica correctly picks up jobs, but the deployment won't pick up all 5 jobs at once, which it should. It also doesn't seem to scale up when minReplicas is 3 and 3 jobs are already running, so it just seems like it's not really scaling up at all.

NOTE: I know that realistically we'd set minReplicas to 5 or 10 or something, just so there are always some replicas running. But I'm setting it lower for testing purposes only.

We're using this workflow to test it:

```yml
name: Run Many Jobs

on:
  workflow_dispatch:
    inputs:
      time_to_sleep:
        default: '120'
        description: How much time (in seconds) to sleep in each job
        required: false
        type: string

jobs:
  job_1:
    runs-on: [self-hosted, c5.large]
    steps:
      - run: sleep ${{ inputs.time_to_sleep }}
  job_2:
    runs-on: [self-hosted, c5.large]
    steps:
      - run: sleep ${{ inputs.time_to_sleep }}
  job_3:
    runs-on: [self-hosted, c5.large]
    steps:
      - run: sleep ${{ inputs.time_to_sleep }}
  job_4:
    runs-on: [self-hosted, c5.large]
    steps:
      - run: sleep ${{ inputs.time_to_sleep }}
  job_5:
    runs-on: [self-hosted, c5.large]
    steps:
      - run: sleep ${{ inputs.time_to_sleep }}
```
Our basic runner deployment looks like this:

```yml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: 'c5.large'
spec:
  maxReplicas: ${var.replicas.maximum}
  minReplicas: ${var.replicas.minimum}
  scaleDownDelaySecondsAfterScaleOut: 600
  scaleTargetRef:
    kind: RunnerDeployment
    name: 'c5.large'
  scaleUpTriggers:
  - duration: "60m"
    githubEvent:
      workflowJob: {}
status:
  desiredReplicas: 1
  lastSuccessfulScaleOutTime: "2023-06-02T16:42:10Z"
```

Am I missing a setting or a configuration in the horizontal runner autoscaler? Or am I incorrect and #2258 has not actually been released yet?

Controller Version

0.24.4

Helm Chart Version

0.23.3

CertManager Version

1.10.0

Deployment Method

Helm

cert-manager installation

Yes

mumoshu commented 1 year ago

@emmahsax Hey! Thanks for reporting. From what you provided, I can only say it should just work. What's your maxReplicas when it doesn't work as you expect?

If my question about maxReplicas doesn't help, I'd appreciate it if you could file a dedicated issue for your case! Our bug report form is full of important fields, which is super helpful for debugging this kind of issue.

mattdavis0351 commented 1 year ago

Still experiencing the issue myself. I have an even simpler setup than @emmahsax, as it pretty much follows the bare minimum in the documentation, and no matter what I try I can only get one job to run at a time. There is no autoscaling taking place at all.

emmahsax commented 1 year ago

I will say that I did realize I misunderstood the docs. I was reading the docs here and didn't realize that the Install with Helm section (or Install with Kustomize) was actually required, and that either way you have to tell GitHub to send webhooks 🤦🏼‍♀️.

Therefore, I'm still in the process of setting that stuff up (got distracted with other things). Who knows if that would fix things.... 🤞🏼
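
For anyone else who missed that step, the chart's optional webhook server has to be enabled and reachable from GitHub, with a repository or organization webhook sending workflow_job events to it. A rough sketch of the Helm values is below; the key names are from memory and the hostname/secret name are placeholders, so verify them against your chart version's values.yaml:

```yml
# values.yaml excerpt for the actions-runner-controller chart (sketch only;
# double-check key names against your chart version)
githubWebhookServer:
  enabled: true                      # deploy the webhook-based autoscaler server
  secret:
    create: true
    name: github-webhook-server      # placeholder secret name
    github_webhook_secret_token: ""  # must match the secret set on the GitHub webhook
  ingress:
    enabled: true                    # expose it so GitHub can deliver workflow_job events
    hosts:
    - host: arc-webhook.example.com  # placeholder hostname
      paths:
      - path: /
        pathType: Prefix
```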

c-p-b commented 1 year ago

So I've been running into problems with the webhook scaling implementation as well. I've got this deployed across several repositories right now, and I'm facing similar issues as other end users. Looking at the logs, what eventually happens each time is that the count of desired replicas somehow gets out of whack with how many jobs actually need to be scheduled. I haven't pinpointed exactly why - whether it's some miscalculation, some webhooks getting double counted, or something else.

For reference, what we do is have ARC schedule the pod, and if there isn't a machine, we autoscale one up with Karpenter - so the runner will be stuck in pending for a minute or two while AWS brings up a machine to schedule it on. I'm not sure whether the pod being stuck in pending like that contributes to the counts getting out of whack, but I also think it's related to the amount of up-and-down activity happening in the repository - we have an internal repository with significantly less traffic that scales to 0 using the webhook-based implementation, and it doesn't seem to have any issues like airbytehq/airbyte does.

I did deploy a canary build that includes https://github.com/actions/actions-runner-controller/pull/2502. It does look like it improved the situation slightly, but webhook scaling still appears to be basically unusable on its own for the amount of traffic on airbytehq/airbyte - jobs will be stuck in pending for hours waiting to get a runner when they should trigger an immediate scale-up.

What I put in yesterday as a workaround seems to have basically fixed it - I added pull-based autoscaling as well, which looks like this:

  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: "720m"
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: '1.0'      # The percentage of busy runners at which the number of desired runners is re-evaluated to scale up
    scaleDownThreshold: '0.99'   # The percentage of busy runners at which the number of desired runners is re-evaluated to scale down
    scaleUpAdjustment: 1         # The scale-up runner count added to the desired count
    scaleDownAdjustment: 1       # The scale-down runner count subtracted from the desired count

Now the pull-based scaling follows behind whenever the webhook-based scaling gets out of whack and slowly resolves the issue. It's of course not ideal, but it at least makes the implementation usable while giving us the benefit of (usually) spawning runners faster than pull-based scaling alone. I will probably end up tweaking the anti-flapping config to be a bit shorter under this setup to save a little money, since with a lot of incoming scale-ups the anti-flapping currently takes a while to bring things back down.

Nuru commented 1 year ago

@cpdeethree I have a couple of suggestions for things to look at regarding your issues with HRA not scaling up your busy runner deployments adequately. I am including issues that are probably not affecting you so that others reading this issue can benefit.

First is that HRA.spec.scaleUpTriggers[].duration needs to be set to a duration that is longer than you expect a job to run, from the time it is queued to the time the job is finished, assuming a runner is available. I see you set it to "720m", so that is probably long enough, but still check to make sure. Every time a job takes longer than this duration, your runner group will be scaled down by 1 runner, and you will be perpetually under-resourced until you hit your minimum number of runners. (And if your minimum is zero, your capacity may never recover.)

Of course, make sure that HRA.spec.maxReplicas is set high enough to accommodate all the jobs you want to run in parallel. Again, if your pull-based scaling is working, then you probably have this set adequately, too.

With that out of the way, my guess is that you are running jobs with runs-on selectors that match more than one runner group. Perhaps you are matching both a repository and an organization runner group. The webhook-based autoscaler will scale the first runner group it finds (see #2798), which may not be the one the job runs on. Every time the wrong group is scaled, you will end up with a job waiting on the queue. Then if your minimum group size is too small, it will not work through the backlog before you start getting new jobs queued up.
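
If label overlap is the culprit, one way to rule it out is to give each RunnerDeployment a label that no other pool uses and target exactly that label from the workflow. A sketch, reusing names from earlier in this thread for illustration:

```yml
# RunnerDeployment excerpt: one label that is unique to this pool
spec:
  template:
    spec:
      labels:
      - staging-8c-32gb          # not used by any other RunnerDeployment
---
# Workflow excerpt: runs-on matches exactly one pool
jobs:
  build:
    runs-on: [self-hosted, staging-8c-32gb]
    steps:
    - run: echo "runs only on the staging-8c-32gb pool"
```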

c-p-b commented 1 year ago

Following up here. I did end up changing HRA.spec.scaleUpTriggers[].duration to > 24 hours (which is what our max CI timeout was in fact set to), and that appears to have resolved the issue. It's a shame that there's no easy way to configure it so that you can't footgun yourself like that, but we are talking about two entirely disparate systems here. Maybe a simple cron that checks periodically to ensure that the two values make sense together is warranted, but that would probably have to be something that we implement ourselves.
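
Purely to illustrate that idea (nothing like this ships with ARC; the name, schedule, service account, and timeout below are all made up), such a check could be a CronJob that periodically dumps each HRA's trigger duration so it can be compared against the CI timeout:

```yml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hra-duration-check               # hypothetical name
spec:
  schedule: "0 * * * *"                  # hourly
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: hra-reader # needs RBAC to read HRAs (not shown)
          restartPolicy: Never
          containers:
          - name: check
            image: bitnami/kubectl:latest
            env:
            - name: MAX_CI_TIMEOUT
              value: "24h"               # keep in sync with your workflow timeouts
            command:
            - /bin/sh
            - -c
            - >
              kubectl get horizontalrunnerautoscalers.actions.summerwind.dev
              --all-namespaces
              -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,DURATION:.spec.scaleUpTriggers[*].duration
              && echo "Compare DURATION against MAX_CI_TIMEOUT=${MAX_CI_TIMEOUT}"
```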