actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0
4.66k stars 1.1k forks source link

Enabling both webhook and PercentageRunnersBusy for HRA leads to excessive runner pod creation #1962

Open stephanosio opened 1 year ago

stephanosio commented 1 year ago

Checks

Controller Version

0.26.0

Helm Chart Version

0.21.1

CertManager Version

1.9.1

Deployment Method

Helm

cert-manager installation

cert-manager 1.9.1 was installed using helm.

Checks

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: test-runner-deployment
spec:
  template:
    spec:
      organization: zephyrproject-rtos
      labels:
      - test-runner
      containers:
      - name: runner
        image: 724087766192.dkr.ecr.us-east-2.amazonaws.com/ci-zephyr-runner-arc-runner:latest
        resources:
          limits:
            cpu: "16.0"
            memory: "32Gi"
          requests:
            cpu: "7.5"
            memory: "12Gi"
        volumeMounts:
        - mountPath: "/repo-cache"
          name: runner-repo-cache
          readOnly: true
      - name: docker
        image: 724087766192.dkr.ecr.us-east-2.amazonaws.com/ci-zephyr-runner-arc-dind:latest
        resources:
          limits:
            cpu: "16.0"
            memory: "32Gi"
          requests:
            cpu: "7.5"
            memory: "12Gi"
        volumeMounts:
        - mountPath: "/repo-cache"
          name: runner-repo-cache
          readOnly: true
      volumes:
      - name: runner-repo-cache
        persistentVolumeClaim:
          claimName: runner-repo-cache-pvc
      nodeSelector:
        InstanceType: spot
      tolerations:
      - key: "spotInstance"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: test-runner-deployment-autoscaler
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: test-runner-deployment
  minReplicas: 0
  maxReplicas: 100
  scaleUpTriggers:
  - githubEvent:
      workflowJob: {}
    duration: "15m"
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: '0.75'
    scaleDownThreshold: '0.3'
    scaleUpAdjustment: 1
    scaleDownAdjustment: 1

To Reproduce

1. Apply the above resource definitions (create an HRA with both `workflowJob` scale-up trigger and `PercentageRunnersBusy` metric).
2. Queue 15 jobs (a workflow that launches 15 jobs)
3. Observe the controller create runner pods up to the `maxReplicas` count (in case of the provided HRA, 100).

Describe the bug

Enabling both webhook (workflowJob scale-up trigger) and PercentageRunnersBusy metric for HRA leads to excessive runner pod creation (in my case, 100 pods = maxReplicas) when many jobs (in my case, 15) are simultaneously queued:

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: test-runner-deployment-autoscaler
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: test-runner-deployment
  minReplicas: 0
  maxReplicas: 100
  scaleUpTriggers:
  - githubEvent:
      workflowJob: {}
    duration: "15m"
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: '0.75'
    scaleDownThreshold: '0.3'
    scaleUpAdjustment: 1
    scaleDownAdjustment: 1

Describe the expected behavior

When 15 jobs are queued, the runner pods are only scaled up to 15 + alpha (from PercentageRunnersBusy), not 100 (maxReplicas).

Whole Controller Logs

Some interesting excerpts:

2022-10-28T11:16:49Z    DEBUG   actions-runner-controller.runnerreplicaset  Skipped reconcilation because owner is not synced yet   {"runnerreplicaset": "default/test-runner-deployment-4zwql", "owner": "default/test-runner-deployment-4zwql-wf6j9", "pods": null}
2022-10-28T11:16:49Z    DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Suggested desired replicas of 15 by PercentageRunnersBusy   {"replicas_desired_before": 16, "replicas_desired": 15, "num_runners": 12, "num_runners_registered": 0, "num_runners_busy": 0, "num_terminating_busy": 1, "namespace": "default", "kind": "runnerdeployment", "name": "test-runner-deployment", "horizontal_runner_autoscaler": "test-runner-deployment-autoscaler", "enterprise": "", "organization": "zephyrproject-rtos", "repository": ""}
2022-10-28T11:16:49Z    DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Calculated desired replicas of 17   {"horizontalrunnerautoscaler": "default/test-runner-deployment-autoscaler", "suggested": 15, "reserved": 2, "min": 0, "max": 100}
2022-10-28T11:16:49Z    DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "c77c9f08-a13f-453a-9516-a9ba263b6284", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerdeployments"}}
2022-10-28T11:16:49Z    DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment", "code": 200, "reason": "", "UID": "c77c9f08-a13f-453a-9516-a9ba263b6284", "allowed": true}
2022-10-28T11:16:49Z    DEBUG   actions-runner-controller.runnerreplicaset  Skipped reconcilation because owner is not synced yet   {"runnerreplicaset": "default/test-runner-deployment-4zwql", "owner": "default/test-runner-deployment-4zwql-wf6j9", "pods": null}
2022-10-28T11:16:49Z    DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "6a27cde3-22de-4029-8156-115a59ec2ddc", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerdeployments"}}
2022-10-28T11:16:49Z    INFO    runnerdeployment-resource   validate resource to be updated {"name": "test-runner-deployment"}
2022-10-28T11:16:49Z    DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment", "code": 200, "reason": "", "UID": "6a27cde3-22de-4029-8156-115a59ec2ddc", "allowed": true}
2022-10-28T11:16:49Z    DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runner", "UID": "92906dc0-2188-41de-8888-66b0f3825f36", "kind": "actions.summerwind.dev/v1alpha1, Kind=Runner", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runners"}}
2022-10-28T11:16:49Z    DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runner", "code": 200, "reason": "", "UID": "92906dc0-2188-41de-8888-66b0f3825f36", "allowed": true}
2022-10-28T11:16:49Z    DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerreplicaset", "UID": "51a2871a-e102-4a81-bd0f-54bc3297b2fa", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerReplicaSet", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerreplicasets"}}
2022-10-28T11:16:49Z    DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerreplicaset", "code": 200, "reason": "", "UID": "51a2871a-e102-4a81-bd0f-54bc3297b2fa", "allowed": true}
2022-10-28T11:16:49Z    DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runner", "UID": "eb8ec9de-2960-4e35-929a-6e7a11348011", "kind": "actions.summerwind.dev/v1alpha1, Kind=Runner", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runners"}}
2022-10-28T11:16:49Z    INFO    runner-resource validate resource to be updated {"name": "test-runner-deployment-4zwql-hjs9d"}
2022-10-28T11:16:49Z    DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runner", "code": 200, "reason": "", "UID": "eb8ec9de-2960-4e35-929a-6e7a11348011", "allowed": true}
2022-10-28T11:16:49Z    DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerreplicaset", "UID": "80ad54cd-e4b0-4746-9e69-3b1f1d52a94f", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerReplicaSet", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerreplicasets"}}
2022-10-28T11:16:49Z    INFO    runnerreplicaset-resource   validate resource to be updated {"name": "test-runner-deployment-4zwql"}
2022-10-28T11:16:49Z    DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerreplicaset", "code": 200, "reason": "", "UID": "80ad54cd-e4b0-4746-9e69-3b1f1d52a94f", "allowed": true}
2022-10-28T11:16:49Z    DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "e4533542-71c0-4019-94f3-abf9fe0ddafa", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerdeployments"}}
2022-10-28T11:16:49Z    DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment", "code": 200, "reason": "", "UID": "e4533542-71c0-4019-94f3-abf9fe0ddafa", "allowed": true}
2022-10-28T11:16:49Z    DEBUG   actions-runner-controller.runnerreplicaset  Skipped reconcilation because owner is not synced yet   {"runnerreplicaset": "default/test-runner-deployment-4zwql", "owner": "default/test-runner-deployment-4zwql-wf6j9", "pods": null}
2022-10-28T11:16:50Z    DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "a4ffabe9-c625-483d-933d-f5c9e58fded3", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerdeployments"}}
2022-10-28T11:16:50Z    INFO    runnerdeployment-resource   validate resource to be updated {"name": "test-runner-deployment"}
2022-10-28T11:16:50Z    DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment", "code": 200, "reason": "", "UID": "a4ffabe9-c625-483d-933d-f5c9e58fded3", "allowed": true}
2022-10-28T11:16:50Z    DEBUG   actions-runner-controller.runnerdeployment  Updated runnerreplicaset due to spec change {"runnerdeployment": "default/test-runner-deployment", "currentDesiredReplicas": 15, "newDesiredReplicas": 16, "currentEffectiveTime": "2022-10-28 11:16:49 +0000 UTC", "newEffectiveTime": "2022-10-28 11:16:49 +0000 UTC"}
2022-10-28T11:16:50Z    DEBUG   actions-runner-controller.runnerreplicaset  Skipped reconcilation because owner is not synced yet   {"runnerreplicaset": "default/test-runner-deployment-4zwql", "owner": "default/test-runner-deployment-4zwql-cn6fv", "pods": null}
2022-10-28T11:16:50Z    DEBUG   actions-runner-controller.runnerreplicaset  Skipped reconcilation because owner is not synced yet   {"runnerreplicaset": "default/test-runner-deployment-4zwql", "owner": "default/test-runner-deployment-4zwql-7zkzh", "pods": null}
2022-10-28T11:16:50Z    INFO    actions-runner-controller.runner    Updated registration token  {"runner": "test-runner-deployment-4zwql-qvs6d", "repository": ""}
2022-10-28T11:16:50Z    DEBUG   events  Normal  {"object": {"kind":"Runner","namespace":"default","name":"test-runner-deployment-4zwql-qvs6d","uid":"1e32c7c9-2d20-4671-9483-4debccebd57c","apiVersion":"actions.summerwind.dev/v1alpha1","resourceVersion":"1042339"}, "reason": "RegistrationTokenUpdated", "message": "Successfully update registration token"}
2022-10-28T11:16:50Z    DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runner", "UID": "0e2da8cf-c718-4aa9-ab65-1eb2d720e034", "kind": "actions.summerwind.dev/v1alpha1, Kind=Runner", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runners"}}
2022-10-28T11:16:50Z    DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runner", "code": 200, "reason": "", "UID": "0e2da8cf-c718-4aa9-ab65-1eb2d720e034", "allowed": true}
2022-10-28T11:16:50Z    DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runner", "UID": "da688fcd-2a0b-4a84-bec8-ac978db0e165", "kind": "actions.summerwind.dev/v1alpha1, Kind=Runner", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runners"}}
2022-10-28T11:16:50Z    INFO    runner-resource validate resource to be updated {"name": "test-runner-deployment-4zwql-cn6fv"}
2022-10-28T11:16:50Z    DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runner", "code": 200, "reason": "", "UID": "da688fcd-2a0b-4a84-bec8-ac978db0e165", "allowed": true}
2022-10-28T11:16:50Z    DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Suggested desired replicas of 16 by PercentageRunnersBusy   {"replicas_desired_before": 17, "replicas_desired": 16, "num_runners": 12, "num_runners_registered": 0, "num_runners_busy": 0, "num_terminating_busy": 1, "namespace": "default", "kind": "runnerdeployment", "name": "test-runner-deployment", "horizontal_runner_autoscaler": "test-runner-deployment-autoscaler", "enterprise": "", "organization": "zephyrproject-rtos", "repository": ""}
2022-10-28T11:16:50Z    DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Calculated desired replicas of 18   {"horizontalrunnerautoscaler": "default/test-runner-deployment-autoscaler", "suggested": 16, "reserved": 2, "min": 0, "max": 100}
[...]
2022-10-28T11:16:53Z    DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Suggested desired replicas of 87 by PercentageRunnersBusy   {"replicas_desired_before": 88, "replicas_desired": 87, "num_runners": 30, "num_runners_registered": 0, "num_runners_busy": 0, "num_terminating_busy": 1, "namespace": "default", "kind": "runnerdeployment", "name": "test-runner-deployment", "horizontal_runner_autoscaler": "test-runner-deployment-autoscaler", "enterprise": "", "organization": "zephyrproject-rtos", "repository": ""}
2022-10-28T11:16:53Z    DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Calculated desired replicas of 94   {"horizontalrunnerautoscaler": "default/test-runner-deployment-autoscaler", "suggested": 87, "reserved": 7, "min": 0, "max": 100}
2022-10-28T11:16:53Z    DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runner", "UID": "7ddaa57d-73ef-4324-804a-9e39d3c4d2b7", "kind": "actions.summerwind.dev/v1alpha1, Kind=Runner", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runners"}}
2022-10-28T11:16:53Z    DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runner", "code": 200, "reason": "", "UID": "7ddaa57d-73ef-4324-804a-9e39d3c4d2b7", "allowed": true}
2022-10-28T11:16:53Z    DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runner", "UID": "6bf91cda-f7bd-43f7-bba5-ce9a6cb01feb", "kind": "actions.summerwind.dev/v1alpha1, Kind=Runner", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runners"}}
2022-10-28T11:16:53Z    INFO    runner-resource validate resource to be created {"name": "test-runner-deployment-4zwql-7fp4j"}
2022-10-28T11:16:53Z    DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runner", "code": 200, "reason": "", "UID": "6bf91cda-f7bd-43f7-bba5-ce9a6cb01feb", "allowed": true}
2022-10-28T11:16:53Z    DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerreplicaset", "UID": "912f6584-44ef-4f8d-90ac-e82071e0db60", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerReplicaSet", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerreplicasets"}}
2022-10-28T11:16:53Z    DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerreplicaset", "code": 200, "reason": "", "UID": "912f6584-44ef-4f8d-90ac-e82071e0db60", "allowed": true}
2022-10-28T11:16:53Z    DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerreplicaset", "UID": "b0bbc118-ab55-48f5-8a3d-a048d8699f3d", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerReplicaSet", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerreplicasets"}}
2022-10-28T11:16:53Z    INFO    runnerreplicaset-resource   validate resource to be updated {"name": "test-runner-deployment-4zwql"}
2022-10-28T11:16:53Z    DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerreplicaset", "code": 200, "reason": "", "UID": "b0bbc118-ab55-48f5-8a3d-a048d8699f3d", "allowed": true}
2022-10-28T11:16:53Z    DEBUG   actions-runner-controller.runnerdeployment  Updated runnerreplicaset due to spec change {"runnerdeployment": "default/test-runner-deployment", "currentDesiredReplicas": 82, "newDesiredReplicas": 88, "currentEffectiveTime": "2022-10-28 11:16:52 +0000 UTC", "newEffectiveTime": "2022-10-28 11:16:52 +0000 UTC"}
2022-10-28T11:16:53Z    INFO    actions-runner-controller.runner    Updated registration token  {"runner": "test-runner-deployment-4zwql-224sq", "repository": ""}
2022-10-28T11:16:53Z    DEBUG   events  Normal  {"object": {"kind":"Runner","namespace":"default","name":"test-runner-deployment-4zwql-224sq","uid":"64811958-5728-44ae-9afd-20695e4f7a35","apiVersion":"actions.summerwind.dev/v1alpha1","resourceVersion":"1042755"}, "reason": "RegistrationTokenUpdated", "message": "Successfully update registration token"}
2022-10-28T11:16:53Z    DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "d9cf541c-40e1-42dc-82d2-a20f0ae3a544", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerdeployments"}}
2022-10-28T11:16:53Z    DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment", "code": 200, "reason": "", "UID": "d9cf541c-40e1-42dc-82d2-a20f0ae3a544", "allowed": true}
2022-10-28T11:16:53Z    DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "7d6e0ef4-378e-4f37-b4bd-1e77ef138b27", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerdeployments"}}
2022-10-28T11:16:53Z    INFO    runnerdeployment-resource   validate resource to be updated {"name": "test-runner-deployment"}
2022-10-28T11:16:53Z    DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment", "code": 200, "reason": "", "UID": "7d6e0ef4-378e-4f37-b4bd-1e77ef138b27", "allowed": true}
2022-10-28T11:16:53Z    DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Suggested desired replicas of 93 by PercentageRunnersBusy   {"replicas_desired_before": 94, "replicas_desired": 93, "num_runners": 31, "num_runners_registered": 0, "num_runners_busy": 0, "num_terminating_busy": 1, "namespace": "default", "kind": "runnerdeployment", "name": "test-runner-deployment", "horizontal_runner_autoscaler": "test-runner-deployment-autoscaler", "enterprise": "", "organization": "zephyrproject-rtos", "repository": ""}
2022-10-28T11:16:53Z    DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Calculated desired replicas of 100  {"horizontalrunnerautoscaler": "default/test-runner-deployment-autoscaler", "suggested": 93, "reserved": 7, "min": 0, "max": 100}

For full log, please see https://gist.github.com/stephanosio/68e9604948d827cf5a86ba8b2b152665

The errors at the end of log ("OOM killed?", ...) are because I deleted the deployment to quickly get rid of the 100 runner pods, so they can be ignored.

Whole Runner Pod Logs

100 pods are created, so not really feasible to provide the logs for every one of them ...

Additional Context

I have also tried:

p.s. the reason that I am trying to enable both webhook and PercentageRunnersBusy, as opposed to only using webhook, is that runners sometimes get terminated after start-up for some unknown reason and some jobs end up remaining in the queued status indefinitely, and using PercentageRunnersBusy will at least ensure that such jobs are scheduled by launching additional runners if all runners are busy. Is there a better existing way to address this issue -- for instance, something that periodically checks if there are queued jobs and launches additional runners? -- UPDATE: For now, I am just using TotalNumberOfQueuedAndInProgressWorkflowRuns-only configuration. That seems to be the most stable and robust solution (PercentageRunnersBusy was over-allocating the runners too much).

p.s. 2. webhook + TotalNumberOfQueuedAndInProgressWorkflowRuns configuration seems to have the same issue.

github-actions[bot] commented 1 year ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

royerhq commented 1 year ago

+1

mumoshu commented 1 year ago

I was struggling to fix this for some time! My current understanding of the problem says that this can be a fundamental design issue which can't be fixed. In short, you'd better avoid using both pull-based and webhook-based scaling at the same time, if that works for you.

Use the webhook-based scaling only and do set a higher minReplicas if you needed to use pull-based scaling to maintain some runners while the webhook-based scaling delete runner pods aggressively.

@royerhq What's your issue? Why did you have to use pull-based scaling along with webhook-based scaling?

stephanosio commented 1 year ago

My current understanding of the problem says that this can be a fundamental design issue which can't be fixed. In short, you'd better avoid using both pull-based and webhook-based scaling at the same time, if that works for you.

I assumed this was a supported configuration because the documentation mentions "PercentageRunnersBusy + Webhook-based autoscaling."

I suppose it was initially working but some later change(s) broke it?

genisd commented 1 year ago

I used to try this myself, because of a race condition (bug long fixed) where 4 workers were needed but only 3 were booted leading to starvation. And indeed the combination yielded unexpected * 2 scaleups.

But nowadays the webhook scaling is much better, dare I say it rock solid (our current experience). We run PR builds where we need some 30 odd workers + 4 big workers and we're having a splendid time right now.

So perhaps you had similar issues with past versions, but it works really well right now and might be worth reevaluating.

stephanosio commented 1 year ago

So perhaps you had similar issues with past versions, but it works really well right now and might be worth reevaluating.

I have tried webhook scaling-only configuration with the latest ARC release (2.6.0 0.26.0 as of now) and stress tested it extensively -- it seemed to work ok for most of the time; but, there were some occasions where it just broke down and stopped scaling until all the running workflows got completed; even re-delivering the "queued" webhooks many times would not make it launch a new instance.

I have not done an in-depth analysis of this; but, in general, it seemed to occur when many workflow runs get cancelled and restarted (e.g. when PRs are force-pushed).

genisd commented 1 year ago

I have not done an in-depth analysis of this; but, in general, it seemed to occur when many workflow runs get cancelled and restarted (e.g. when PRs are force-pushed).

This likely is it. We don't limit concurrency and abort previous commit runs, but the few times I've done it manually I can confirm what you're describing. Not sure where exactly it's going wrong.

mumoshu commented 1 year ago

Probably you might see a lot of workflow_job events whose statuses are cancelled flowing into ARC's webhook server when it happens.

ARC's webhook scaling doesn't correlate a cancel workflow_job events with its corresponding queued workflow_job event that triggered a scale-up. That means, if you've instructed GitHub Actions to send us a lot of canceled workflow_job events (by manually cancelling the workflow run and the jobs), it can "hide" queued workflow_job events that are concurrently sent to ARC's webhook server by rerunning a workflow job.

It would get worse if you had a low minReplicas value ot a low maxReplicas value in your HRA spec. A short-term solution to alleviate your issue would be to make minReplicas large enough so that you can keep N runners available in the worst case. Otherwise, try setting maxReplicas larger enough than the largest number of concurrently triggered jobs. For example, if you have a 10x10 build matrix that's the largest in all of your workflow definitions, set maxReplicas to N + 100 so that even if you canceled the 10x10 jobs it would still leave N runners available for immediate use.

mumoshu commented 1 year ago

I assumed this was a supported configuration because the documentation mentions "PercentageRunnersBusy + Webhook-based autoscaling."

Note that even though the documentation can be seen as saying the combination of the two is useful when you want to scale-from-zero while using webhook-based autoscaling, it's not necessary today.

I remember that GitHub Actions enhanced their backend to NOT immediately mark the job as failed when there were no runners available at the time when the job is triggered. I thought there was a GitHub blog post that explains the new behavior but I can't find the URL.

mumoshu commented 1 year ago

I'll re-read through our documentation and try to update it not to recommend combining pull-based with webhook-based autoscaling.

damyan90 commented 1 year ago

Unfortunately I have to +1 the issue with webhook scaling. I have it set to scale between 0 and 20 replicas and it sometimes fails to create any runners at all, which results in some of my jobs being queued forever. I thought previously that it's caused by those "cancelled" jobs, but it happens also without them. Using 0.26 version.

Let me know if I can help troubleshooting it somehow.

Edit: Here's something interresting. When looking at the HPA the desiredReplicas is set already to proper number (9), but scaling does not happen (as seen on the below tile) 2022-12-05_13-29

mumoshu commented 1 year ago

@damyan90 Hey. You're always welcome to fill in our bug report form with full details! But again, we've already updated our docs not to recommend enabling both pull and webhook-based scaling simultaneously. Only use webhook-based scaling with enough minReplicas for maximum reliability.

damyan90 commented 1 year ago

@mumoshu I saw the docs. In the screenshots it's scaling based on metric only, since I was trying to mitigate the issue today. I will observe it for the next few days, with just a webhook scaling and report it if it occurs more frequent, ideally in some repetitive scenarios.

Well, I'm against setting minReplicas to overcome the issue here. I've got pipelines which require at least 4 of such runners and if more are triggerred then such scaling doesn't make sense at all from the costs perspective.

I would highly appreciate a reliable scaling from 0 as I currently have it set.

mumoshu commented 1 year ago

You might have already figured out it but then you can just set minReplicas to 0 with only webhook-based scaling enabled. I recommend setting it no non-zero for availability in the worst case like you canceled a huge matrix of workflow jobs and it resulted in GitHub Actions sending a lot of workflow_job status=canceled webhook events which prevents ARC from scaling for some time. If you can deal with that, minReplicas=0 would be the best for your use-case.

damyan90 commented 1 year ago

it resulted in GitHub Actions sending a lot of workflow_job status=canceled webhook events which prevents ARC from scaling for some time

Do we know why that happens? Also, what does the "some time" mean in this case? I'm curious if this is ever going to be changed too, so that those cancellation don't cause a scaling issue for the ARC. Thanks!

mumoshu commented 1 year ago

It happens because GitHub Actions and their webhook functionality is designed that way. ARC basically scales down the target RunnerDeployment by 1 on each workflow_job event of completion/canceled webhook event received. If GitHub delayed sending any webhooks, the autoscaling operation is delayed. It won't change in short term.

damyan90 commented 1 year ago

I'm not so sure that's an issue with how GitHub sends the cancelled/completion webhook.

I just observed that after having 1 job cancelled an hour ago, it couldn't schedule any agents for the next 30mins at least. My current setup for scaling is:

Maybe it's worth mentioning that I have 2 nodepools in the cluster, one for each runner and it happens mostly on the one where nodepool is set to be scaled down to 0.

I can open a separate issue for that, just let me know what kind of information would be helpful for you to check it. I'll try to capture as much as possible during such behaviour.

mumoshu commented 1 year ago

I just observed that after having 1 job cancelled an hour ago, it couldn't schedule any agents for the next 30mins

@damyan90 Is it so even after you manually rerun any workflow run that is stuck pending? Also, can you probably see some webhook deliveries not being successful in your repository/organization/app's Webhook history page on GitHub?

damyan90 commented 1 year ago

I would have to cancel it first and then re-run?

Let's continue maybe here: https://github.com/actions-runner-controller/actions-runner-controller/issues/2073

Nuru commented 1 year ago

@mumoshu I believe the problem here is that in suggestReplicasByPercentageRunnersBusy, it looks at how many runners are running (including the runners reserved by the webhook autoscaler) and, when scaling up, includes them in the number of suggested replicas. However, the suggest functions are only supposed to suggest the number of metrics-based runners to run, which then gets added to the number reserved by the webhook autoscaler to come up with the final number to set as newDesiredReplicas.

Not quite sure how you want to fix this. I think I would reduce desiredReplicasBefore and numRunnersBusy+numTerminatingBusy by len(hra.Spec.CapacityReservations) before doing any computations with them.

adkafka commented 11 months ago

@Nuru I came to the same conclusion. I think subtracting the number of reservations from this line: https://github.com/actions/actions-runner-controller/blob/b511953df7a8a752faf0a0f840bd832e02796c10/controllers/actions.summerwind.net/autoscaling.go#L328

would help. Then, in the caller, it would re-add this value, and we'd be all set.