actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners

PercentageRunnersBusy does not scale up for very short (<1 minute) jobs #2711

Open · RalphSleighK opened this issue 1 year ago

RalphSleighK commented 1 year ago

Controller Version

0.23.3

Helm Chart Version

actions-runner-controller-0.23.3

CertManager Version

1.8.2

Deployment Method

Helm

cert-manager installation

Yes, not a cert-manager issue.

Resource Definitions

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: {{ .Values.runner.autoscalername }}
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: {{ .Values.runner.runnername }}
  minReplicas: 1
  maxReplicas: 40
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: '0.75'    # The percentage of busy runners at which the desired runner count is re-evaluated to scale up
    scaleDownThreshold: '0.3'   # The percentage of busy runners at which the desired runner count is re-evaluated to scale down
    scaleUpFactor: '1.5'        # The scale-up multiplier applied to the desired count
    scaleDownFactor: '0.5'      # The scale-down multiplier applied to the desired count
  scheduledOverrides:
  # Override minReplicas to 1 only between 19:00 and 06:00 UTC
  - startTime: "2023-06-16T19:00:00+00:00"
    endTime: "2023-06-17T06:00:00+00:00"
    recurrenceRule:
      frequency: Daily
    minReplicas: 1
```
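
As a rough illustration of what these settings do, here is a simplified sketch of the scale-up/scale-down arithmetic described by the comments above (illustrative only, not the controller's actual code):

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas sketches the PercentageRunnersBusy arithmetic: when the
// busy fraction crosses scaleUpThreshold, the current count is multiplied by
// scaleUpFactor (and the reverse for scale down), clamped to min/maxReplicas.
// Simplified illustration only; not the controller's actual implementation.
func desiredReplicas(current, busy int) int {
	const (
		scaleUpThreshold   = 0.75
		scaleDownThreshold = 0.3
		scaleUpFactor      = 1.5
		scaleDownFactor    = 0.5
		minReplicas        = 1
		maxReplicas        = 40
	)
	busyFraction := float64(busy) / float64(current)
	desired := current
	switch {
	case busyFraction >= scaleUpThreshold:
		desired = int(math.Ceil(float64(current) * scaleUpFactor))
	case busyFraction <= scaleDownThreshold:
		desired = int(math.Floor(float64(current) * scaleDownFactor))
	}
	if desired < minReplicas {
		desired = minReplicas
	}
	if desired > maxReplicas {
		desired = maxReplicas
	}
	return desired
}

func main() {
	fmt.Println(desiredReplicas(10, 8)) // 8/10 busy >= 0.75: scale up to 15
	fmt.Println(desiredReplicas(10, 1)) // 1/10 busy <= 0.3: scale down to 5
}
```

The key point for this issue: the busy count sampled each sync period drives everything, so if busy runners are missed between polls, no scale-up is ever triggered.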

To Reproduce

1) Queue up a large number of very short jobs against your runners (~30 seconds each)
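
One way to generate that burst is to fire workflow_dispatch events in a loop. A hypothetical sketch (OWNER, REPO, and short-job.yml are placeholders; the referenced workflow is assumed to contain a single ~30-second job targeting these runners):

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// POST /repos/{owner}/{repo}/actions/workflows/{workflow_id}/dispatches
	// queues one run per call; 200 calls approximate the test load.
	url := "https://api.github.com/repos/OWNER/REPO/actions/workflows/short-job.yml/dispatches"
	for i := 0; i < 200; i++ {
		req, err := http.NewRequest("POST", url, bytes.NewBufferString(`{"ref":"main"}`))
		if err != nil {
			panic(err)
		}
		req.Header.Set("Authorization", "Bearer "+os.Getenv("GITHUB_TOKEN"))
		req.Header.Set("Accept", "application/vnd.github+json")
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			panic(err)
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusNoContent {
			fmt.Printf("dispatch %d failed: %s\n", i, resp.Status)
		}
	}
}
```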

Describe the bug

When the jobs are very short, the runners do not scale up to meet demand. The screenshot below shows four tests:

1) 200 x 2-minute jobs
2) 200 x 1-minute jobs
3) 200 x 30-second jobs
4) 200 x 20-second jobs

(Screenshot, 2023-06-29: busy-runner counts over the four test runs.)

The two shorter cases do not scale to meet the demand. Reducing syncPeriod from the default 1 minute to 10 seconds does not help.

This leaves other jobs queued until all the short jobs are done.

Describe the expected behavior

I would expect the runners to scale up to meet the demand.

Whole Controller Logs

One sync period of the controller:

https://gist.github.com/RalphSleighK/19568ccbd7ca5927ba1c048b4c42b138

Whole Runner Pod Logs

n/a

Additional Context

From the controller logs, it looks like busy runners are being undercounted, perhaps because the list of busy runners from the GitHub API and the list of runner pods the autoscaler thinks it is managing do not overlap when jobs are very short.
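
To illustrate that hypothesis with a toy model (the intersection-style matching here is an assumption based on the logs, not ARC's actual code): if busy runners are counted by intersecting the API's busy list with the locally known pods, rapid runner churn from very short jobs shrinks the intersection toward zero.

```go
package main

import "fmt"

// busyCount models the suspected undercount: only runners that appear both in
// the GitHub API's busy list and in the pod list the autoscaler tracks are
// counted as busy. With ~30-second jobs, ephemeral runners re-register so
// quickly that the two snapshots barely overlap. Hypothetical illustration.
func busyCount(apiBusy, knownPods []string) int {
	known := make(map[string]bool, len(knownPods))
	for _, p := range knownPods {
		known[p] = true
	}
	n := 0
	for _, r := range apiBusy {
		if known[r] {
			n++
		}
	}
	return n
}

func main() {
	// Three runners are genuinely busy, but two of them replaced pods the
	// controller saw on its last sync, so the busy percentage reads 1/3.
	apiBusy := []string{"runner-abc12", "runner-def34", "runner-ghi56"}
	knownPods := []string{"runner-xyz98", "runner-def34", "runner-uvw76"}
	fmt.Println(busyCount(apiBusy, knownPods)) // 1
}
```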

I suspect this is a fairly fundamental limitation of this scaling method, and moving over to webhook-based scaling may be the solution here, but I figured it was worth opening an issue.

github-actions[bot] commented 1 year ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

sap147 commented 1 year ago

@RalphSleighK what metric are you using to graph "runners busy"? Thanks.

RalphSleighK commented 1 year ago

> @RalphSleighK what metric are you using to graph "runners busy"? Thanks.

@sap147 This is the horizontalrunnerautoscaler_runners_busy metric that the controller exposes via Prometheus, graphed in Grafana.