actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners

PercentageRunnersBusy does not scale up for very short (<1 minute) jobs #2711

Open · RalphSleighK opened this issue 1 year ago

RalphSleighK commented 1 year ago

Controller Version

0.23.3

Helm Chart Version

actions-runner-controller-0.23.3

CertManager Version

1.8.2

Deployment Method

Helm

cert-manager installation

Yes, not a cert-manager issue.

Resource Definitions

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: {{ .Values.runner.autoscalername }}
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: {{ .Values.runner.runnername }}
  minReplicas: 1
  maxReplicas: 40
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: '0.75'    # The percentage of busy runners at which the desired runner count is re-evaluated to scale up
    scaleDownThreshold: '0.3'   # The percentage of busy runners at which the desired runner count is re-evaluated to scale down
    scaleUpFactor: '1.5'        # The scale-up multiplier applied to the desired count
    scaleDownFactor: '0.5'      # The scale-down multiplier applied to the desired count
  scheduledOverrides:
  # Override minReplicas to 1 only between 19:00 and 06:00 UTC
  - startTime: "2023-06-16T19:00:00+00:00"
    endTime: "2023-06-17T06:00:00+00:00"
    recurrenceRule:
      frequency: Daily
    minReplicas: 1
```
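
As a rough illustration of what these settings do, here is a simplified sketch of the scale-up/scale-down arithmetic described by the comments above (illustrative only, not the controller's actual code):

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas sketches the PercentageRunnersBusy arithmetic: when the
// busy fraction crosses scaleUpThreshold, the current count is multiplied by
// scaleUpFactor (and the reverse for scale down), clamped to min/maxReplicas.
// Simplified illustration only; not the controller's actual implementation.
func desiredReplicas(current, busy int) int {
	const (
		scaleUpThreshold   = 0.75
		scaleDownThreshold = 0.3
		scaleUpFactor      = 1.5
		scaleDownFactor    = 0.5
		minReplicas        = 1
		maxReplicas        = 40
	)
	busyFraction := float64(busy) / float64(current)
	desired := current
	switch {
	case busyFraction >= scaleUpThreshold:
		desired = int(math.Ceil(float64(current) * scaleUpFactor))
	case busyFraction <= scaleDownThreshold:
		desired = int(math.Floor(float64(current) * scaleDownFactor))
	}
	if desired < minReplicas {
		desired = minReplicas
	}
	if desired > maxReplicas {
		desired = maxReplicas
	}
	return desired
}

func main() {
	fmt.Println(desiredReplicas(10, 8)) // 8/10 busy >= 0.75: scale up to 15
	fmt.Println(desiredReplicas(10, 1)) // 1/10 busy <= 0.3: scale down to 5
}
```

The key point for this issue: the busy count sampled each sync period drives everything, so if busy runners are missed between polls, no scale-up is ever triggered.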

To Reproduce

1) Queue up a large number of very short jobs against your runners (~30 seconds each)
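
One way to generate that burst is to fire workflow_dispatch events in a loop. A hypothetical sketch (OWNER, REPO, and short-job.yml are placeholders; the referenced workflow is assumed to contain a single ~30-second job targeting these runners):

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// POST /repos/{owner}/{repo}/actions/workflows/{workflow_id}/dispatches
	// queues one run per call; 200 calls approximate the test load.
	url := "https://api.github.com/repos/OWNER/REPO/actions/workflows/short-job.yml/dispatches"
	for i := 0; i < 200; i++ {
		req, err := http.NewRequest("POST", url, bytes.NewBufferString(`{"ref":"main"}`))
		if err != nil {
			panic(err)
		}
		req.Header.Set("Authorization", "Bearer "+os.Getenv("GITHUB_TOKEN"))
		req.Header.Set("Accept", "application/vnd.github+json")
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			panic(err)
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusNoContent {
			fmt.Printf("dispatch %d failed: %s\n", i, resp.Status)
		}
	}
}
```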

Describe the bug

When the jobs are very short, the runners do not scale up to meet demand. The screenshot below shows four tests:

1) 200 x 2-minute jobs
2) 200 x 1-minute jobs
3) 200 x 30-second jobs
4) 200 x 20-second jobs

(Screenshot, 2023-06-29: busy-runner counts over the four test runs.)

The two shorter cases do not scale to meet the demand. Reducing syncPeriod from the default 1 minute to 10 seconds does not help.

This leaves other jobs queued until all the short jobs are done.

Describe the expected behavior

I would expect the runners to scale up to meet the demand.

Whole Controller Logs

One sync period of the controller:

https://gist.github.com/RalphSleighK/19568ccbd7ca5927ba1c048b4c42b138

Whole Runner Pod Logs

n/a

Additional Context

From the controller logs, it looks like busy runners are being undercounted, perhaps because the list of busy runners from the GitHub API and the list of runner pods the autoscaler thinks it is managing do not overlap when jobs are very short.
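
To illustrate that hypothesis with a toy model (the intersection-style matching here is an assumption based on the logs, not ARC's actual code): if busy runners are counted by intersecting the API's busy list with the locally known pods, rapid runner churn from very short jobs shrinks the intersection toward zero.

```go
package main

import "fmt"

// busyCount models the suspected undercount: only runners that appear both in
// the GitHub API's busy list and in the pod list the autoscaler tracks are
// counted as busy. With ~30-second jobs, ephemeral runners re-register so
// quickly that the two snapshots barely overlap. Hypothetical illustration.
func busyCount(apiBusy, knownPods []string) int {
	known := make(map[string]bool, len(knownPods))
	for _, p := range knownPods {
		known[p] = true
	}
	n := 0
	for _, r := range apiBusy {
		if known[r] {
			n++
		}
	}
	return n
}

func main() {
	// Three runners are genuinely busy, but two of them replaced pods the
	// controller saw on its last sync, so the busy percentage reads 1/3.
	apiBusy := []string{"runner-abc12", "runner-def34", "runner-ghi56"}
	knownPods := []string{"runner-xyz98", "runner-def34", "runner-uvw76"}
	fmt.Println(busyCount(apiBusy, knownPods)) // 1
}
```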

I suspect this is a fairly fundamental limitation of this scaling method, and moving over to webhook-based scaling may be the solution here, but I figured it was worth opening an issue.

github-actions[bot] commented 1 year ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

sap147 commented 1 year ago

@RalphSleighK what metric are you using to graph "runners busy"? Thanks.

RalphSleighK commented 1 year ago

> @RalphSleighK what metric are you using to graph "runners busy"? Thanks.

@sap147 This is the horizontalrunnerautoscaler_runners_busy metric that the controller exposes via Prometheus, graphed in Grafana.