RalphSleighK opened this issue 1 year ago
Checks
Controller Version
0.23.3
Helm Chart Version
actions-runner-controller-0.23.3
CertManager Version
1.8.2
Deployment Method
Helm
cert-manager installation
Yes, not a cert-manager issue.
Checks
Resource Definitions
```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: {{ .Values.runner.autoscalername }}
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: {{ .Values.runner.runnername }}
  minReplicas: 1
  maxReplicas: 40
  metrics:
```
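The metrics list is cut off above. Given the busy-runner discussion further down, this is presumably a PercentageRunnersBusy configuration; a typical one would look like the following, with illustrative thresholds and factors rather than the reporter's actual values:

```yaml
  metrics:
  - type: PercentageRunnersBusy   # scale on the fraction of runners marked busy
    scaleUpThreshold: '0.75'      # scale up when more than 75% are busy
    scaleDownThreshold: '0.25'    # scale down when fewer than 25% are busy
    scaleUpFactor: '2'            # double the replica count on scale-up
    scaleDownFactor: '0.5'        # halve it on scale-down
```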
To Reproduce
Describe the bug
When the jobs are very short, the runners do not scale up to meet demand. The screenshot shows four tests:

1. 200 × 2-minute jobs
2. 200 × 1-minute jobs
3. 200 × 30-second jobs
4. 200 × 20-second jobs

In the two shorter cases the runners do not scale to meet demand, and changing syncPeriod from the default 1 minute to 10 seconds does not help. This leads to other jobs seeing queue times until all the short jobs are done.
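The report does not include the test workflow itself, but a minimal sketch approximating the shortest case (200 × 20-second jobs) could look like the one below. The workflow name, the job names, the jq-generated matrix, and the self-hosted label are assumptions for illustration, not the reporter's setup:

```yaml
# Hypothetical reproduction, not the reporter's workflow: fan out
# 200 jobs that each sleep ~20 seconds on the self-hosted runners.
name: short-job-load-test
on: workflow_dispatch

jobs:
  setup:
    runs-on: ubuntu-latest
    outputs:
      ids: ${{ steps.gen.outputs.ids }}
    steps:
      - id: gen
        # Emit a JSON array [0,1,...,199] to drive the matrix below.
        run: echo "ids=$(jq -cn '[range(200)]')" >> "$GITHUB_OUTPUT"

  burst:
    needs: setup
    runs-on: self-hosted  # assumed label served by the RunnerDeployment
    strategy:
      matrix:
        id: ${{ fromJSON(needs.setup.outputs.ids) }}
    steps:
      - run: sleep 20
```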
Describe the expected behavior
I would expect the runners to scale up to meet the demand.
Whole Controller Logs
Whole Runner Pod Logs
Additional Context
From the controller logs, it looks like we are undercounting busy runners, perhaps because the list of busy runners returned by the GitHub API and the list of runner pods the autoscaler thinks it is managing do not overlap in the case of very short jobs.
I suspect this issue is a fairly fundamental limitation of this scaling method, and that moving over to webhook-based scaling may be the solution here, but I figured it was worth opening.
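For reference, webhook-based scaling in this controller replaces the polled metrics with scaleUpTriggers that react to workflow_job events. A minimal sketch, assuming the same Helm values as the resource definition above and that the controller's GitHub webhook server is installed and receiving events:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: {{ .Values.runner.autoscalername }}
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: {{ .Values.runner.runnername }}
  minReplicas: 1
  maxReplicas: 40
  # Add one replica per queued workflow_job event instead of
  # polling the GitHub API every syncPeriod.
  scaleUpTriggers:
  - githubEvent:
      workflowJob: {}
    duration: "5m"  # each scale-up is reverted after this TTL
```

Since each queued job reserves capacity immediately, scale-up should no longer depend on the poll catching runners in the busy state, which is the window that very short jobs slip through.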
Comments

@RalphSleighK what metric are you using to graph "runners busy"? Thanks.

@sap147 This is the horizontalrunnerautoscaler_runners_busy metric the controller exposes via Prometheus, graphed in Grafana.