Closed: damyan90 closed this issue 1 year ago.
Hello! Thank you for filing an issue.
The maintainers will triage your issue shortly.
In the meantime, please take a look at the troubleshooting guide for bug reports.
If this is a feature request, please review our contribution guidelines.
I'm experiencing a very similar issue. In my case I have minReplicas: 0
and I'm trying to scale up from 0, since these particular runners are quite large and we don't want to keep idle ones running.
I would have to cancel it first and then re-run?
Exactly. It should have worked if you had done so.
Well, even so. Can you imagine the reaction of developers who are supposed to do that for their CI pipelines whenever that thing gets stuck? I would be frustrated ;)
Of course, I can 😄 However, my conclusion was that, as a "developer" who uses ARC, I'd never want to run another system like Memcached, Redis, MySQL, etc., just to solve that problem. It isn't only about a code enhancement. But that's not the end of the world. Have you already read the announcement GitHub made on ARC's Discussions page?
Just did! That's awesome! ;)
I have the same problem. What I'm seeing, though, is that for matrix jobs it's not dispatching an event for each matrix item, which would be more of a GitHub platform issue.
I have the same issue, and ARC in webhook mode is impossible to use with matrix builds at all. I am considering backing the webhook with a second pull-based controller, so I can get the best of both worlds. I wonder if anyone has tried that.
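For anyone considering that hybrid setup, a rough sketch of what it could look like is below. This assumes your ARC version accepts both pull-based metrics and webhook scaleUpTriggers on the same HorizontalRunnerAutoscaler (worth double-checking against the docs for your release); the names example-autoscaler, example-deployment, and myorg/myrepo are placeholders.

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-autoscaler        # placeholder name
spec:
  scaleTargetRef:
    name: example-deployment      # placeholder RunnerDeployment
  minReplicas: 0
  maxReplicas: 20
  # Webhook-driven path: adds a capacity reservation per workflow_job event
  # for a fast reaction to newly queued jobs.
  scaleUpTriggers:
  - githubEvent:
      workflowJob: {}
    duration: "30m"
  # Pull-based safety net: periodically counts queued/in-progress runs via the
  # GitHub API, catching jobs whose webhook delivery was missed or dropped.
  metrics:
  - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
    repositoryNames:
    - myorg/myrepo                # placeholder repository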
@mumoshu @damyan90 We are experiencing the same issue; did you solve it? It seems like an existing bug: we do see a webhook event (100% delivery success) for every matrix item and no errors anywhere, but where I expect 20 runners to spawn, only 3-4 get created and the rest stay queued for a long time.
UPDATE: We had 2 replicas for the webhook server; after reducing it to 1, everything is working as it should.
Might it be a race condition? What do you think?
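For reference, the webhook server's replica count comes from the Helm chart values; a minimal override might look like the snippet below, assuming the chart exposes it as githubWebhookServer.replicaCount (verify against values.yaml for your chart version).

githubWebhookServer:
  enabled: true
  # Run a single webhook server pod to rule out duplicate or competing
  # capacityReservation patches from multiple replicas.
  replicaCount: 1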
Not sure. I have 2 replicas for the webhook as well as for the controller, and it works fine for now. I don't see too many issues, but I'm also not downscaling to 0, so it might be that I'm just hiding the issue a bit. Waiting for new versions, though. There's been some development together with GitHub on the autoscaling matter, for now available only to some beta testers. I suppose these issues were addressed there too.
Thanks for the info @damyan90. Actually, after more testing, even 1 webhook server pod didn't solve the issue.
Imagine we have a matrix job with 20 tasks: only 10 of them scale runners up, while the other 10 just hang, or some of them might eventually run; it's a bit random. In my case scaling down to zero is a must because I'm using expensive machines. I hope they solve it soon.
Scaling with the webhook from 0 or 1 replicas is not working for us either.
controller version: 0.27.0
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: runner56-deployment-autoscaler
spec:
  scaleDownDelaySecondsAfterScaleOut: 600
  scaleTargetRef:
    name: runner56-deployment
  minReplicas: 0
  maxReplicas: 3
  scaleUpTriggers:
  - githubEvent:
      workflowJob: {}
    duration: "30m"
Deliveries on GitHub are green. How can we debug it further?
We are experiencing a very similar issue.
The workflow is queued, the webhook is sent from GitHub, and github-webhook-server receives the request and tries to patch the HRA. See the logs below:
2023-03-03T14:01:16Z DEBUG controllers.webhookbasedautoscaler Patching hra my-failing-hra for capacityReservations update {"before": 0, "expired": -1, "added": 1, "completed": 0, "after": 1}
It says "before": 0, but in fact there is already a runner (pod) up and running a job, and a new runner/pod is not being created.
So, in the end, all subsequent attempts to scale the RunnerDeployment fail until the job finishes on the existing "unrecognized" runner.
@mumoshu, any ideas what can be the root cause?
Thanks in advance!
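For context on what those numbers refer to: judging by the log message, "before"/"after" appear to count the HRA's capacity reservations rather than the runner pods actually running, so an already-busy runner would not show up there. A reservation entry added by the webhook server looks roughly like this in the HRA spec (field names as in the actions.summerwind.dev/v1alpha1 CRD, values illustrative):

spec:
  capacityReservations:
  - expirationTime: "2023-03-03T14:31:16Z"   # event time plus the scaleUpTrigger duration
    replicas: 1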
Hi,
Having the same issue as above. It would be really nice if someone could take a look and resolve it, or at least move it forward.
Hey @omer2500 @adziura-ledger @memdealer @shirkevich! I can't say for sure without seeing the full logs of your ARC controller-manager and runner pods, but my theory is that your issue is the same as https://github.com/actions/actions-runner-controller/issues/2306#issuecomment-1445492738.
That is, ARC stops scaling up once it reaches maxReplicas. It happens more often when there is a large gap between the actual maximum number of concurrently queued jobs and maxReplicas.
Imagine you have a 10x10 matrix in one of your workflow jobs: that's 100 concurrent jobs, so I guess you are affected by the issue whenever your maxReplicas is lower than that.
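If that is indeed the cause, an interim mitigation that follows from the explanation above is to keep maxReplicas at or above the largest fan-out you expect to queue at once (values below are illustrative):

spec:
  minReplicas: 0
  maxReplicas: 100   # >= the 100 jobs produced by a 10x10 matrix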
It's already fixed in our main branch via #2258. So I would very much appreciate it if you could test it by building an ARC docker image from the head of our main branch and deploying it in your test environment!
@mumoshu I installed it about a week ago on our production env, actually (we are in the middle of moving to GitHub Actions, so it's fine). I managed to spin up 52 runners for 52 jobs coming from 10 pull requests; minReplicas was 1 and maxReplicas was 80, and it looks good so far!
I will try it with scaling from zero and report back.
Checks
Controller Version
0.26.0
Helm Chart Version
0.21.1
CertManager Version
v1.10.1
Deployment Method
Helm
cert-manager installation
source: https://charts.jetstack.io
Values: standard
helm upgrade --install
Checks
Resource Definitions
To Reproduce
Describe the bug
Scaling sometimes gets stuck for no visible reason. Editing horizontalrunnerautoscalers.actions.summerwind.dev and forcing minReplicas to be higher than the current value also doesn't take effect during that period. There are no errors in any ARC component. Jobs are pending in the Queued state for up to 1 hour during the day.
Describe the expected behavior
Runners are scaled up/down based on the number of queued jobs in GitHub workflows.
Whole Controller Logs
Whole Runner Pod Logs
Additional Context
Webhook delivery: https://user-images.githubusercontent.com/24733538/205882331-086724a1-48b7-4e3f-ae77-23d35f959d02.png - 100% successful.
Runner's definition: