uralsemih opened this issue 1 year ago
Hello! Thank you for filing an issue.
The maintainers will triage your issue shortly.
In the meantime, please take a look at the troubleshooting guide for bug reports.
If this is a feature request, please review our contribution guidelines.
@uralsemih we also have issues similar to yours with webhook auto-scaling.
My issue is this: the webhooks are coming in and everything looks green, but ARC doesn't spawn more runners to take the jobs. Sometimes it does after 10-20 minutes, and sometimes I need to re-trigger the job for it to be assigned to a runner. Scaling from 0 seems to be even worse.
Hi @mumoshu This is the fourth or fifth issue I've seen regarding webhook scaling (no matter if it's from 0 or not).
Others that are related:
https://github.com/actions/actions-runner-controller/issues/2073#issuecomment-1439016359
https://github.com/actions/actions-runner-controller/issues/2073#issuecomment-1436038160
https://github.com/actions/actions-runner-controller/issues/2254
I think it could be a bug in the latest version. Is there any ETA for looking into the issue? Maybe someone else can help here?
Thanks for the hard work!
Hello @omer2500 I made several changes which helped a bit.
I increased scaleUpTriggers.duration to 12h (it was previously 30m). We have a bunch of workflows that take over 4-5 hours to complete, so 30m was too short a window for ARC's scaling.
scaleUpTriggers:
- duration: 12h
  githubEvent:
    workflowJob: {}
Please have a look at scheduledOverrides if you are not using it already. It's a cool feature where you can define minReplicas for a specific period.
spec:
  maxReplicas: 100
  minReplicas: 10
  scaleDownDelaySecondsAfterScaleOut: 300
  scaleTargetRef:
    kind: RunnerDeployment
    name: gha-staging-8c-32gb
  scaleUpTriggers:
  - duration: 12h
    githubEvent:
      workflowJob: {}
  scheduledOverrides:
  - startTime: "2023-02-25T00:00:00+01:00"
    endTime: "2023-02-24T22:00:00+01:00"
    minReplicas: 50
    recurrenceRule:
      frequency: Daily
I checked the codebase and I think this PR is related to what we are facing. Looking forward to seeing it in the next release. https://github.com/actions/actions-runner-controller/pull/2258
@uralsemih @omer2500 Hey! Thank you for reporting. I believe your issue is basically this:
This is a long-standing issue that we are going to fix via #2258, as @uralsemih has kindly mentioned! If you need a workaround today, try setting minReplicas to a large value. If you are concerned about the cost of setting a large value for it, use scheduled overrides. ARC's scheduled overrides allow you to set a large minReplicas only for your working hours, which might give you the right balance between cost and reliability.
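As a rough sketch of that workaround (the time window, time zone, and replica counts below are illustrative, not a recommendation), a Daily recurrence can raise minReplicas only during working hours:

spec:
  minReplicas: 1            # baseline outside working hours
  maxReplicas: 40
  scheduledOverrides:
  # Keep a larger warm pool from 09:00 to 18:00 every day.
  - startTime: "2023-03-01T09:00:00+01:00"
    endTime: "2023-03-01T18:00:00+01:00"
    minReplicas: 20
    recurrenceRule:
      frequency: Daily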
Once #2258 is released, you can omit the workaround. However, keep in mind that ARC relies on webhooks for scaling, and there's no guarantee that ARC receives every webhook event sent by GitHub. If ARC misses a status=queued workflow_job event, it misses the corresponding scale-up anyway. That said, it's still your responsibility to set a correct minReplicas even after #2258. If you absolutely need to keep, say, 10 runners no matter how many webhook events ARC missed, set minReplicas to 10. #2258 will make scale-up more reliable, though, which might allow you to use a lower minReplicas with fewer issues than today.
You'd also need to configure the scale trigger duration properly, regardless of #2258. The duration needs to be considerably longer than the maximum duration of all your workflow jobs. Otherwise the runner replica added for a workflow job might "expire" and get deleted before the runner is actually used by any job.
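For example (the 8h value and the assumption of a roughly 5-hour longest job are made up for illustration), the trigger duration should comfortably exceed the longest job:

scaleUpTriggers:
- githubEvent:
    workflowJob: {}
  # Longest workflow job takes roughly 5h from queued to finished, so the
  # replica added for it must not expire before then.
  duration: "8h"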
We're still seeing some scaling issues since #2258 was released, particularly if we have minReplicas set to 1. Basically, my single replica correctly picks up jobs, but it won't pick up all 5 jobs at once, which it should. It also doesn't seem to scale up when there are already 3 jobs running against the minReplicas worth of runners, so it just seems like it's not really scaling up at all.
NOTE: I know that realistically we'd set minReplicas to 5 or 10 or something, just so there are always some replicas running. But I'm setting it lower for testing purposes only.
Am I missing a setting or a configuration in the horizontal runner autoscaler? Or am I incorrect and #2258 has not actually been released yet?
Controller Version
0.24.4
Helm Chart Version
0.23.3
CertManager Version
1.10.0
Deployment Method
Helm
cert-manager installation
Yes
@emmahsax Hey! Thanks for reporting. From what you provided, I can only say it should just work. What's your maxReplicas when it doesn't work as you expect?
If my question about maxReplicas doesn't help, I'd appreciate it if you could file a dedicated issue for your case! Our bug report form is full of important fields, which is super helpful for debugging this kind of issue.
Still experiencing the issue myself. I have an even simpler setup than @emmahsax, as it pretty much follows the bare minimum in the documentation, and no matter what I try I can only get one job to run at a time. There is no autoscaling taking place at all.
I will say that I did realize I misunderstood the docs. I was reading from the docs here, and didn't realize that the Install with Helm portion (or the Install with Kustomize one) was actually required, and that either way you have to tell GitHub to send webhooks 🤦🏼‍♀️.
Therefore, I'm still in the process of setting that stuff up (got distracted with other things). Who knows if that would fix things.... 🤞🏼
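Roughly, enabling the webhook server in the controller's Helm values looks something like this (the hostname is a placeholder, and exact field names can differ between chart versions, so double-check the chart's values.yaml):

githubWebhookServer:
  enabled: true
  # GitHub must be able to reach this endpoint; exposing it through an
  # Ingress is one common option.
  ingress:
    enabled: true
    hosts:
      - host: arc-webhook.example.com   # placeholder hostname
        paths:
          - path: /
            pathType: Prefix

On the GitHub side, an organization or repository webhook then has to be pointed at that endpoint and subscribed to workflow job events.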
So I've been running into problems with the webhooks scaling implementation as well. I've got this deployed across several repositories right now, and I'm facing similar issues as other end users. Looking at the logs, what eventually happens each time is that the count of desired replicas somehow gets out of whack with how many runners actually need to be scheduled. I haven't pinpointed exactly why, whether it's some miscalculation, some webhooks getting double counted, or something else.
For reference, what we do is have ARC schedule the pod, and if there isn't a machine, we autoscale it up with Karpenter, so the runner will be stuck in pending for a minute or two while AWS brings up a machine to schedule it on. I'm not sure whether the pod being stuck in pending like that contributes to counts getting out of whack, but I also think it's related to the amount of up and down activity happening in the repository. We have an internal repository with significantly less traffic that scales to 0 using the webhooks-based implementation, and it doesn't seem to have any issues like airbytehq/airbyte does.
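One way to pin the runner pods to Karpenter-provisioned capacity, in case it helps anyone compare setups (the names, the taint, and the label key are illustrative, and the well-known Karpenter label differs between its API versions), is via the RunnerDeployment pod spec:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: ci-runners                                 # illustrative name
spec:
  template:
    spec:
      repository: example-org/example-repo         # placeholder
      nodeSelector:
        karpenter.sh/provisioner-name: ci-runners  # label key depends on Karpenter version
      tolerations:
      - key: dedicated                             # illustrative taint on the Karpenter-managed nodes
        value: ci-runners
        operator: Equal
        effect: NoSchedule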
I did pick up a canary build to deploy https://github.com/actions/actions-runner-controller/pull/2502. It does look like it improved the situation slightly, but webhooks scaling appears to still basically be unusable on its own for the amount of traffic on airbytehq/airbyte - jobs will be stuck in pending for hours waiting to get a runner when they should be immediately scaling up.
What I put in yesterday as a workaround seems to have basically fixed it: I put in pull-based autoscaling as well, which looks like this:

scaleUpTriggers:
- githubEvent:
    workflowJob: {}
  duration: "720m"
metrics:
- type: PercentageRunnersBusy
  scaleUpThreshold: '1.0'    # The percentage of busy runners at which the number of desired runners is re-evaluated to scale up
  scaleDownThreshold: '0.99' # The percentage of busy runners at which the number of desired runners is re-evaluated to scale down
  scaleUpAdjustment: 1       # The scale-up runner count added to the desired count
  scaleDownAdjustment: 1     # The scale-down runner count subtracted from the desired count

Now the pull-based scaling follows behind whenever the webhook count gets out of whack and slowly resolves the issue. It's of course not ideal, but it at least makes the implementation usable while giving us the benefit of (usually) spawning runners faster than pull-based alone. I will probably end up tweaking the anti-flapping config to be a bit shorter under this setup to save a little money, as currently, with a lot of incoming scale-ups, the anti-flapping takes a while to bring things back down.
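The anti-flapping knob referred to above is the HRA's scale-down delay; shortening it (the value below is illustrative) lets the pull-based metric bring replicas back down sooner:

spec:
  # How long scale-down is suppressed after a scale-up;
  # a shorter window means idle replicas are reclaimed sooner.
  scaleDownDelaySecondsAfterScaleOut: 300   # illustrative value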
@cpdeethree I have a couple of suggestions for things to look at regarding your issues with HRA not scaling up your busy runner deployments adequately. I am including issues that are probably not affecting you so that others reading this issue can benefit.
First is that HRA.spec.scaleUpTriggers[].duration needs to be set to a duration that is longer than you expect a job to run, from the time it is queued to the time the job is finished, assuming a runner is available. I see you set it to "720m", so that is probably long enough, but still check to make sure. Every time a job takes longer than this duration, your runner group will be scaled down by 1 runner, and you will be perpetually under-resourced until you hit your minimum number of runners. (And if your minimum is zero, your capacity may never recover.)
Of course, make sure that HRA.spec.maxReplicas is set high enough to accommodate all the jobs you want to run in parallel. Again, if your pull-based scaling is working, then you probably have this set adequately, too.
With that out of the way, my guess is that you are running jobs with runs-on selectors that match more than one runner group. Perhaps you are matching both a repository and an organization runner group. The webhook-based autoscaler will scale the first runner group it finds (see #2798), which may not be the one the job runs on. Every time the wrong group is scaled, you will end up with a job waiting in the queue. Then, if your minimum group size is too small, it will not work through the backlog before you start getting new jobs queued up.
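As a sketch of the disambiguation (all names and labels below are made up), giving each RunnerDeployment a unique label and targeting exactly that label from the workflow keeps a job from matching more than one group:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: repo-ci-runners                        # illustrative name
spec:
  template:
    spec:
      repository: example-org/example-repo     # placeholder repository
      labels:
      - example-repo-ci    # unique label; workflows use runs-on: [self-hosted, example-repo-ci]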
Following up here. I did end up changing HRA.spec.scaleUpTriggers[].duration to > 24 hours (which is what our max CI timeout was in fact set to), and that appears to have resolved the issue. It's a shame that there's no easy way to configure it so that you can't footgun yourself like that, but we are talking about two entirely disparate systems here. Maybe a simple cron that periodically checks that the two values make sense together is warranted, but that would probably have to be something we implement ourselves.
Checks
Controller Version
v0.27.0
Helm Chart Version
0.22.0
CertManager Version
1.5.3
Deployment Method
Helm
cert-manager installation
Yes
Checks
Resource Definitions
To Reproduce
Describe the bug
We are facing long queue times where a workflow job is not able to pick up a runner. I am not sure whether the log messages Failed to update runnerreplicaset resource and Runner pod is annotated to wait for completion, and the runner container is not restarting give some hints. When we manually set or increase minReplicas, the workflow is able to pick up a runner, as you can see in the picture below, but I don't think we should have to manage this manually.
Here are some logs from the controller.
Describe the expected behavior
Runner should be scaled up and scaled down properly.
Whole Controller Logs
Whole Runner Pod Logs
Additional Context
We have different RunnerDeployment and HRA per namespace for different runner groups, as below.
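For reference, a pair along these lines (every name, group, and replica count below is an illustrative placeholder, not one of the actual manifests from this cluster) looks like:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: team-a-runners              # illustrative
  namespace: team-a                 # one namespace per runner group
spec:
  template:
    spec:
      organization: example-org     # placeholder
      group: team-a                 # placeholder GitHub runner group
      labels:
      - team-a
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: team-a-runners-autoscaler
  namespace: team-a
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: team-a-runners
  minReplicas: 1
  maxReplicas: 20
  scaleUpTriggers:
  - githubEvent:
      workflowJob: {}
    duration: 12h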