pavelsmolensky opened this issue 3 years ago
@pavelsmolensky Hey! This is a known issue that is thought to happen when you miss upgrading the CRDs. You probably need to stop the controller, remove all the runner-related resources, then update the CRDs and restart the controller.
If you haven't read it yet, we document it at the beginning of each release note:
https://github.com/actions-runner-controller/actions-runner-controller/releases/tag/v0.20.2
Unfortunately, we don't know the root cause (probably some missing validation code in our controller?). I'd appreciate it very much if you could help isolate where the problematic code is! Thanks.
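The upgrade order described above could be sketched roughly like this. This is a hypothetical sketch, not the project's documented procedure: the namespace and deployment names (`actions-runner-system`, `actions-runner-controller`) and the `crds/` path are assumptions; adjust them to your installation and apply the CRD manifests shipped with the chart version you are upgrading to.

```shell
# Assumed names: namespace actions-runner-system, deployment actions-runner-controller.

# 1. Stop the controller so it cannot act while the CRDs are half-upgraded.
kubectl -n actions-runner-system scale deployment actions-runner-controller --replicas=0

# 2. Remove runner-related resources while the controller is down.
kubectl delete horizontalrunnerautoscalers.actions.summerwind.dev --all --all-namespaces
kubectl delete runnerdeployments.actions.summerwind.dev --all --all-namespaces
kubectl delete runnerreplicasets.actions.summerwind.dev --all --all-namespaces
kubectl delete runners.actions.summerwind.dev --all --all-namespaces

# 3. Update the CRDs yourself: `helm upgrade` does not upgrade CRDs.
#    (Path is an assumption; use the crds/ directory of the target chart version.)
kubectl apply -f charts/actions-runner-controller/crds/

# 4. Bring the controller back up on the new version.
kubectl -n actions-runner-system scale deployment actions-runner-controller --replicas=1
```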
Hi @mumoshu, thanks for the prompt reply! Actually, I always removed all the CRDs in advance: horizontalrunnerautoscalers.actions.summerwind.dev, runnerdeployments.actions.summerwind.dev, runnerreplicasets.actions.summerwind.dev, runners.actions.summerwind.dev, runnersets.actions.summerwind.dev (hope nothing is missing). All the CRDs in my case were installed by Helm alongside the controller. I did that for all 4 tested versions of the Helm chart (0.12.8 - 0.13.1).
@pavelsmolensky Just to be extra sure, have you also stopped the actions-runner-controller's controller-manager and other pods before upgrading CRDs?
Also, what's your K8s version? If chart v0.13.0 or greater makes a difference, I'd guess it's related to the fact that our CRDs were upgraded from apiextensions.k8s.io/v1beta1 to apiextensions.k8s.io/v1.
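One quick way to see which API versions the installed CRDs actually serve (a sketch; the CRD name is taken from the thread, and the command assumes the CRDs are installed in your cluster):

```shell
# Print the served versions of one of the controller's CRDs,
# e.g. "v1alpha1" -- useful to confirm the CRDs were actually replaced.
kubectl get crd runnerdeployments.actions.summerwind.dev \
  -o jsonpath='{.spec.versions[*].name}'
```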
@mumoshu First I remove the CRDs; after they are removed, I uninstall the actions-runner-controller chart. The K8s version is 1.18.19 (far from the latest).
@pavelsmolensky Ah, I've already nuked 1.18 from my manual test matrix, so perhaps it works only on e.g. 1.20...?
https://github.com/actions-runner-controller/actions-runner-controller/pull/803
Anyway, you'd better stop the controller before removing the CRDs, as doing the opposite might result in undefined behaviour.
@mumoshu It makes sense now; probably no one else has reported this issue because of the quite old K8s version I'm using. I'll try it on a newer version and hope it will work well there. Many thanks for your professional support, appreciate it!
Maybe let's keep the issue open for a while, I'll put here my updates on newer k8s version findings.
@pavelsmolensky Thanks for confirming. I'm really looking forward to hearing your updates!
It would be super helpful if you could share actions-runner-controller's controller-manager logs, so that I can look for errors that may guide us to the root cause. actions-runner-controller should do its best not to spam your cluster, at least...
@mumoshu Aren't they here: https://github.com/actions-runner-controller/actions-runner-controller/files/7300185/gh-actions-actions-runner-controller-86c4cdfccc-tzxb5-1633593533072392200.log (bottom links in the bug description)? Or do you mean something else? The attached logs are not complete, but the pattern is the same until pods start hitting the resource quota.
@pavelsmolensky Thanks for sharing the log! I somehow missed that in your original report.
Well, I read it, and it turns out there are no error messages. Then it might be that some "normal" code path ends up letting a broken RunnerReplicaSet create a lot of Runners, OR a broken RunnerDeployment create a lot of RunnerReplicaSets.
Do you remember which was your case? Did you see a lot of RunnerReplicaSets? Or did you see only one RunnerReplicaSet per RunnerDeployment but a lot of Runners?
@mumoshu There were lots of RunnerReplicaSets and lots of Runners, so RunnerDeployment is the culprit, I would say.
@mumoshu Some good news: I have tested the latest controller on K8s v1.19.15, and this issue is no longer reproducible. So indeed the problem was related to K8s 1.18. Maybe it's worth mentioning somewhere in the docs; otherwise, I don't mind closing the ticket. Thank you!
I can confirm that in k8s 1.18 I cannot get the CRDs to sync properly with ArgoCD. I've never had this issue in any previous upgrades of controller.
So after deleting the CRDs and creating them again, I end up with this in ArgoCD:
Describe the bug
RunnerDeployment creates an uncontrolled number of Runners, starting from helm chart version 0.13.0.
Checks
To Reproduce
Steps to reproduce the behavior:
1. helm upgrade --install -n gh-actions actions-runner-controller actions-runner-controller/actions-runner-controller --set authSecret.create=true --set scope.watchNamespace=gh-actions --set scope.singleNamespace=true --set securityContext.runAsUser=999 --set authSecret.github_token=*** --version "0.13.1"
2. kubectl -n gh-actions apply -f org-runner-deployment.yaml
3. Watch the RunnerReplicaSet, Runner, and Pod resources => their number grows without stopping. In my case I got about 25,000 runners before I realized what was happening. Pods eventually cannot start because they hit request limits. So many GitHub API requests happen that the quota is reached, and it becomes very difficult to clean up all the created resources (because of the Runner finalizers).

Expected behavior
1 RunnerReplicaSet, 1 Runner, and 1 Pod should be created.
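For readers stuck in the same cleanup situation, removing the finalizers lets the stuck Runner objects be deleted. This is a hypothetical sketch, not an official procedure: the namespace `gh-actions` comes from the report above, and clearing finalizers in bulk bypasses the controller's normal deregistration, so verify before running it:

```shell
# Strip finalizers from every Runner in the namespace so deletion can complete.
# WARNING: this skips the controller's own cleanup (e.g. GitHub deregistration).
for r in $(kubectl -n gh-actions get runners.actions.summerwind.dev \
    -o jsonpath='{.items[*].metadata.name}'); do
  kubectl -n gh-actions patch runners.actions.summerwind.dev "$r" \
    --type merge -p '{"metadata":{"finalizers":[]}}'
done
```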
Environment (please complete the following information):
Additional context
Controller logs: gh-actions-actions-runner-controller-86c4cdfccc-tzxb5-1633593533072392200.log
List of runners after a couple of seconds (exported from k9s): runners-gh-actions-1633593544480491500.csv