actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

RunnerDeployment creates an uncontrolled number of Runners #875

Open pavelsmolensky opened 3 years ago

pavelsmolensky commented 3 years ago

Describe the bug: RunnerDeployment creates an uncontrolled number of Runners, starting from Helm chart version 0.13.0.

Checks

To Reproduce: Steps to reproduce the behavior:

  1. Install actions-runner-controller via the Helm chart (versions 0.13.0, 0.13.1 and 0.13.2 have this issue; 0.12.8 and below don't):

    helm upgrade --install -n gh-actions actions-runner-controller actions-runner-controller/actions-runner-controller \
      --set authSecret.create=true --set scope.watchNamespace=gh-actions --set scope.singleNamespace=true \
      --set securityContext.runAsUser=999 --set authSecret.github_token=*** --version "0.13.1"

  2. Install the RunnerDeployment with kubectl -n gh-actions apply -f org-runner-deployment.yaml, where org-runner-deployment.yaml is:
    apiVersion: actions.summerwind.dev/v1alpha1
    kind: RunnerDeployment
    metadata:
      name: runner-deployment-1
    spec:
      replicas: 1
      template:
        spec:
          repository: existing-repo-name/existing-project-name
  3. Observe the RunnerReplicaSet, Runner and Pod resources => their number grows without stopping (a quick way to watch this is sketched below). In my case I got about 25000 Runners before I realized what was happening. Pods eventually cannot start because they hit resource quota limits. So many GitHub API requests are made that the quota is reached, and it becomes very difficult to clean up all the created resources (because of the Runner finalizers).
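A minimal sketch of the kind of check referenced above, assuming the gh-actions namespace and resource names from steps 1 and 2:

    # Count the ARC custom resources every few seconds; in the broken state the
    # numbers keep climbing instead of settling at replicas=1.
    watch -n 5 '
      kubectl -n gh-actions get runnerreplicasets.actions.summerwind.dev --no-headers 2>/dev/null | wc -l
      kubectl -n gh-actions get runners.actions.summerwind.dev --no-headers 2>/dev/null | wc -l
      kubectl -n gh-actions get pods --no-headers 2>/dev/null | wc -l
    '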

Expected behavior: 1 RunnerReplicaSet, 1 Runner and 1 Pod should be created.

Environment (please complete the following information):

Additional context: Controller logs: gh-actions-actions-runner-controller-86c4cdfccc-tzxb5-1633593533072392200.log. List of runners after a couple of seconds (exported from k9s): runners-gh-actions-1633593544480491500.csv

mumoshu commented 3 years ago

@pavelsmolensky Hey! This is a known issue that is thought to happen when you miss upgrading the CRDs. You probably need to stop the controller, remove all the runner-related resources, and then update the CRDs and restart the controller.
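Something along these lines usually works as a recovery sequence (a rough sketch only; it assumes the gh-actions namespace from this report, the chart's standard app.kubernetes.io/name label, and a checkout of this repository for the CRD manifests):

    # 1. Stop the controller so it cannot create any more resources
    kubectl -n gh-actions scale deployment -l app.kubernetes.io/name=actions-runner-controller --replicas=0

    # 2. Clear the Runner finalizers so deletion can complete while the controller
    #    is down (one patch per object), then delete the runner-related resources
    kubectl -n gh-actions get runners.actions.summerwind.dev -o name \
      | xargs -r -n1 kubectl -n gh-actions patch --type merge -p '{"metadata":{"finalizers":[]}}'
    kubectl -n gh-actions delete runnerdeployments.actions.summerwind.dev,runnerreplicasets.actions.summerwind.dev,runners.actions.summerwind.dev --all

    # 3. Update the CRDs (path assumes a checkout of this repo at the target release)
    kubectl apply -f charts/actions-runner-controller/crds/

    # 4. Start the controller again
    kubectl -n gh-actions scale deployment -l app.kubernetes.io/name=actions-runner-controller --replicas=1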

If you haven't read it yet, we document it at the beginning of each release note:

https://github.com/actions-runner-controller/actions-runner-controller/releases/tag/v0.20.2

Unfortunately, we don't know what the root cause is (probably some missing validation code in our controller?). I'd appreciate it very much if you could help isolate where the problematic code is! Thanks.

pavelsmolensky commented 3 years ago

Hi @mumoshu, thanks for the prompt reply! Actually, I always removed all the CRDs in advance: horizontalrunnerautoscalers.actions.summerwind.dev, runnerdeployments.actions.summerwind.dev, runnerreplicasets.actions.summerwind.dev, runners.actions.summerwind.dev, runnersets.actions.summerwind.dev (hope nothing is missing). All the CRDs in my case were installed by Helm alongside the controller. I did that for all 4 tested versions of the Helm chart (0.12.8 - 0.13.1).

mumoshu commented 3 years ago

@pavelsmolensky Just to be extra sure, have you also stopped the actions-runner-controller's controller-manager and other pods before upgrading CRDs?

Also, what's your K8s version? If chart v0.13.0 or greater makes a difference, I'd guess it's related to the fact that our CRDs were upgraded from apiextensions.k8s.io/v1beta1 to apiextensions.k8s.io/v1.
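For reference, a couple of quick checks that might help narrow it down (nothing ARC-specific is assumed beyond the CRD names):

    # Cluster version, and whether the v1 CustomResourceDefinition API is served
    kubectl version
    kubectl api-versions | grep apiextensions.k8s.io

    # Confirm which of the five ARC CRDs are actually installed
    kubectl get crd | grep actions.summerwind.dev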

pavelsmolensky commented 3 years ago

@mumoshu First I remove the CRDs, and after they are removed I uninstall the actions-runner-controller chart. The K8s version is 1.18.19 (far from the latest).

mumoshu commented 3 years ago

@pavelsmolensky Ah, I've already nuked 1.18 from my manual test matrix, so it probably works on e.g. 1.20...?

https://github.com/actions-runner-controller/actions-runner-controller/pull/803

Anyway, you'd better stop the controller before removing CRDs as doing the opposite might result in undefined behaviour.

pavelsmolensky commented 3 years ago

@mumoshu That makes sense now; probably no one else has reported this issue because of the rather old K8s version I'm using. I'll try it on a newer version and hope it works well there. Many thanks for your professional support, I appreciate it!

Maybe let's keep the issue open for a while; I'll post my findings on a newer K8s version here.

mumoshu commented 3 years ago

@pavelsmolensky Thanks for confirming. I'm really looking forward to hearing your updates! It would be super helpful if you could share actions-runner-controller's controller-manager logs so that I can hopefully spot errors that guide us to the root cause. actions-runner-controller should do its best not to spam your cluster, at least...

pavelsmolensky commented 3 years ago

@mumoshu Aren't they here: https://github.com/actions-runner-controller/actions-runner-controller/files/7300185/gh-actions-actions-runner-controller-86c4cdfccc-tzxb5-1633593533072392200.log (bottom links in the bug description)? Or do you mean something else? They are not complete in the attachment, but the pattern is the same until the pods start hitting the resource quota.

mumoshu commented 3 years ago

@pavelsmolensky Thanks for sharing the log! I somehow missed that in your original report.

Well, I read it, and it turns out there are no error messages. Then it might be that some "normal" code path ends up letting a broken runnerreplicaset create a lot of runners, OR a broken runnerdeployment create a lot of runnerreplicasets.

Do you remember which was your case? Did you see a lot of runnerreplicasets? Or did you see only one runnerreplicaset per runnerdeployment but a lot of runners?
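If it's easier than counting by hand, grouping the objects by owner should show which level is multiplying; a sketch (standard ownerReferences metadata assumed):

    # Many distinct RunnerReplicaSets owned by one RunnerDeployment points at the
    # runnerdeployment controller; one RunnerReplicaSet owning thousands of Runners
    # points at the runnerreplicaset controller.
    kubectl -n gh-actions get runnerreplicasets.actions.summerwind.dev \
      -o custom-columns=NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].name
    kubectl -n gh-actions get runners.actions.summerwind.dev \
      -o custom-columns=OWNER:.metadata.ownerReferences[0].name --no-headers | sort | uniq -c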

pavelsmolensky commented 3 years ago

@mumoshu There were lots of runnerreplicasets and lots of runners, so the runnerdeployment is the culprit, I would say.

pavelsmolensky commented 3 years ago

@mumoshu Some good news: I have tested the latest controller on K8s v1.19.15, and this issue is no longer reproducible. So the problem was indeed related to K8s 1.18. Maybe it's worth mentioning somewhere in the docs; otherwise, I don't mind closing the ticket. Thank you!

peimanja commented 3 years ago

I can confirm that on K8s 1.18 I cannot get the CRDs to sync properly with ArgoCD. I've never had this issue with any previous upgrade of the controller.

After deleting the CRDs and creating them again, I end up with this in ArgoCD:

[screenshot attached: ArgoCD sync status for the re-created CRDs]
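In case it helps while you're on 1.18: one workaround sometimes suggested for CRDs that Argo CD refuses to sync after re-creation is to apply them with replace semantics. A hedged sketch of such an Application (names, the path and the target revision are placeholders; whether this actually resolves the 1.18 behaviour is untested here):

    # Sketch only: manage the ARC CRDs in their own Argo CD Application and apply
    # them with `kubectl replace` semantics instead of a merge patch.
    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: actions-runner-controller-crds   # placeholder name
      namespace: argocd
    spec:
      project: default
      source:
        repoURL: https://github.com/actions-runner-controller/actions-runner-controller
        targetRevision: v0.20.2                        # whichever release you deploy
        path: charts/actions-runner-controller/crds    # path is an assumption
      destination:
        server: https://kubernetes.default.svc
      syncPolicy:
        syncOptions:
          - Replace=true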