FairwindsOps / rbac-manager

A Kubernetes operator that simplifies the management of Role Bindings and Service Accounts.
https://fairwinds.com
Apache License 2.0

Issue starting rbac-manager in EKS 1.25 #440

Closed: ranferimeza closed this issue 6 months ago

ranferimeza commented 9 months ago

What happened?

rbac-manager is stuck in CrashLoopBackOff on EKS 1.25 after logging the following error. It was running fine on EKS 1.24:

time="2023-12-13T19:59:59Z" level=error msg="[failed to wait for rbacdefinition caches to sync: timed out waiting for cache to be synced for Kind *v1beta1.RBACDefinition, failed waiting for all runnables to end within grace period of 30s: context deadline exceeded]: unable to run the manager"

I see the CRDs are created, so I am unable to identify what is causing this problem.

What did you expect to happen?

rbac-manager to run without issues.

How can we reproduce this?

Install rbac-manager using Helmsman. I only supply tolerations so that it runs on a specific node group; all other values are the chart defaults.

Version

rbac-manager-1.18.0

Additional context

No response

sudermanjr commented 9 months ago

We've run on EKS 1.25 in the past, and our e2e tests run on 1.25 as well.

Can you try using the latest 1.19 chart?

ranferimeza commented 9 months ago

> We've run on EKS 1.25 in the past, and our e2e tests run on 1.25 as well.
>
> Can you try using the latest 1.19 chart?

I'll try this, and report back

ranferimeza commented 8 months ago

Same result, unfortunately. Same error message...

sudermanjr commented 8 months ago

Can you set logging to debug and share the entire log? I can't reproduce this.

ranferimeza commented 8 months ago

`time="2023-12-13T22:32:15Z" level=info msg=----------------------------------

time="2023-12-13T22:32:15Z" level=info msg="rbac-manager 1.7.0 running"

time="2023-12-13T22:32:15Z" level=info msg=----------------------------------

time="2023-12-13T22:32:15Z" level=info msg="Registering components"

time="2023-12-13T22:32:15Z" level=info msg="Watching resources related to RBAC Definitions"

time="2023-12-13T22:32:15Z" level=info msg="Watching RBAC Definitions"

[controller-runtime] log.SetLogger(...) was never called, logs will not be displayed: goroutine 84 [running]:

runtime/debug.Stack() /usr/local/go/src/runtime/debug/stack.go:24 +0x65

sigs.k8s.io/controller-runtime/pkg/log.eventuallyFulfillRoot() /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/log/log.go:59 +0xbd

sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).Error(0xc00009eac0, {0x1c62700, 0xc0001a20c0}, {0x1a3d2aa, 0x21}, {0x0, 0x0, 0x0}) /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/log/deleg.go:139 +0x68

github.com/go-logr/logr.Logger.Error({{0x1c7ccd8?, 0xc00009eac0?}, 0x0?}, {0x1c62700, 0xc0001a20c0}, {0x1a3d2aa, 0x21}, {0x0, 0x0, 0x0}) /go/pkg/mod/github.com/go-logr/logr@v1.2.4/logr.go:299 +0xda

sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1({0x1c7a038?, 0xc000184870?}) /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/source/kind.go:68 +0x1a5

k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1(0xc000184870?, {0x1c7a038?, 0xc000184870?}) /go/pkg/mod/k8s.io/apimachinery@v0.27.3/pkg/util/wait/loop.go:62 +0x5d

k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext({0x1c7a038, 0xc000184870}, {0x1c78d40?, 0xc0001a2a80}, 0x1, 0x0, 0x0?) /go/pkg/mod/k8s.io/apimachinery@v0.27.3/pkg/util/wait/loop.go:63 +0x205

k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel({0x1c7a038, 0xc000184870}, 0x0?, 0x0?, 0x0?) /go/pkg/mod/k8s.io/apimachinery@v0.27.3/pkg/util/wait/poll.go:33 +0x5c

sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1() /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/source/kind.go:56 +0xfa

created by sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/source/kind.go:48 +0x1e5

time="2023-12-13T22:34:45Z" level=error msg="[failed to wait for rbacdefinition caches to sync: timed out waiting for cache to be synced for Kind *v1beta1.RBACDefinition, failed waiting for all runnables to end within grace period of 30s: context deadline exceeded]: unable to run the manager"`

ranferimeza commented 8 months ago

Please let me know how to set the rbac-manager log level to debug. I don't really deal with the apps within a K8s cluster, as I only build the infra and let people install stuff on it.

sudermanjr commented 8 months ago

I don't think debug is going to give us a ton more info, but you should be able to set it by adding the Helm value `extraArgs=log-level=debug`.

It looks like in v1.6.0 we upgraded client-go from 0.26 to 0.27 which may have introduced an incompatibility with k8s 1.25 (which is End of Life at this point). So perhaps going back to 1.5.0 of rbac-manager (chart version 1.16.0) would work?

https://github.com/FairwindsOps/rbac-manager/releases/tag/v1.5.0 https://artifacthub.io/packages/helm/fairwinds-stable/rbac-manager/1.16.0
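To make the suspected gap concrete: client-go v0.x releases track Kubernetes v1.x, so the upgrade mentioned above moves the client from one minor release ahead of a 1.25 cluster to two. The sketch below only computes that distance; the helper names `minor` and `minor_skew` are hypothetical, and it makes no claim about which skew is actually supported.

```python
def minor(version):
    """Return the minor component of a version such as '0.27', 'v1.25', or '1.25.3'."""
    parts = version.lstrip("v").split(".")
    return int(parts[1])


def minor_skew(client_go, cluster):
    """Minor releases the client library is ahead of the cluster (negative if behind)."""
    return minor(client_go) - minor(cluster)


# client-go 0.26 vs 0.27 against a Kubernetes 1.25 control plane
for client_go in ("0.26", "0.27"):
    print(f"client-go {client_go} vs cluster 1.25: skew {minor_skew(client_go, '1.25')}")
```

This prints a skew of 1 for client-go 0.26 and 2 for client-go 0.27.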

ranferimeza commented 8 months ago

I'll try this and let you know. Thanks!

ranferimeza commented 8 months ago

Update: I downgraded rbac-manager to the suggested version, and now it fails due to a liveness probe failure. I'll go back to the latest version and enable debug logging to see if that helps figure this out. Thanks!

sudermanjr commented 8 months ago

The liveness probe failure could be caused by CPU throttling. As a matter of fact, so could your original issue. Is this a large cluster? What are your CPU/memory requests and limits?

ranferimeza commented 8 months ago

Hi, the cluster is "large": 6 nodes total, separated into 3 node groups of 2 nodes each, dedicated to separate workloads. The instances are m5.large for the "lesser" node groups and c5a.xlarge for the most important node group, where the main apps run. rbac-manager runs on one of the m5.large node groups, of course. There are no limits: we deploy the same configuration and instance types for 1.24 and we did not see this issue.

And, as you mentioned, debug did not help:

`time="2023-12-15T15:14:35Z" level=info msg=----------------------------------

time="2023-12-15T15:14:35Z" level=info msg="rbac-manager 1.7.0 running"

time="2023-12-15T15:14:35Z" level=info msg=----------------------------------

time="2023-12-15T15:14:35Z" level=debug msg="Setting up client for manager"

time="2023-12-15T15:14:35Z" level=debug msg="Setting up manager"

time="2023-12-15T15:14:35Z" level=info msg="Registering components"

time="2023-12-15T15:14:35Z" level=debug msg="Setting up scheme"

time="2023-12-15T15:14:35Z" level=debug msg="Setting up controller"

time="2023-12-15T15:14:35Z" level=info msg="Watching resources related to RBAC Definitions"

time="2023-12-15T15:14:35Z" level=info msg="Watching RBAC Definitions"

[controller-runtime] log.SetLogger(...) was never called, logs will not be displayed: goroutine 63 [running]:

runtime/debug.Stack() /usr/local/go/src/runtime/debug/stack.go:24 +0x65

sigs.k8s.io/controller-runtime/pkg/log.eventuallyFulfillRoot() /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/log/log.go:59 +0xbd

sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).Error(0xc000098fc0, {0x1c62700, 0xc000446620}, {0x1a3d2aa, 0x21}, {0x0, 0x0, 0x0}) /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/log/deleg.go:139 +0x68

github.com/go-logr/logr.Logger.Error({{0x1c7ccd8?, 0xc000098fc0?}, 0x0?}, {0x1c62700, 0xc000446620}, {0x1a3d2aa, 0x21}, {0x0, 0x0, 0x0}) /go/pkg/mod/github.com/go-logr/logr@v1.2.4/logr.go:299 +0xda

sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1({0x1c7a038?, 0xc00039e190?}) /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/source/kind.go:68 +0x1a5

k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1(0xc00039e190?, {0x1c7a038?, 0xc00039e190?}) /go/pkg/mod/k8s.io/apimachinery@v0.27.3/pkg/util/wait/loop.go:62 +0x5d

k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext({0x1c7a038, 0xc00039e190}, {0x1c78d40?, 0xc0001edde0}, 0x1, 0x0, 0x0?) /go/pkg/mod/k8s.io/apimachinery@v0.27.3/pkg/util/wait/loop.go:63 +0x205

k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel({0x1c7a038, 0xc00039e190}, 0x0?, 0x0?, 0x0?) /go/pkg/mod/k8s.io/apimachinery@v0.27.3/pkg/util/wait/poll.go:33 +0x5c

sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1() /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/source/kind.go:56 +0xfa

created by sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/source/kind.go:48 +0x1e5

time="2023-12-15T15:17:05Z" level=error msg="[failed to wait for namespace caches to sync: timed out waiting for cache to be synced for Kind *v1.Namespace, failed waiting for all runnables to end within grace period of 30s: context deadline exceeded]: unable to run the manager"`
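One detail worth noting when comparing the two log dumps in this thread: the final error names a different Kind each time (`*v1beta1.RBACDefinition` in the first run, `*v1.Namespace` in the debug run). Below is a minimal sketch for pulling that field out of logs in this text format, assuming lines shaped like the ones pasted above. The function names `parse_line` and `unsynced_kinds` are hypothetical helpers, not part of rbac-manager.

```python
import re

# key=value pairs as they appear in the pasted log lines; values are bare or double-quoted
PAIR_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')
# the Kind named in the cache-sync timeout message
KIND_RE = re.compile(r"timed out waiting for cache to be synced for Kind (\*?[\w.]+)")


def parse_line(line):
    """Return the key=value fields of one log line as a dict."""
    fields = {}
    for key, value in PAIR_RE.findall(line):
        if value.startswith('"') and value.endswith('"'):
            value = value[1:-1]  # drop the surrounding quotes
        fields[key] = value
    return fields


def unsynced_kinds(log_text):
    """List the Kinds named in error-level cache-sync timeouts, in log order."""
    kinds = []
    for line in log_text.splitlines():
        # tolerate the stray backticks left over from pasting into the issue
        fields = parse_line(line.strip().strip("`"))
        if fields.get("level") != "error":
            continue
        match = KIND_RE.search(fields.get("msg", ""))
        if match:
            kinds.append(match.group(1))
    return kinds


# The two final error lines from this thread, copied verbatim
sample = '''time="2023-12-13T22:34:45Z" level=error msg="[failed to wait for rbacdefinition caches to sync: timed out waiting for cache to be synced for Kind *v1beta1.RBACDefinition, failed waiting for all runnables to end within grace period of 30s: context deadline exceeded]: unable to run the manager"
time="2023-12-15T15:17:05Z" level=error msg="[failed to wait for namespace caches to sync: timed out waiting for cache to be synced for Kind *v1.Namespace, failed waiting for all runnables to end within grace period of 30s: context deadline exceeded]: unable to run the manager"'''

print(unsynced_kinds(sample))  # ['*v1beta1.RBACDefinition', '*v1.Namespace']
```

Since the Kind that times out differs between runs, the failure does not appear to be tied to the RBACDefinition CRD alone.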