FairwindsOps / rbac-manager

A Kubernetes operator that simplifies the management of Role Bindings and Service Accounts.
https://fairwinds.com
Apache License 2.0

Certain rolebindings seem to only sync on rbac-manager Pod reboot #476

Closed: bcha closed this issue 2 months ago

bcha commented 4 months ago

What happened?

I'm not quite sure what's going on, but we noticed that certain rolebindings can take a long time, sometimes hours, to appear in their namespaces. After further testing, it looks like they are only synced when rbac-manager restarts or a new rbac-manager Pod is created. Other rolebindings appear almost immediately.

What did you expect to happen?

All rolebindings should be synced to their namespaces quickly.

How can we reproduce this?

Our setup looks like this. First, the older, generic "developer" RBACDefinition, which does get synced almost immediately:

apiVersion: rbacmanager.reactiveops.io/v1beta1
kind: RBACDefinition
metadata:
  name: developers
rbacBindings:
  - name: developer
    subjects:
      - kind: Group
        name: developers
    roleBindings:
      - clusterRole: edit
        namespaceSelector:
          matchExpressions:
            - key: kubernetes.io/metadata.name
              operator: NotIn
              values:
              - default
              - kube-system
              - kyverno
    clusterRoleBindings:
      - clusterRole: view
      - clusterRole: developer
      - clusterRole: developers-extra-permissions

Then we have newer per-team RBACDefinitions, so that, for example, team-a can have more permissions in its own namespaces. These only seem to sync when rbac-manager restarts:

apiVersion: rbacmanager.reactiveops.io/v1beta1
kind: RBACDefinition
metadata:
  name: team-asdfgh-rbac-definition
rbacBindings:
  - name: team-asdfgh
    subjects:
      - kind: Group
        name: team-asdfgh
    roleBindings:
      - clusterRole: edit
        namespaceSelector:
          matchLabels:
            app-owner: team-asdfgh
      - clusterRole: edit
        namespaceSelector:
          matchLabels:
            developers: edit
    clusterRoleBindings:
      - clusterRole: view
      - clusterRole: support
      - clusterRole: developers-extra-permissions

➜ kg rolebindings.rbac.authorization.k8s.io
NAME                        ROLE               AGE
developers-developer-edit   ClusterRole/edit   117s

➜ krrd -n rbac-manager
deployment.apps/rbac-manager restarted

➜ kg rolebindings.rbac.authorization.k8s.io
NAME                                                       ROLE               AGE
developers-developer-edit                                  ClusterRole/edit   2m14s
team-asdfgh-rbac-definition-team-asdfgh-edit               ClusterRole/edit   6s

Could the issue be related to using matchLabels? The logs aren't helping; there's nothing relevant in them.
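
For reference, namespaceSelector uses standard Kubernetes label-selector semantics: every matchLabels pair and every matchExpressions entry must hold. The following is a hypothetical, stdlib-only sketch of those semantics (not rbac-manager's code; the function names and the namespace "example-ns" are assumptions for illustration) that checks whether a freshly created namespace labeled for team-asdfgh should be selected by both definitions above. It assumes the namespace carries the kubernetes.io/metadata.name label, which Kubernetes sets automatically.

```python
# Hypothetical, stdlib-only sketch of Kubernetes label-selector semantics.
# NOT rbac-manager's implementation: it only shows which namespaces the
# namespaceSelectors in the RBACDefinitions above are expected to match.


def expression_matches(labels: dict, expr: dict) -> bool:
    """Evaluate one matchExpressions entry against a namespace's labels."""
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        # A namespace without the key also satisfies NotIn.
        return key not in labels or labels[key] not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"unsupported operator: {op}")


def selector_matches(labels: dict, selector: dict) -> bool:
    """Every matchLabels pair AND every matchExpressions entry must hold."""
    for key, value in selector.get("matchLabels", {}).items():
        if labels.get(key) != value:
            return False
    return all(expression_matches(labels, e)
               for e in selector.get("matchExpressions", []))


# Selectors taken from the two RBACDefinitions in this issue
# (the team's second selector, developers: edit, is omitted here).
developers_edit = {
    "matchExpressions": [{
        "key": "kubernetes.io/metadata.name",
        "operator": "NotIn",
        "values": ["default", "kube-system", "kyverno"],
    }]
}
team_edit = {"matchLabels": {"app-owner": "team-asdfgh"}}

# Assumed labels of a new namespace: Kubernetes sets kubernetes.io/metadata.name
# automatically; app-owner is added by the user. "example-ns" is hypothetical.
new_ns = {"kubernetes.io/metadata.name": "example-ns", "app-owner": "team-asdfgh"}

print(selector_matches(new_ns, developers_edit))  # True
print(selector_matches(new_ns, team_edit))        # True
```

Both selectors match such a namespace, which is consistent with the team rolebinding appearing once rbac-manager restarts; the selector logic itself does not appear to be what delays it.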

Version

rbac-manager from the latest Helm chart (1.20), so the app version is v1.8.0. We're running this on EKS with Kubernetes 1.29.

Additional context

No response

sudermanjr commented 4 months ago

The reconciliation loop should run on any change to an RBACDefinition or a Namespace. Is one of those changing at the time you expect the updates to happen? I can't quite tell what your workflow is here.
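
The behavior described above can be illustrated with a toy, in-memory model. This is a hypothetical sketch, not rbac-manager's code: the class and method names are invented, selectors are simplified to matchLabels only, and the rolebinding name format "<definition>-<binding>-<clusterRole>" is inferred from the kubectl output earlier in this issue. Under this model, a Namespace event and an RBACDefinition event both trigger the same full pass, so labeling a namespace alone should be enough to create the per-team rolebinding.

```python
# Toy, in-memory model of event-driven reconciliation. Hypothetical: class and
# method names are invented, selectors are matchLabels-only, and the RoleBinding
# name format "<definition>-<binding>-<clusterRole>" is inferred from the
# kubectl output in this issue.
from dataclasses import dataclass, field


@dataclass
class Definition:
    name: str             # RBACDefinition metadata.name
    binding: str          # rbacBindings[].name
    role_bindings: list   # (clusterRole, matchLabels) pairs


@dataclass
class ToyReconciler:
    definitions: dict = field(default_factory=dict)   # name -> Definition
    namespaces: dict = field(default_factory=dict)    # name -> labels
    role_bindings: set = field(default_factory=set)   # (namespace, rb name)

    def reconcile_all(self) -> None:
        # Full pass over every definition and namespace; creates any missing
        # RoleBinding (deleting stale ones is omitted for brevity).
        for d in self.definitions.values():
            for cluster_role, match_labels in d.role_bindings:
                rb_name = f"{d.name}-{d.binding}-{cluster_role}"
                for ns, labels in self.namespaces.items():
                    if all(labels.get(k) == v for k, v in match_labels.items()):
                        self.role_bindings.add((ns, rb_name))

    # Namespace and RBACDefinition events both trigger the same full pass.
    def on_namespace_event(self, ns: str, labels: dict) -> None:
        self.namespaces[ns] = dict(labels)
        self.reconcile_all()

    def on_definition_event(self, d: Definition) -> None:
        self.definitions[d.name] = d
        self.reconcile_all()


r = ToyReconciler()
r.on_definition_event(Definition(
    "team-asdfgh-rbac-definition", "team-asdfgh",
    [("edit", {"app-owner": "team-asdfgh"})]))
r.on_namespace_event("example-ns", {})                             # created, no label
r.on_namespace_event("example-ns", {"app-owner": "team-asdfgh"})  # label added
print(sorted(r.role_bindings))
# [('example-ns', 'team-asdfgh-rbac-definition-team-asdfgh-edit')]
```

Routing both event types into one idempotent pass is a common controller pattern: the end state does not depend on which event arrives first.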

bcha commented 4 months ago

Yes. The example in the OP was taken just after a new Namespace was created.

developers-developer-edit appears immediately, but team-asdfgh-rbac-definition-team-asdfgh-edit only appears after rbac-manager is restarted. We have three clusters with this setup, and I can reproduce the issue on all of them.

We first noticed this a few days ago, when one of the developers didn't have the correct permissions in their namespace: the namespace was missing the app-owner label. After we added the label, the per-team rolebinding was not synced to that namespace until rbac-manager was restarted.

sudermanjr commented 4 months ago

That's very odd indeed. I notice that our Kubernetes client libraries haven't been updated in a while, and I wonder if that's causing incompatibilities with 1.29. I opened https://github.com/FairwindsOps/rbac-manager/pull/477 to update the e2e tests to run against newer Kubernetes versions.

sudermanjr commented 4 months ago

Can you please re-test with v1.9.0 of rbac-manager?

bcha commented 4 months ago

I'm now running rbac-manager v1.9.0, upgraded via the latest Helm chart. The cluster has also just been upgraded to Kubernetes 1.30.

➜ kgd -n rbac-manager -o yaml | grep -i image:
          image: quay.io/reactiveops/rbac-manager:v1.9.0

➜ k version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.1-eks-1de2ab1

I did some more testing. The issue persists 😢

Still missing 12 minutes after namespace creation ❌

➜ kg rolebindings.rbac.authorization.k8s.io
NAME                        ROLE               AGE
developers-developer-edit   ClusterRole/edit   12m

kubectl label does not trigger ❌

➜ k label ns turmio-134 app-owner=team-another --overwrite
namespace/turmio-134 labeled

➜ kg rolebindings.rbac.authorization.k8s.io
NAME                        ROLE               AGE
developers-developer-edit   ClusterRole/edit   14m

kubectl edit namespace does not trigger ❌

➜ k edit ns turmio-134
namespace/turmio-134 edited

➜ kg rolebindings.rbac.authorization.k8s.io
NAME                        ROLE               AGE
developers-developer-edit   ClusterRole/edit   15m

rbac-manager restart triggers ✅

➜ krrd -n rbac-manager
deployment.apps/rbac-manager restarted

➜ kg rolebindings.rbac.authorization.k8s.io
NAME                                                       ROLE               AGE
developers-developer-edit                                  ClusterRole/edit   15m
team-another-rbac-definition-team-another-edit             ClusterRole/edit   11s

kubectl edit rbacdefinition also triggers ✅

➜ kg rolebindings.rbac.authorization.k8s.io
NAME                        ROLE               AGE
developers-developer-edit   ClusterRole/edit   37s

➜ k edit rbacdefinitions.rbacmanager.reactiveops.io team-asd-rbac-definition
rbacdefinition.rbacmanager.reactiveops.io/team-asd-rbac-definition edited

➜ kg rolebindings.rbac.authorization.k8s.io
NAME                                           ROLE               AGE
developers-developer-edit                      ClusterRole/edit   77s
team-asd-rbac-definition-team-asd-edit         ClusterRole/edit   4s

bcha commented 4 months ago

The exact same issue happens even when I change from...

      - clusterRole: edit
        namespaceSelector:
          matchLabels:
            app-owner: team-asdfgh

...to:

      - clusterRole: edit
        namespaceSelector:
          matchExpressions:
            - key: app-owner
              operator: In
              values: [ team-asdfgh ]

So that doesn't work as a workaround either.
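
It makes sense that this change made no difference: the Kubernetes LabelSelector definition states that each matchLabels entry {key: value} is equivalent to a matchExpressions entry with operator In and a values list containing only that value. A hypothetical helper (to_match_expressions is an assumed name, not part of rbac-manager or any Kubernetes client) makes the equivalence concrete for the two selectors above:

```python
def to_match_expressions(match_labels: dict) -> list:
    """Rewrite matchLabels as the equivalent matchExpressions entries.

    Per the Kubernetes LabelSelector definition, {key: value} in matchLabels
    means: key In [value]. Keys are sorted only to make the output stable.
    """
    return [
        {"key": key, "operator": "In", "values": [value]}
        for key, value in sorted(match_labels.items())
    ]


# The two selector forms tried in this issue.
before = {"matchLabels": {"app-owner": "team-asdfgh"}}
after = {"matchExpressions": [
    {"key": "app-owner", "operator": "In", "values": ["team-asdfgh"]},
]}

print(to_match_expressions(before["matchLabels"]) == after["matchExpressions"])  # True
```

Because both forms select exactly the same namespaces, switching between them should not change which rolebindings rbac-manager creates.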

sudermanjr commented 4 months ago

Thanks. That all helps a lot. We can look into this further now.

bcha commented 1 month ago

@sudermanjr any chance of this getting worked on?