fluxcd / flux

Successor: https://github.com/fluxcd/flux2
https://fluxcd.io
Apache License 2.0

flux no longer able to reconcile cluster, complains about v1beta1 rbac #3580

Closed: evanrich closed this issue 2 years ago

evanrich commented 2 years ago

Describe the bug

flux-system     False   ClusterRoleBinding/gitlab-admin dry-run failed, error: no matches for kind "ClusterRoleBinding" in version "rbac.authorization.k8s.io/v1beta1"  master/bd16572bd961e1064914bdac473f1ffa82529707    False

here is my CRB yaml:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    kustomize.toolkit.fluxcd.io/checksum: 08460cd08b4fa485ed68582eba9b1addf83f4840
  labels:
    kustomize.toolkit.fluxcd.io/name: flux-system
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: gitlab-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: gitlab-admin
  namespace: kube-system

I get the following warning when Flux tries to reconcile, after upgrading to k8s 1.22. I've changed the api version in the gitlab-admin ClusterRoleBinding from v1beta1 to v1, deleted the ClusterRoleBinding and re-applied it, re-run bootstrap, and restarted the kustomize-controller pod, but it still keeps throwing this error. If I run the following:

kubectl get clusterrolebinding -A | grep v1beta1

I get no matches, so why does flux get kustomizations --watch still fail?

 flux get kustomizations --watch
NAME            READY   MESSAGE                                                                                                                                         REVISION                                           SUSPENDED
flux-system     False   ClusterRoleBinding/gitlab-admin dry-run failed, error: no matches for kind "ClusterRoleBinding" in version "rbac.authorization.k8s.io/v1beta1"  master/bd16572bd961e1064914bdac473f1ffa82529707    False

Steps to reproduce

Install flux using bootstrap on k8s 1.22.x. Run either flux get kustomizations or flux bootstrap blah blah to upgrade.

Expected behavior

cluster reconciles

Kubernetes version / Distro / Cloud provider

kubeadm 1.22.2

Flux version

flux: v0.24.0
helm-controller: v0.14.0
image-automation-controller: v0.18.0
image-reflector-controller: v0.14.0
kustomize-controller: v0.18.1
notification-controller: v0.19.0
source-controller: v0.19.0

Git provider

github

Container Registry provider

dockerhub

Additional context

No response


kingdonb commented 2 years ago

Hi @evanrich, thanks for posting. The toolkit.fluxcd.io API refers to Flux v2; FYI, this is the Flux v1 repo.

If you kubectl get -oyaml your Flux Kustomizations, ever since Flux v0.18 you will see a new section in status called inventory. I'm taking a wild guess here, but look inside it for the entries list, which looks something like this:

entries:
- id: _kustomizations.kustomize.toolkit.fluxcd.io_apiextensions.k8s.io_CustomResourceDefinition
  v: v1
- id: _providers.notification.toolkit.fluxcd.io_apiextensions.k8s.io_CustomResourceDefinition
  v: v1
- id: _receivers.notification.toolkit.fluxcd.io_apiextensions.k8s.io_CustomResourceDefinition
  v: v1
- id: _deis__Namespace
  v: v1
- id: _flux-system__Namespace
  v: v1
- id: _keycloak__Namespace
  v: v1
- id: _kube-oidc-proxy__Namespace
  v: v1

This is where Flux keeps track of which resources it is reconciling. Do you see the obsolete reference from the v1beta1 API listed in there?
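
For example, something along these lines prints just the inventory; the Kustomization name and namespace here are the bootstrap defaults, so adjust them if yours differ:

kubectl get kustomization flux-system -n flux-system \
  -o jsonpath='{.status.inventory.entries}'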

I wonder if this is a weird effect of (1) upgrading the cluster to K8s v1.22 (which removes the obsolete beta APIs that were deprecated) while there were ClusterRoleBinding resources that existed and were declared in Flux as v1beta1, and then (2) removing them from the repo while the v1beta1 API was no longer being served (after the cluster had already upgraded them to v1).

My expectation for upgrading a cluster that runs Flux past API deprecations like this is that it should be done in the reverse order (i.e. 2 then 1), so that there is no point where Flux tries to reconcile resources for an API that no longer exists. I'm guessing based on your report that you did this in the opposite order. I can't say for sure whether we tested that path, but you might find some unexpected or undefined results there, as it reads like a dark corner or edge-case hazard.

Try removing the ClusterRoleBinding from the repo completely. Disable garbage collection first, in a separate commit and push: set spec.prune to false, reconcile Flux, and double-check that prune is disabled. Then push a commit that deletes the CRB, so that it remains on the cluster and no workloads are impacted.
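
For example, the first commit would just flip prune off on the Kustomization that applies the CRB. This is only a sketch based on a default flux-system Kustomization; adjust the name, path, and interval to whatever you actually have:

apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 10m0s
  path: ./cluster   # illustrative; use the path your Kustomization already points at
  prune: false      # temporarily disabled so removing the CRB from Git does not delete it from the cluster
  sourceRef:
    kind: GitRepository
    name: flux-system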

Then, once Flux is no longer aware of the resource, add it back to its original location in flux-system again.
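
It is also worth grepping the repo itself to make sure no manifest still pins the old API version; something like this, where ./cluster stands in for whatever path your Kustomization points at:

grep -rn "rbac.authorization.k8s.io/v1beta1" ./cluster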

It sounds like Flux is trying to upgrade a resource for an API that no longer exists. If this is what's happening, maybe there is something that Flux can do better; but without a list of APIs and what versions they upgrade from and to, hardcoded into Flux, I'm not sure how Flux can handle upgrading resources in the usual Kubernetes way like this any better than it already does. Please confirm if this helps (then, assuming it worked, remember to re-enable spec.prune when you are finished 👍 )

kingdonb commented 2 years ago

Make sure also to set a timeout on your Flux Kustomization. Otherwise, you may get strange behavior while waiting for health checks to complete. I think the default timeout is spec.interval - 30s, so if your interval is 10m, you will have to wait at least that long to see changes take effect. By setting timeout to a lower number of seconds, you can get better (faster) feedback from changes to a Kustomization that is having trouble.
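
Something like this on the Kustomization spec is what I mean; the values are only an example:

spec:
  interval: 10m0s
  timeout: 2m0s   # fail fast instead of waiting for most of the interval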

I went back and re-read your report, and I guess you may have already tried all of my suggestions. I'm not sure if any of this will help; I'm sorry for the trouble you're experiencing. You can try https://github.com/fluxcd/flux2/discussions, where this might get more attention than here in the Flux v1 repo, which I think hardly anyone still monitors.

evanrich commented 2 years ago

@kingdonb aww crap sorry about that, I googled fluxcd issues and got this lol. Do you want me to move the issue? I tried running the following:

kubectl get kustomization flux-system -n flux-system -o yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kustomize.toolkit.fluxcd.io/v1beta1","kind":"Kustomization","metadata":{"annotations":{"kustomize.toolkit.fluxcd.io/checksum":"08460cd08b4fa485ed68582eba9b1addf83f4840"},"labels":{"kustomize.toolkit.fluxcd.io/name":"flux-system","kustomize.toolkit.fluxcd.io/namespace":"flux-system"},"name":"flux-system","namespace":"flux-system"},"spec":{"interval":"10m0s","path":"./cluster","prune":true,"sourceRef":{"kind":"GitRepository","name":"flux-system"},"validation":"client"}}
    kustomize.toolkit.fluxcd.io/checksum: 08460cd08b4fa485ed68582eba9b1addf83f4840
    reconcile.fluxcd.io/requestedAt: "2021-09-07T13:24:35.508528687-07:00"
  creationTimestamp: "2021-08-20T05:25:32Z"
  finalizers:
  - finalizers.fluxcd.io
  generation: 1
  labels:
    kustomize.toolkit.fluxcd.io/name: flux-system
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: flux-system
  namespace: flux-system
  resourceVersion: "112177165"
  uid: 92ad0397-afda-45f6-9634-ebfa736c208b
spec:
  force: false
  interval: 10m0s
  path: ./cluster
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  validation: client
status:
  conditions:
  - lastTransitionTime: "2021-12-03T19:37:04Z"
    message: |
      ClusterRoleBinding/gitlab-admin dry-run failed, error: no matches for kind "ClusterRoleBinding" in version "rbac.authorization.k8s.io/v1beta1"
    reason: ReconciliationFailed
    status: "False"
    type: Ready
  lastAppliedRevision: master/bd16572bd961e1064914bdac473f1ffa82529707
  lastAttemptedRevision: master/85704d8c663cafaf824e2cbc5bffd4fdfc7c55b7
  lastHandledReconcileAt: "2021-09-07T13:24:35.508528687-07:00"
  observedGeneration: 1

I don't see an entries list. I'll try posting this over in the v2 forum as well, thanks!

kingdonb commented 2 years ago

(I have a setup where I can try to reproduce this; if it's a weird behavior between upgrading K8s 1.21 and 1.22, I'll want to understand how it goes wrong in the worst case anyway.)

Let me give it a quick attempt to see if I can get one stuck in the same way as you've described, it sounds like there might be something there. 👍

evanrich commented 2 years ago

@kingdonb thanks. FWIW, I also ran the following:

flux bootstrap github \
  --components-extra=image-reflector-controller,image-automation-controller \
  --owner=$GITHUB_USER \
  --repository=k8sstuff \
  --branch=master \
  --path cluster \
  --read-write-key \
  --personal

and it throws the same message, even though the pods complete, so following the update guide to go from 0.14 to 0.24 of fluxcd v2 seems to do the same thing. I cross-posted this on the v2 forum; here's the link: https://github.com/fluxcd/flux2/discussions/2175

kingdonb commented 2 years ago

On my cluster, just upgraded from k8s 1.21.7 to 1.22.4, I did not have any trouble writing an update to a ClusterRoleBinding that was stuck on the v1beta1 RBAC API, straightforwardly upgrading it to v1 without issue on Flux v0.24.0:

75-deis False   ClusterRoleBinding/deis:deis-example namespace not specified, error: the server could not find the requested resource   staging/e3c5cf4ece118e8d4ce10ee4c04dba828a623aa6    False

75-deis Unknown reconciliation in progress  staging/e3c5cf4ece118e8d4ce10ee4c04dba828a623aa6    False
75-deis True    Applied revision: staging/4a13b690c964c2324dc60caebdcdbc8e5e9af24f  staging/4a13b690c964c2324dc60caebdcdbc8e5e9af24f    False

I am not sure how to reproduce the trouble that you are having. Can you please reiterate what was upgraded, in what order, in enough detail that I can reproduce it? I tried out the edge case that I was most concerned about (worried that it might have been un-tested), and I couldn't get it to fail at all. It may be because I had set a value for spec.timeout, but that would be pretty surprising.

https://github.com/kingdonb/bootstrap-repo/commit/4a13b690c964c2324dc60caebdcdbc8e5e9af24f

This commit was pushed after confirming the control plane and kubelets had all been upgraded and the v1beta1 API was definitely removed, which caused Flux to be unable to reconcile. I didn't have any trouble getting out of that by simply upgrading the API version in place in the file. So I'm not sure how to reproduce the trouble in the first place.

kingdonb commented 2 years ago

Agh, my error was for a different reason, so my test is spoilt...

I just noticed that ClusterRoleBinding/deis:deis-example namespace not specified got in the way of the particular edge case that you are hitting. Let me try again! (I happen to keep a cluster running that I can just tear down and re-bootstrap / destructively upgrade without disturbing anyone, for occasions such as this one...)

evanrich commented 2 years ago

@kingdonb sorry to bother you, I just resolved it. For some reason, pushing locally to GitHub wasn't updating the resource; I logged in and saw that the gitlab-admin service account file still had the v1beta1 api version. I edited the file directly in GitHub, and now it works. Sorry for being dense. =)
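
For anyone else who hits this, the whole fix was the one-line apiVersion bump in that file, roughly:

apiVersion: rbac.authorization.k8s.io/v1   # was rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: gitlab-admin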

kingdonb commented 2 years ago

Glad you got it solved! Got all the cobwebs out of my cluster at least. Thanks for using Flux 🥇