Closed: evanrich closed this issue 2 years ago.
Hi @evanrich, thanks for posting. The toolkit.fluxcd.io API refers to Flux v2; FYI, this is the Flux v1 repo.
If you `kubectl get -oyaml` your Flux Kustomizations, ever since Flux v0.18 you will see a new section in `status` called `inventory`. I'm taking a wild guess here that if you look in it you will find the `entries` section, for example:
```yaml
entries:
- id: _kustomizations.kustomize.toolkit.fluxcd.io_apiextensions.k8s.io_CustomResourceDefinition
  v: v1
- id: _providers.notification.toolkit.fluxcd.io_apiextensions.k8s.io_CustomResourceDefinition
  v: v1
- id: _receivers.notification.toolkit.fluxcd.io_apiextensions.k8s.io_CustomResourceDefinition
  v: v1
- id: _deis__Namespace
  v: v1
- id: _flux-system__Namespace
  v: v1
- id: _keycloak__Namespace
  v: v1
- id: _kube-oidc-proxy__Namespace
  v: v1
```
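A quick way to pull just that section out of the status (standard kubectl; this assumes your Kustomization is named flux-system in the flux-system namespace):

```console
$ kubectl get kustomization flux-system -n flux-system \
    -o jsonpath='{.status.inventory.entries}'
```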
This is where Flux keeps track of which resources it is reconciling. Do you see the obsolete v1beta1 API reference listed in there?
I wonder if this is a weird effect of (1) upgrading the cluster past K8s v1.22 (which removes the deprecated beta APIs) while ClusterRoleBinding resources existed that were declared in Flux as v1beta1, then (2) removing them from the repo while the v1beta1 API was no longer served (after the cluster had already upgraded the stored objects to v1).
My expectation for upgrading a cluster that runs Flux past API deprecations like this is that it should be done in the reverse order (i.e. 2, then 1), so that there is no point where Flux tries to reconcile resources against an API that no longer exists. I'm guessing from your report that you did it in the opposite order. I can't say for sure whether we tested that path, but I'd expect some unexpected or undefined results there; it reads like a dark corner or edge-case hazard.
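You can confirm which RBAC API versions the upgraded cluster still serves with a standard kubectl query (on 1.22+ only v1 should appear):

```console
$ kubectl api-versions | grep rbac.authorization.k8s.io
rbac.authorization.k8s.io/v1
```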
Try removing the ClusterRoleBinding from the repo completely. (You should disable garbage collection first: in a separate commit and push, set `spec.prune` to false, reconcile Flux, and double-check that `prune` is disabled; then push a commit that deletes the CRB, so that it remains on the cluster and no workloads are impacted.)
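A minimal sketch of that first step, assuming the flux-system Kustomization with the same spec you posted below:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 10m0s
  path: ./cluster
  prune: false  # temporarily disabled; set back to true once the CRB is re-added
  sourceRef:
    kind: GitRepository
    name: flux-system
```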
Then, once Flux is no longer aware of the resource, add it back to its original location in flux-system again.
It sounds like Flux is trying to upgrade a resource for an API that no longer exists. If this is what's happening, maybe there is something Flux can do better; but without a list of APIs and the versions they upgrade from and to hardcoded into Flux, I'm not sure how Flux could handle upgrading resources in the usual Kubernetes way any better than it already does. Please confirm whether this helps (and, assuming it worked, remember to re-enable `spec.prune` when you are finished 👍).
Make sure also to set a timeout on your Flux Kustomization; otherwise, you may get strange behavior while waiting for health checks to complete. I think the default timeout is `spec.interval` minus 30s, so if your interval is 10m, you will have to wait nearly that long to see changes take effect. By setting `timeout` to a lower number of seconds, you can get better (faster) feedback from changes to a Kustomization that is having trouble.
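For example (the 2m value here is just an illustration; anything shorter than the interval tightens the feedback loop):

```yaml
spec:
  interval: 10m0s
  timeout: 2m0s  # fail fast instead of waiting out most of the interval
```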
I went back and re-read your report, and I guess you may have already tried all of my suggestions. I'm not sure whether any of this will help; I'm sorry for the trouble you're experiencing. You can try https://github.com/fluxcd/flux2/discussions where this might get more attention than here in the Flux v1 repo, which I think hardly anyone still monitors.
@kingdonb aww crap sorry about that, I googled fluxcd issues and got this lol. Do you want me to move the issue? I tried running the following:
```console
$ kubectl get kustomization flux-system -n flux-system -o yaml
```

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kustomize.toolkit.fluxcd.io/v1beta1","kind":"Kustomization","metadata":{"annotations":{"kustomize.toolkit.fluxcd.io/checksum":"08460cd08b4fa485ed68582eba9b1addf83f4840"},"labels":{"kustomize.toolkit.fluxcd.io/name":"flux-system","kustomize.toolkit.fluxcd.io/namespace":"flux-system"},"name":"flux-system","namespace":"flux-system"},"spec":{"interval":"10m0s","path":"./cluster","prune":true,"sourceRef":{"kind":"GitRepository","name":"flux-system"},"validation":"client"}}
    kustomize.toolkit.fluxcd.io/checksum: 08460cd08b4fa485ed68582eba9b1addf83f4840
    reconcile.fluxcd.io/requestedAt: "2021-09-07T13:24:35.508528687-07:00"
  creationTimestamp: "2021-08-20T05:25:32Z"
  finalizers:
  - finalizers.fluxcd.io
  generation: 1
  labels:
    kustomize.toolkit.fluxcd.io/name: flux-system
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: flux-system
  namespace: flux-system
  resourceVersion: "112177165"
  uid: 92ad0397-afda-45f6-9634-ebfa736c208b
spec:
  force: false
  interval: 10m0s
  path: ./cluster
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  validation: client
status:
  conditions:
  - lastTransitionTime: "2021-12-03T19:37:04Z"
    message: |
      ClusterRoleBinding/gitlab-admin dry-run failed, error: no matches for kind "ClusterRoleBinding" in version "rbac.authorization.k8s.io/v1beta1"
    reason: ReconciliationFailed
    status: "False"
    type: Ready
  lastAppliedRevision: master/bd16572bd961e1064914bdac473f1ffa82529707
  lastAttemptedRevision: master/85704d8c663cafaf824e2cbc5bffd4fdfc7c55b7
  lastHandledReconcileAt: "2021-09-07T13:24:35.508528687-07:00"
  observedGeneration: 1
```
I don't see an `entries` list. I'll try posting this over in the v2 forum as well, thanks!
(I have a setup where I can try to reproduce this; if it's a weird behavior between upgrading K8s 1.21 and 1.22, then I'll want to understand how it goes wrong in the worst case anyway.)
Let me give it a quick attempt to see if I can get one stuck in the same way you've described; it sounds like there might be something there. 👍
@kingdonb thanks. FWIW, I also ran the following:
```shell
flux bootstrap github \
  --components-extra=image-reflector-controller,image-automation-controller \
  --owner=$GITHUB_USER \
  --repository=k8sstuff \
  --branch=master \
  --path cluster \
  --read-write-key \
  --personal
```
and it throws the same message even though the pods complete, so following the upgrade guide to go from 0.14 to 0.24 of Flux v2 seems to do the same thing. I cross-posted this on the v2 forum; here's the link: https://github.com/fluxcd/flux2/discussions/2175
On my cluster just upgraded from k8s 1.21.7 to 1.22.4, I did not have any trouble writing an update to a ClusterRoleBinding that was stuck on the `v1beta1` RBAC API, straightforwardly upgrading it to `v1` on Flux v0.24.0:
```console
75-deis   False     ClusterRoleBinding/deis:deis-example namespace not specified, error: the server could not find the requested resource   staging/e3c5cf4ece118e8d4ce10ee4c04dba828a623aa6   False
75-deis   Unknown   reconciliation in progress                                            staging/e3c5cf4ece118e8d4ce10ee4c04dba828a623aa6   False
75-deis   True      Applied revision: staging/4a13b690c964c2324dc60caebdcdbc8e5e9af24f    staging/4a13b690c964c2324dc60caebdcdbc8e5e9af24f   False
```
I am not sure how to reproduce the trouble you are having. Can you please reiterate what was upgraded, and in what order, in enough detail to reproduce it? I tried out the edge case that I was most concerned might be untested, and I couldn't get it to fail at all. It may owe to my having set a value for `spec.timeout`, but that would be pretty surprising.
https://github.com/kingdonb/bootstrap-repo/commit/4a13b690c964c2324dc60caebdcdbc8e5e9af24f
This commit was pushed after confirming that the control plane and kubelets had all been upgraded and the v1beta1 API was definitely removed, which caused Flux to be unable to reconcile. I didn't have any trouble getting out of that by simply upgrading the API version in-place in the file, so I'm not sure how to reproduce the trouble in the first place.
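The in-place fix was just the one-line apiVersion bump in the manifest, something like this (the rest of the CRB stays unchanged):

```yaml
# was: apiVersion: rbac.authorization.k8s.io/v1beta1  (removed in k8s 1.22)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
```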
Agh, my error was for a different reason, so my test is spoilt... I just noticed that `ClusterRoleBinding/deis:deis-example namespace not specified` got in the way of the particular edge case that you are hitting. Let me try again! (I happen to keep a cluster running that I can tear down and re-bootstrap or destructively upgrade without disturbing anyone, for occasions such as this one...)
@kingdonb sorry to bother, I just resolved it. For some reason, pushing locally to GitHub wasn't updating the resource; I logged in and saw that the gitlab-admin service account file still had the v1beta1 API version. I edited the file directly in GitHub, and now it works. Sorry for being dense. =)
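One generic way to catch that kind of local-vs-remote mismatch is to diff your working tree against the remote branch (the file path here is hypothetical):

```console
$ git fetch origin
$ git diff origin/master -- cluster/gitlab-admin.yaml
```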
Glad you got it solved! Got all the cobwebs out of my cluster at least. Thanks for using Flux 🥇
Describe the bug
Here is my CRB yaml:
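The shape is roughly as follows (a reconstruction for illustration: the name and stale apiVersion come from the error quoted below, while the roleRef and subjects are assumptions):

```yaml
apiVersion: rbac.authorization.k8s.io/v1beta1  # the RBAC beta API, removed in k8s 1.22
kind: ClusterRoleBinding
metadata:
  name: gitlab-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin  # assumed for illustration
subjects:
- kind: ServiceAccount
  name: gitlab-admin      # assumed for illustration
  namespace: kube-system  # assumed for illustration
```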
I get the following warning when flux tries to reconcile: `ClusterRoleBinding/gitlab-admin dry-run failed, error: no matches for kind "ClusterRoleBinding" in version "rbac.authorization.k8s.io/v1beta1"`. This is after upgrading to the 1.22 version of k8s. I've removed/changed the API version in the gitlab-admin ClusterRoleBinding to v1 instead of v1beta1, even deleting the ClusterRoleBinding and re-applying it, as well as running bootstrap again and restarting the kustomization pod, but it still keeps throwing this error. If I do the following, I get no further matches, so why does `flux get kustomizations --watch` still fail?
Steps to reproduce
Install flux using bootstrap in k8s 1.22.x. Run either `flux get kustomizations` or `flux bootstrap blah blah` to upgrade.
Expected behavior
cluster reconciles
Kubernetes version / Distro / Cloud provider
kubeadm 1.22.2
Flux version
```
flux: v0.24.0
helm-controller: v0.14.0
image-automation-controller: v0.18.0
image-reflector-controller: v0.14.0
kustomize-controller: v0.18.1
notification-controller: v0.19.0
source-controller: v0.19.0
```
Git provider
github
Container Registry provider
dockerhub
Additional context
No response