Closed kvaps closed 5 days ago
I just read this PR over, haven't tested it yet. That is really neat! That is about how I imagined that it would look, and that is about how easy I imagined it would be. I will give it a try and see how it works.
I would not have foreseen that adding flux-operator in the core/cozy-system would so quickly lead us to this point.
When I enabled all the addons, I landed in a failed state:
```
% kg hr -w | grep -v 'succeeded'
NAMESPACE     NAME                                  AGE     READY   STATUS
tenant-root   kubernetes-harvey-cert-manager        9m9s    False   dependency 'tenant-root/kubernetes-harvey-cilium' is not ready
tenant-root   kubernetes-harvey-cilium              9m9s    False   Helm install failed for release cozy-cilium/cilium with chart cozy-cilium@0.1.29: client rate limiter Wait returned an error: context deadline exceeded
tenant-root   kubernetes-harvey-fluxcd              9m9s    False   dependency 'tenant-root/kubernetes-harvey-cilium' is not ready
tenant-root   kubernetes-harvey-fluxcd-operator     9m9s    False   dependency 'tenant-root/kubernetes-harvey-cilium' is not ready
tenant-root   kubernetes-harvey-ingress-nginx       9m9s    False   dependency 'tenant-root/kubernetes-harvey-cilium' is not ready
tenant-test   kubernetes-cluster-cilium             7m10s   False   Helm install failed for release cozy-cilium/cilium with chart cozy-cilium@0.1.29: client rate limiter Wait returned an error: context deadline exceeded
tenant-test   kubernetes-cluster-fluxcd             7m10s   False   dependency 'tenant-test/kubernetes-cluster-cilium' is not ready
tenant-test   kubernetes-cluster-fluxcd-operator    7m10s   False   dependency 'tenant-test/kubernetes-cluster-cilium' is not ready
```
The cilium installs just took a bit longer than the timeout to become ready; they didn't have any actual problem. But these addon charts requested by users have to cope with the HelmRelease state machine, which sometimes requires a `flux reconcile hr --force`.
I think you have defined retries and remediation in the HelmRelease resources elsewhere in the system; perhaps we can add some more of that here. Meanwhile I've done a `flux reconcile hr --force` on each cilium here, and that's enough to get it progressing nicely. I've just tested all the addons together on one cluster, plus only the flux addon on the other cluster, and this is looking like it's going to be really handy! 👏
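For reference, the remediation knobs could be set on the addon HelmReleases roughly like this (a sketch only; the `retries` counts, `timeout`, and the exact `apiVersion` for the Flux version in use are assumptions, while `spec.install.remediation` and `spec.upgrade.remediation` are standard helm-controller fields):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2   # use whichever version your Flux serves (e.g. v2beta2)
kind: HelmRelease
metadata:
  name: kubernetes-harvey-cilium
  namespace: tenant-root
spec:
  timeout: 10m            # give cilium more headroom than the default
  install:
    remediation:
      retries: 5          # re-attempt a failed install instead of staying failed
  upgrade:
    remediation:
      retries: 5
      remediateLastFailure: true
```

With something like that in place, the "took a bit longer than timeout" case should self-heal without a manual `flux reconcile hr --force`.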
I am seeing some issues today on a fresh deployment that I haven't seen before.
```shell
kubectl get hr -n cozy-fluxcd fluxcd-operator -o jsonpath='{.spec.values}' | helm upgrade -i --post-renderer ../../../scripts/fluxcd-kustomize.sh -n cozy-fluxcd fluxcd-operator . -f -
```

```
Error: UPGRADE FAILED: error while running post render on files: error while running command /cozystack/scripts/fluxcd-kustomize.sh. error output:
Variable $NAME is not set!
: exit status 1
make: *** [../../../scripts/package-system.mk:11: apply] Error 1
make: Leaving directory '/cozystack/packages/system/fluxcd-operator'
time="2024-06-27T19:38:04Z" level=fatal msg="failed to run" err="exit status 2"
```
These errors show up for just about as long as it takes for cilium to come online. The Flux controller pods will not be scheduled before CNI is ready; that's normal and good. I think they should communicate over CNI and not hostNetwork 👍 that configuration is fine.
I'm sure that one thing has something to do with the other, but I'm not sure how. Why does the post-render kustomization fail when it gets run, but only the first few times? Is there somewhere in installer.sh that package-system.mk gets invoked without a `NAME` or `NAMESPACE`? I can't see it right now.
Perhaps we should start the cilium install a bit earlier? I noticed that flux-operator installs, becomes ready, etc. before the Cilium install even begins. That should be reversed if possible, even if flux-operator is running with hostNetwork. It might have nothing to do with this bug, but it should be fixed before releasing 0.8 🤞
Whatever failure I saw also resulted in the flux-operator landing in a state with its HelmReleases suspended:
I see some scripts in here that suspend and resume stuff. I'm not sure exactly what they're trying to do, but it looks like they're trying to avoid cross-talk with the Flux controllers themselves while a `helm template | apply`-type operation is in progress.
I see it now: when flux_is_ok, the part of the invocation where `targets="apply resume"` is never reached. It should also check that the HelmRelease `fluxcd` is ready; then I think it will work. Making a change to test it out locally.
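That readiness gate could be a small helper along these lines (a sketch of the check I have in mind, not code from the PR; the function name and its call sites are hypothetical, while the jsonpath expression is standard kubectl):

```shell
# Sketch: succeed only when the named HelmRelease reports Ready=True.
# hr_is_ready is a hypothetical helper; kubectl's jsonpath condition query is real.
hr_is_ready() {
  ns=$1 name=$2
  status=$(kubectl get hr -n "$ns" "$name" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
  [ "$status" = "True" ]
}
```

Something like `hr_is_ready cozy-fluxcd fluxcd` in front of the resume path would keep `targets="apply resume"` from running until Flux is actually reconciling.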
I think that this line does not pass `NAME` and `NAMESPACE` to the post-render script, and that's causing `make apply` to fail whenever it gets called for fluxcd-operator. Here, in scripts/package-system.mk:

```shell
kubectl get hr -n $(NAMESPACE) $(NAME) -o jsonpath='{.spec.values}' | helm upgrade -i --post-renderer ../../../scripts/fluxcd-kustomize.sh -n $(NAMESPACE) $(NAME) . -f -
```
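If that's the cause, one minimal fix might be to put the variables into the pipeline's environment so the post-renderer inherits them (a sketch against the quoted recipe; the `apply` target name and the surrounding Makefile context are assumptions):

```make
apply:
	kubectl get hr -n $(NAMESPACE) $(NAME) -o jsonpath='{.spec.values}' | \
	  NAME=$(NAME) NAMESPACE=$(NAMESPACE) helm upgrade -i \
	  --post-renderer ../../../scripts/fluxcd-kustomize.sh \
	  -n $(NAMESPACE) $(NAME) . -f -
```

Helm runs the post-renderer as a child process, so env vars set on the `helm` invocation should be visible inside fluxcd-kustomize.sh.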
Somehow I still landed in this situation. Going to try another from-scratch e2e.
I think this is likely to land in this state because it might take longer than 60s for the fluxinstance CRD to become ready, and there is no recovery path if that happens...
```shell
make -C packages/system/fluxcd-operator $targets
wait_for_crds fluxinstances.fluxcd.controlplane.io
make -C packages/system/fluxcd $targets
wait_for_crds helmreleases.helm.toolkit.fluxcd.io helmrepositories.source.toolkit.fluxcd.io
```

```shell
wait_for_crds() {
  timeout 60 sh -c "until kubectl get crd $*; do sleep 1; done"
}
```
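A more forgiving variant might take a configurable budget and fail loudly instead of silently timing out (a sketch, not the PR's code; `WAIT_TIMEOUT`, the 300s default, and the error message are assumptions):

```shell
# Sketch of a hardened wait_for_crds: configurable time budget, explicit failure.
# Avoids sh -c so the loop runs in the current shell.
wait_for_crds() {
  deadline=$(( $(date +%s) + ${WAIT_TIMEOUT:-300} ))
  until kubectl get crd "$@" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "wait_for_crds: timed out waiting for: $*" >&2
      return 1
    fi
    sleep 1
  done
}
```

A nonzero return here would at least give installer.sh a hook for a recovery path instead of ploughing ahead without the fluxinstance CRD.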
When I go back and run `make -C packages/system/fluxcd apply resume`, it looks like all is fine.
FWIW, this issue doesn't block the rest of the HelmReleases from progressing.
I've had a think about what it means to land in this state, and I've realized that it means users can update any part of the `FluxInstance.spec` (including `spec.kustomize.patches`, `components`, `cluster`, or `distribution`) without worrying about being overridden by the cozystack reconciler.
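For context, the surface in question is roughly this shape (a sketch of a flux-operator `FluxInstance`; the metadata and every value here are illustrative assumptions, while `distribution`, `components`, `cluster`, and `kustomize.patches` are the spec fields named above):

```yaml
apiVersion: fluxcd.controlplane.io/v1
kind: FluxInstance
metadata:
  name: flux              # name/namespace assumed for illustration
  namespace: cozy-fluxcd
spec:
  distribution:
    version: "2.x"
    registry: ghcr.io/fluxcd
  components:
    - source-controller
    - helm-controller
  cluster:
    type: kubernetes
  kustomize:
    patches:              # arbitrary user patches over the rendered manifests
      - target:
          kind: Deployment
          name: helm-controller
        patch: |
          - op: add
            path: /spec/template/spec/containers/0/args/-
            value: --concurrent=10
```

So "suspended" effectively means any of these fields become user-owned rather than reconciler-owned.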
So, I'm not sure that it's a bug to land in that state? I think this can merge, if you have a look at #191
There was some drift in my local build/deploy branch, where I've been test-tagging releases. I don't see the issues described in the screenshots anymore now that I've rebased my `deploy` branch on the merged PR; it must have been an error in my fork.
The install proceeds smoothly and the `fluxcd` HelmRelease gets resumed and reconciled A-OK.
I'd like to test the addon stuff a bit more, but I don't think I have any other hangups about releasing 0.8.0 now. 🚀
Follow-up to https://github.com/aenix-io/cozystack/pull/183

Fixes https://github.com/aenix-io/cozystack/issues/159
Fixes https://github.com/aenix-io/cozystack/issues/157
Fixes https://github.com/aenix-io/cozystack/issues/185