aenix-io / cozystack

Free and Open Source PaaS-platform for seamless management of virtual machines, managed Kubernetes, and Databases-as-a-Service
https://cozystack.io
Apache License 2.0

Update Tenant Kubernetes Addons #186

kvaps closed this 5 days ago

kvaps commented 1 week ago

Follow up to https://github.com/aenix-io/cozystack/pull/183

fixes https://github.com/aenix-io/cozystack/issues/159
fixes https://github.com/aenix-io/cozystack/issues/157
fixes https://github.com/aenix-io/cozystack/issues/185

kingdonb commented 1 week ago

I just read this PR over; I haven't tested it yet. That is really neat! It's about how I imagined it would look, and about how easy I imagined it would be. I will give it a try and see how it works.

I would not have foreseen that adding flux-operator in the core/cozy-system would so quickly lead us to this point.

kingdonb commented 6 days ago

When I enabled all the addons, I landed in a failed state:

% kg hr -w|grep -v 'succeeded'
NAMESPACE                        NAME                                 AGE     READY   STATUS
tenant-root                      kubernetes-harvey-cert-manager       9m9s    False   dependency 'tenant-root/kubernetes-harvey-cilium' is not ready
tenant-root                      kubernetes-harvey-cilium             9m9s    False   Helm install failed for release cozy-cilium/cilium with chart cozy-cilium@0.1.29: client rate limiter Wait returned an error: context deadline exceeded
tenant-root                      kubernetes-harvey-fluxcd             9m9s    False   dependency 'tenant-root/kubernetes-harvey-cilium' is not ready
tenant-root                      kubernetes-harvey-fluxcd-operator    9m9s    False   dependency 'tenant-root/kubernetes-harvey-cilium' is not ready
tenant-root                      kubernetes-harvey-ingress-nginx      9m9s    False   dependency 'tenant-root/kubernetes-harvey-cilium' is not ready
tenant-test                      kubernetes-cluster-cilium            7m10s   False   Helm install failed for release cozy-cilium/cilium with chart cozy-cilium@0.1.29: client rate limiter Wait returned an error: context deadline exceeded
tenant-test                      kubernetes-cluster-fluxcd            7m10s   False   dependency 'tenant-test/kubernetes-cluster-cilium' is not ready
tenant-test                      kubernetes-cluster-fluxcd-operator   7m10s   False   dependency 'tenant-test/kubernetes-cluster-cilium' is not ready

The cilium installs just took a bit longer than the timeout to become ready; they didn't have any actual problem. But these addon charts requested by users have to cope with the HelmRelease state machine, which sometimes requires a flux reconcile hr --force.

I think you have defined retries and remediation in the HelmRelease resources elsewhere in the system; perhaps we can add some more of that here, roughly along the lines sketched below. Meanwhile I've done a flux reconcile hr --force on each cilium, and that's enough to get it progressing nicely. I've now tested all the addons together on one cluster, plus only the flux addon on the other cluster, and this is looking like it's going to be really handy! 👏
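
For reference, a HelmRelease can declare its own retry/remediation behavior, and bumping the timeout would also have covered this case. A minimal sketch of the kind of block I mean (the numbers here are made up, not what cozystack currently ships):

spec:
  timeout: 10m
  install:
    remediation:
      retries: 3
  upgrade:
    remediation:
      retries: 3
      remediateLastFailure: true

With install/upgrade remediation configured, the failed cilium releases would have been retried automatically instead of needing a manual flux reconcile hr --force.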

kingdonb commented 5 days ago

I am seeing some issues today on a fresh deployment that I haven't seen before.

kubectl get hr -n cozy-fluxcd fluxcd-operator -o jsonpath='{.spec.values}' | helm upgrade -i --post-renderer ../../../scripts/fluxcd-kustomize.sh -n cozy-fluxcd fluxcd-operator . -f -
Error: UPGRADE FAILED: error while running post render on files: error while running command /cozystack/scripts/fluxcd-kustomize.sh. error output:
Variable $NAME is not set!
: exit status 1
make: *** [../../../scripts/package-system.mk:11: apply] Error 1
make: Leaving directory '/cozystack/packages/system/fluxcd-operator'
time="2024-06-27T19:38:04Z" level=fatal msg="failed to run" err="exit status 2"

These errors show up for just about as long as it takes for cilium to come online. The Flux controller pods will not be scheduled before CNI is ready; that's normal and good. I think they should communicate over CNI and not hostNetwork 👍 that configuration is fine.

I'm sure that one thing has something to do with the other, but I'm not sure how. Why does the post-render kustomization fail when it gets run, but only the first few times? Is there somewhere in installer.sh where package-system.mk gets invoked without a NAME or NAMESPACE? I can't spot it right now.
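
I haven't pulled apart fluxcd-kustomize.sh yet, but the error reads like the script guards its inputs with something along these lines (hypothetical sketch, not the actual script contents):

#!/bin/sh
# post-renderer: helm pipes rendered manifests in on stdin and expects the
# kustomized output on stdout; NAME/NAMESPACE must come from the environment
if [ -z "$NAME" ]; then
  echo "Variable \$NAME is not set!" >&2
  exit 1
fi

If that's roughly what it does, then any make apply invocation that doesn't get NAME and NAMESPACE into helm's environment would bail out in exactly this way.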

Perhaps we should start the cilium install a bit earlier? I noticed that flux-operator installs, becomes ready, etc. before the Cilium install even begins. That should be reversed if possible. Even if flux-operator running with hostNetwork turns out to have nothing to do with this bug, it should be fixed before releasing 0.8 🤞

kingdonb commented 5 days ago

Whatever failure I saw also resulted in the flux-operator landing in a state with its HelmReleases suspended:

Screenshot 2024-06-27 at 3 54 04 PM

I see some scripts in here that suspend and resume things; I'm not sure exactly what they're trying to do, but it looks like they're trying to avoid cross-talk with the Flux controllers themselves while a helm template|apply type operation is in progress.

kingdonb commented 5 days ago

I see it now: when flux_is_ok, the part of the invocation where targets="apply resume" is never reached.

It should also check that the HelmRelease fluxcd is ready; then I think it will work. I'm making a change to test it out locally.
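
The gate I'm testing is along these lines (sketch only; the release name, namespace, and timeout are assumptions on my part):

# don't treat flux as OK until the fluxcd HelmRelease itself reports Ready
kubectl wait helmrelease/fluxcd -n cozy-fluxcd --for=condition=Ready --timeout=120s

The idea being that if the release isn't Ready, flux_is_ok reports failure and the installer falls through to targets="apply resume", which un-suspends it.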

kingdonb commented 5 days ago

I think this line does not pass NAME and NAMESPACE to the post-render script, which is causing make apply to fail whenever it gets called for fluxcd-operator here:

  kubectl get hr -n $(NAMESPACE) $(NAME) -o jsonpath='{.spec.values}' | helm upgrade -i --post-renderer ../../../scripts/fluxcd-kustomize.sh -n $(NAMESPACE) $(NAME) . -f -

in scripts/package-system.mk
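
One possible fix, just a sketch on my side and not tested against the other packages, is to put the variables into helm's environment so the post-renderer child process inherits them:

  kubectl get hr -n $(NAMESPACE) $(NAME) -o jsonpath='{.spec.values}' | NAME=$(NAME) NAMESPACE=$(NAMESPACE) helm upgrade -i --post-renderer ../../../scripts/fluxcd-kustomize.sh -n $(NAMESPACE) $(NAME) . -f -

helm runs the post-renderer as a subprocess, so anything set on the helm command line is visible to the script.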

kingdonb commented 5 days ago
Screenshot 2024-06-27 at 9 10 24 PM

Somehow I still landed in this situation. Going to try another from-scratch e2e.

kingdonb commented 5 days ago

I think this is likely to land in this state because it might take longer than 60s for the fluxinstance CRD to become ready, and there is no recovery path if that happens...

make -C packages/system/fluxcd-operator $targets
wait_for_crds fluxinstances.fluxcd.controlplane.io
make -C packages/system/fluxcd $targets
wait_for_crds helmreleases.helm.toolkit.fluxcd.io helmrepositories.source.toolkit.fluxcd.io

wait_for_crds() {
  timeout 60 sh -c "until kubectl get crd $*; do sleep 1; done"
}
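
Something like this would be more forgiving (a sketch; the 300s budget is arbitrary, and waiting on the Established condition is just one way to express "actually usable"):

wait_for_crds() {
  # give the CRDs up to 5 minutes to appear, then require them to be Established
  timeout 300 sh -c "until kubectl get crd $*; do sleep 1; done"
  kubectl wait --for=condition=established --timeout=60s crd $*
}

That alone doesn't add a recovery path if the timeout is still exceeded, but it makes the window much less likely to be hit.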

When I go back and run

make -C packages/system/fluxcd apply resume

it looks like all is fine.

FWIW, this issue doesn't block the rest of the HelmReleases from progressing

kingdonb commented 5 days ago

I've had a think about what it means to land in this state, and I've realized that it means users can update any part of the FluxInstance.spec (including any of spec.kustomize.patches, components, cluster, or distribution) without worrying about being overridden by the cozystack reconciler.

So, I'm not sure that it's a bug to land in that state? I think this can merge, if you have a look at #191
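
For context, the FluxInstance spec in question looks roughly like this (written from memory of the flux-operator API, so treat the exact fields, values, and namespace as illustrative):

apiVersion: fluxcd.controlplane.io/v1
kind: FluxInstance
metadata:
  name: flux
  namespace: cozy-fluxcd
spec:
  distribution:
    version: 2.x
    registry: ghcr.io/fluxcd
  components:
    - source-controller
    - kustomize-controller
    - helm-controller
  cluster:
    type: kubernetes
    domain: cluster.local
  kustomize:
    patches:
      - target:
          kind: Deployment
          name: helm-controller
        patch: |
          - op: add
            path: /spec/template/spec/containers/0/args/-
            value: --concurrent=10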

kingdonb commented 4 days ago

There was some drift in my local build/deploy branch, where I've been test-tagging releases. After I rebased my deploy branch on the merged PR, I no longer see the issues described in the screenshots; it must have been an error in my fork.

The install proceeds smoothly and the fluxcd helmrelease gets resumed and reconciled A-OK.

I'd like to test the addon stuff a bit more, but I don't think I have any other hangups about releasing 0.8.0 now. 🚀