aenix-io / cozystack

Free and Open Source PaaS-platform for seamless management of virtual machines, managed Kubernetes, and Databases-as-a-Service
https://cozystack.io
Apache License 2.0

Allow to run FluxCD per tenant #184

Open kvaps opened 1 week ago

kvaps commented 1 week ago

We want a Flux checkbox right next to etcd, ingress, and monitoring.

This would allow us to move FluxCD into the system applications, so it would have a packaged Helm chart to install from.
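For illustration, the checkbox could map to something like this in the tenant's values (a purely hypothetical sketch; the field names are illustrative, not the actual Cozystack schema):

```yaml
# Hypothetical tenant values sketch only, not the real Cozystack schema.
etcd: true        # existing checkbox
ingress: true     # existing checkbox
monitoring: true  # existing checkbox
fluxcd: true      # the proposed Flux checkbox
```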

kvaps commented 1 week ago

Probably locked by https://github.com/controlplaneio-fluxcd/flux-operator/issues/55

kingdonb commented 4 days ago

The addons from #186 probably address this; however, the current state of the main branch lands in a Helm Install Failed timeout state (on the Cilium HelmRelease), at least on my machine. Things take a while to come online, but the Cilium HelmRelease isn't waiting for the control plane to become ready before it starts the clock, and since all the other addons depend on Cilium, this is only resolved by e.g. flux reconcile helmrelease --force -n tenant-test kubernetes-cluster-cilium
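For context, a minimal sketch of the kind of HelmRelease tuning that would avoid that dead-end state: a longer timeout plus unlimited install remediation, so helm-controller keeps retrying until the control plane is reachable instead of parking in Helm Install Failed. The timeout value and the HelmRepository name are assumptions, not the addon's actual configuration:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kubernetes-cluster-cilium  # name taken from the reconcile command above
  namespace: tenant-test
spec:
  interval: 5m
  timeout: 15m                     # assumed value, to let slow control planes come up
  install:
    remediation:
      retries: -1                  # negative means retry indefinitely instead of failing
  chart:
    spec:
      chart: cilium
      sourceRef:
        kind: HelmRepository
        name: cilium               # assumed repository name
```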

The flux-operator really isn't designed to run more than one instance of Flux per cluster. I'm not sure how that would work when upgrades are done: you'd have to coordinate the upgrade between all consumers of the CRDs so there is no mismatch between CRD versions and controllers, and so I'm not sure it would meet the requirements.

Managing instances of Flux on remote clusters is a feature I could see coming in the future, perhaps, since each FluxInstance could then have its own control plane, manage its own set of CRDs, etc.

But the current solution of installing flux-operator as a system package and putting it on every cluster that requires Flux makes sense. The only open question for me is how configuration happens: if tenants order services through the dashboard, they could configure their own Flux instance, but I got the impression from reading the docs that tenants don't really get access to the dashboard, so they can't set up their own configuration like that.

The 3-way merge works, but tenants also can't suspend/resume HelmReleases (or force a reconcile, or control anything else about the HelmRelease spec that lives on the cozy.local cluster). So they are not really in control of those definitions: as long as they are imposed from outside, as the addon does it, they can be overwritten by a helm upgrade at any time.

Would it make any sense at all for the flux addon to have a mode that only provides flux-operator and leaves the creation/configuration of any FluxInstance to the cluster user?
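To make that concrete, a tenant-managed instance would look roughly like the FluxInstance example from the flux-operator docs (which require the resource to be named flux, in the operator's namespace); the version and component list here are placeholders:

```yaml
apiVersion: fluxcd.controlplane.io/v1
kind: FluxInstance
metadata:
  name: flux                  # the operator requires this exact name
  namespace: flux-system
spec:
  distribution:
    version: "2.x"            # placeholder: the tenant picks their own upgrade cadence
    registry: "ghcr.io/fluxcd"
  components:
    - source-controller
    - kustomize-controller
    - helm-controller
    - notification-controller
```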

gecube commented 4 days ago

I'd also add that @stefanprodan is probably preparing the CRDs in such a way that they are backward compatible. So the CRDs should always stay that way... and I can explain.

kingdonb commented 4 days ago

The design of Flux multi-tenancy is orthogonal to the notion of sharding, or of multiple Flux instances per cluster.

Sharding was added as a last resort for users who need tens of thousands of HelmReleases per cluster; it is now a scaling consideration more than an accommodation for tenancy.

While it has been considered that you could use it for other purposes like flux-per-tenant (especially where secrets need to be mounted into the controller, e.g. AWS IRSA or Vault for secrets), those ideas are not really well explored in documentation or tests; there is no support for multi-tenant sharding today.
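For reference, sharding today is opt-in per object via a well-known label, and each extra controller instance is started with a matching --watch-label-selector (per the Flux sharding guide); the resource names below are made up:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: some-app                    # made-up name
  namespace: tenant-a
  labels:
    sharding.fluxcd.io/key: shard1  # routes this object to the shard1 controllers
spec:
  interval: 10m
  chart:
    spec:
      chart: some-app
      sourceRef:
        kind: HelmRepository
        name: some-repo
```

Note that this only partitions the workload across controllers; every shard still shares the one set of cluster-wide CRDs.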

Yes, the CRDs have some strong backwards-compatibility guarantees, but that doesn't cancel out this coordination issue.

Users of Flux need a flux CLI that matches their deployed controllers and CRDs one-to-one. If it doesn't match, you get strange errors now and then; the issues will vary with the extent of the mismatch and the changes between versions.

So it is a coordination issue. If users on the same cluster are tenants, they should not be expected to coordinate with each other in any way. The idea of a FluxCD per tenant implies that a tenant can choose when to do this upgrade. But when one tenant upgrades, they install CRDs from the next release, and all other tenants must either upgrade too or suffer the mismatch.

If flux-operator always installs the latest Flux instance, that solves the coordination issue, but letting tenants decide their own upgrade cadence, without being bound to each other, seems like a fundamental purpose of this kind of implementation. And it's unfortunately not as simple as a version skew policy like kubectl's: the Flux docs are (or at least should be) quite explicit that the CLI version and the installed distribution need to match. You can have situations where, from one minor version to the next, the new CLI will not work with the CRDs from the prior version because of apiVersion mismatches.

gecube commented 4 days ago

We will have the same issue with tenant cluster versions and compatibility between the kubectl utility and the tenants' clusters...

kingdonb commented 4 days ago

Not exactly: kubectl supports a version skew policy of plus or minus one minor version:

https://kubernetes.io/releases/version-skew-policy/#kubectl

So you are guaranteed that kubectl 1.29 works with kubelet + API server 1.30.

There's no transitional skew guarantee like that for Flux. When you download your new Flux CLI, the next step is to run the bootstrap upgrade and alert your colleagues that it's time to upgrade their CLIs.

I can't explain the CRD backwards-compatibility guarantees of Flux in detail right now, but they are based on the alpha/beta support guarantees from Kubernetes upstream. E.g. when an API is promoted from alpha to beta, it's considered production ready, and automatic upgrades are supported for one or more versions, until the deprecated API version can be removed. (Whereas an alpha API can have fields that don't make the cut; there's no requirement for a long waiting period before deprecated fields are removed in alpha, and you can have breaking API changes from one version to the next. Automatic upgrades are also not guaranteed from alpha to beta, the way they are from beta to v1.)

All Flux APIs are currently beta or later, so those guarantees are in place.
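To illustrate the mechanics, here is a trimmed-down sketch of how a CRD serves two API versions at once during a deprecation window (schemas reduced to a permissive stub for brevity; the real Flux CRDs carry full schemas):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: helmreleases.helm.toolkit.fluxcd.io
spec:
  group: helm.toolkit.fluxcd.io
  scope: Namespaced
  names:
    kind: HelmRelease
    plural: helmreleases
  versions:
    - name: v2beta2
      served: true        # still answered by the API server
      deprecated: true    # but clients are warned to move off it
      storage: false
      schema:
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true
    - name: v2
      served: true
      storage: true       # objects are persisted at the GA version
      schema:
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true
```

Once a deprecated version is dropped from that list, any client still requesting it, including an older flux CLI, gets an error straight from the API server.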

An example of where you can have trouble: the Flux 2.3 CLI can't work with Flux 2.2, because some deprecated API versions were finally removed after years. You will get crazy errors if you try mixing across those versions. On one hand this does not happen very often; on the other hand, it is predictable long in advance, it does happen, and it will probably happen again.
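Concretely, that break looks like this at the manifest level (the metadata here is made up): an object written for the old API has no backing version once v2beta1 is removed, and the GA version is the only long-term target:

```yaml
# Accepted by Flux 2.2; rejected by a Flux 2.3 cluster,
# where helm.toolkit.fluxcd.io/v2beta1 is no longer served.
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: demo        # made-up name
  namespace: default
spec:
  interval: 10m
  chart:
    spec:
      chart: demo
      sourceRef:
        kind: HelmRepository
        name: demo
---
# The same object at the GA API version served by Flux 2.3.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: demo
  namespace: default
spec:
  interval: 10m
  chart:
    spec:
      chart: demo
      sourceRef:
        kind: HelmRepository
        name: demo
```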

gecube commented 4 days ago

> the Flux 2.3 CLI can't work with Flux 2.2, because some deprecated API versions were finally removed after years.

That is totally incomprehensible, because kubectl works (!)