SovereignCloudStack / cluster-stacks

Definition of Cluster Stacks based on the ClusterAPI ClusterClass feature
https://scs.community/
Apache License 2.0

Mental model behind clusterstacks #96

Open mxmxchere opened 1 month ago

mxmxchere commented 1 month ago

The concept of cluster-stacks boils down to "pin everything": from the node image, via the k8s version, to the addons and the cluster-class. This gives one the certainty that everything works the way it did when it was tested. It is a great technique for achieving reproducible results. However, reproducibility is not the only goal we want to achieve: maintainability is important too (that means we should avoid repeating code and minimize the amount of code). Another thing we have to keep in mind is upgrade paths: clusters in SCS should be able to live long (forever) and be seamlessly upgradeable from one version to the next. The current "pin everything" approach succeeds in achieving reproducibility but performs poorly in the areas "maintainability" and "operations".

Maintainability: In case a cluster-addon, cluster-class or node-image supports multiple kubernetes versions, the current approach still forces us to duplicate the code of the cluster-class, the cluster-addons and strictly speaking also the node images.

Operations: The strict link from cluster-class to kubernetes-version to cluster-addons to images can cause trouble when upgrading a cluster as all components are replaced at once, instead of being replaced one after another and only when necessary.

In addition to that, the cluster-stack approach in its current form leverages zero knowledge about kubernetes and builds solely on the deterministic assumption that the same input (all components are pinned) will result in the same output (the tested, known-good state).

In reality we have knowledge and documentation about the dependencies of the components (compatibility matrices):

Assumption 1 (cluster-class): there is no dependency between the kubernetes version and the cluster-class -> every cluster-class can be used with every kubernetes version and vice versa.

Assumption 2 (cluster-addon): cluster-addons support not only one specific kubernetes version but a kubernetes version range which spans at least two kubernetes minor versions.

If Assumption 2 does not hold for a cluster-addon (as is officially the case for OCCM), we should work towards it.
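Assumption 2 can be expressed as a simple check on an addon's supported range. The following is only a minimal sketch: `MinorRange` and `satisfies_assumption_2` are hypothetical names, and the two example ranges are made up for illustration, not taken from any upstream matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinorRange:
    """Inclusive range of Kubernetes minor versions, e.g. 1.27 to 1.29."""

    low: tuple[int, int]   # (major, minor)
    high: tuple[int, int]  # (major, minor)

    def minors_covered(self) -> int:
        # Counting minors is only meaningful within one major version.
        if self.low[0] != self.high[0]:
            raise ValueError("range must stay within one major version")
        return self.high[1] - self.low[1] + 1


def satisfies_assumption_2(supported: MinorRange) -> bool:
    """True if the addon supports at least two Kubernetes minor versions."""
    return supported.minors_covered() >= 2


# Hypothetical addon ranges (placeholders).
wide_addon = MinorRange(low=(1, 27), high=(1, 29))
pinned_addon = MinorRange(low=(1, 28), high=(1, 28))
```

An addon like `pinned_addon`, which is tied to exactly one minor version, is the case that would need upstream work before the strict pinning can be relaxed.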

These two assumptions allow us to relax the strict "pin-everything" approach and sacrifice a bit of reproducibility to gain better maintainability and operability. In practice this would help us in the following way:

Of course I am not proposing "everything works with everything" (the opposite of "pin everything"). We can, for example with webhooks, make very fine-grained restrictions. Example: "kube-vip does not work with kubernetes version 1.28+". This will also allow us to give nice warnings to the user. Example: "Please upgrade cluster-addon x to version y to use kubernetes version z". We can still restrict non-functional combinations, but we will need to gain more knowledge about working ranges. We can use the upstream docs to build and maintain these compatibility matrices and put them into the validating code. Examples for upstream kubernetes-range knowledge:

https://docs.tigera.io/calico/latest/getting-started/kubernetes/requirements#supported-versions
https://docs.cilium.io/en/stable/network/kubernetes/compatibility/
https://github.com/kubernetes-sigs/metrics-server?tab=readme-ov-file#compatibility-matrix
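To make the idea of "validating code" a bit more concrete, here is a rough sketch of how such a compatibility matrix could drive the messages above. Everything in it is an assumption for illustration: the `COMPAT` table, the addon name `addon-x`, the version labels and all version numbers are placeholders, and in practice the logic would sit behind a validating webhook rather than in a standalone function.

```python
from typing import Optional

Minor = tuple[int, int]  # (major, minor), e.g. (1, 28)

# Hypothetical data: addon -> addon version -> (min, max) Kubernetes minor.
# A max of None means "no known upper bound". All numbers are placeholders.
COMPAT: dict[str, dict[str, tuple[Minor, Optional[Minor]]]] = {
    "kube-vip": {"v0": ((1, 25), (1, 27))},  # mirrors "does not work with 1.28+"
    "addon-x": {"v1": ((1, 26), (1, 27)), "v2": ((1, 27), (1, 29))},
}


def parse_minor(version: str) -> Minor:
    """Turn 'v1.28.3' or '1.28' into (1, 28)."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def supports(addon: str, addon_version: str, k8s: Minor) -> bool:
    """Check whether the given addon version covers the Kubernetes minor."""
    low, high = COMPAT[addon][addon_version]
    return low <= k8s and (high is None or k8s <= high)


def validate(addons: dict[str, str], k8s_version: str) -> list[str]:
    """Return one message per addon that blocks the requested version."""
    k8s = parse_minor(k8s_version)
    messages = []
    for addon, current in addons.items():
        if supports(addon, current, k8s):
            continue
        # Suggest the first known addon version that supports the target.
        candidates = [v for v in COMPAT[addon] if supports(addon, v, k8s)]
        if candidates:
            messages.append(
                f"Please upgrade cluster-addon {addon} to version "
                f"{candidates[0]} to use kubernetes version {k8s_version}"
            )
        else:
            messages.append(
                f"{addon} does not work with kubernetes version {k8s_version}"
            )
    return messages
```

Calling `validate({"kube-vip": "v0", "addon-x": "v1"}, "1.28.0")` yields one hard restriction for kube-vip and one upgrade hint for `addon-x`, while the same addons pass without messages for `"v1.27.4"`. Only the combinations known to be broken are rejected; everything inside a documented range is allowed.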