[Standard] Stabilize node distribution standard

cah-hbaum commented 1 week ago

Follow-up for https://github.com/SovereignCloudStack/standards/pull/524. The goal is to set the Node distribution standard to Stable once all discussion topics have been debated and decided, and the necessary changes derived from these discussions have been integrated into the standard and its test.

The following topics need to be discussed:

[ ] How is node distribution handled on installations with shared control-plane nodes (Kamaji, Gardener, etc.)? - see e.g. https://github.com/SovereignCloudStack/standards/pull/524#pullrequestreview-2122476212
[ ] What should be done about control planes with e.g. 3 nodes containing 3 etcd members that are distributed across only 2 physical machines, or with 3 control-plane nodes and 2 etcd nodes (and similar scenarios)? - see e.g. https://github.com/SovereignCloudStack/standards/pull/524#discussion_r1642411303
[ ] Where is the differentiation between Node distribution and things like High Availability or Redundancy? Should this standard only be a precursor for a `High Availability` standard? (more information under #579)
[ ] Should information about external etcd (https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/ha-topology/#external-etcd-topology) be integrated here? (see https://github.com/SovereignCloudStack/standards/pull/524#discussion_r1642540079)
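
To illustrate why the second topic matters, here is a minimal sketch, assuming etcd's usual majority quorum and a hypothetical input format (one physical host name per etcd member). The function name and the host names are made up for illustration and are not part of the standard or its test.

```python
from collections import Counter


def survives_any_single_host_failure(member_hosts):
    """Return True if etcd keeps quorum after losing any one physical host.

    member_hosts: one entry per etcd member, naming the physical machine
    that member runs on (hypothetical input format, for illustration only).
    """
    if not member_hosts:
        return False
    total = len(member_hosts)
    quorum = total // 2 + 1  # strict majority of all members
    per_host = Counter(member_hosts)
    # Worst case: the host carrying the most members goes down.
    worst_loss = max(per_host.values())
    return total - worst_loss >= quorum


# 3 etcd members on only 2 physical machines: losing host-a loses quorum.
print(survives_any_single_host_failure(["host-a", "host-a", "host-b"]))  # False
# 3 etcd members on 3 physical machines: any single host may fail.
print(survives_any_single_host_failure(["host-a", "host-b", "host-c"]))  # True
# 2 etcd members: quorum is 2, so no single failure is tolerated.
print(survives_any_single_host_failure(["host-a", "host-b"]))  # False
```

So under these assumptions, three control-plane nodes alone say little about failure tolerance; what counts is how the etcd members map to physical machines.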

cah-hbaum commented 5 days ago

Topic 1: How is node distribution handled on installations with shared control-plane nodes?

e.g. Kamaji, Gardener, etc

This question was answered in Container Call 2024-06-27:

Standard case Kamaji: dedicated control-plane components with shared etcd, everything hosted in k8s (no dedicated nodes). etcd is deployed with anti-affinity (kube-scheduler tries to spread it across nodes). The relation of the nodes to each other is unknown to k8s.
Gardener:
- Non-HA: single-replica control plane (dedicated, but hosted in a shared seed cluster).
- HA: multiple replicas, hosted in the seed cluster but with awareness to tolerate zone failure or node failure (https://gardener.cloud/docs/guides/high-availability/control-plane/#node-failure-tolerance).

For example, regiocloud supports the Node Failure Tolerance case but not the Zone Failure Tolerance.

cah-hbaum commented 3 days ago

Topic 2: Differentiation between `Node distribution` and things like `High Availability`, `Redundancy`, etc.

I think to discuss this topic correctly, most of the wording/concepts need to be established first. I'm going to try and find multiple (if different) sources and link them here for different things.

High Availability

The main goal of HA is to avoid downtime, which is the period of time when a system, service, application, cloud service, or feature is either unavailable or not functioning properly. (https://www.f5.com/glossary/high-availability)

High availability means that an IT system, component, or application can operate at a high level, continuously, without intervention, for a given time period. ... (https://www.cisco.com/c/en/us/solutions/hybrid-work/what-is-high-availability.html)

High availability means that we eliminate single points of failure so that should one of those components go down, the application or system can continue running as intended. In other words, there will be minimal system downtime — or, in a perfect world, zero downtime — as a result of that failure. (https://www.mongodb.com/resources/basics/high-availability)

So things labelled High Availability generally try to avoid downtime of their services, with zero downtime as the goal, which is usually not achievable. This can also be seen in this section: ... In fact, this concept is often expressed using a standard known as "five nines," meaning that 99.999% of the time, systems work as expected. This is the (ambitious) desired availability standard that most of us are aiming for. ... (https://www.mongodb.com/resources/basics/high-availability). To achieve these goals, services, hardware or networks are usually provided in a redundant setup, which allows automatic fail-over if instances go down.
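
To put the "five nines" figure into perspective, here is a small arithmetic sketch. It assumes a 365-day year and treats availability as the fraction of time a system is up; the function name is made up for illustration.

```python
def allowed_downtime_minutes_per_year(availability_percent):
    """Convert an availability percentage into permitted downtime per year."""
    minutes_per_year = 365 * 24 * 60  # 525600, leap years ignored
    return minutes_per_year * (1 - availability_percent / 100)


for level in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes_per_year(level)
    print(f"{level}% -> {minutes:.2f} min/year")
```

Under these assumptions, 99.999% leaves roughly 5.26 minutes of downtime per year, which shows why the quoted source calls the target ambitious.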

Redundancy

In engineering and systems theory, redundancy is the intentional duplication of critical components or functions of a system with the goal of increasing reliability of the system... (https://en.wikipedia.org/wiki/Redundancy_(engineering))

In cloud computing, redundancy refers to the duplication of certain components or functions of a system with the intention of increasing its reliability and availability. (https://www.economize.cloud/glossary/redundancy)

HINT: WILL BE CONTINUED LATER

martinmo commented 1 day ago

I brought this issue up in today's Team Container Call and edited the above sections accordingly. As part of #649 we will also get access to Gardener and soon Kamaji clusters.

One thing I want to make you aware of @cah-hbaum: in the call, it was pointed out that the term "shared control-plane" isn't correct. The control plane itself isn't shared; the control-plane nodes are, so we should always say "shared control-plane nodes".

(I edited the above texts accordingly as well to refer to shared control-plane nodes.)
