SovereignCloudStack / standards

SCS standards in a machine readable format
https://scs.community/
Creative Commons Attribution Share Alike 4.0 International

[Standard] Stabilize node distribution standard #639

Open cah-hbaum opened 1 week ago

cah-hbaum commented 1 week ago

Follow-up for https://github.com/SovereignCloudStack/standards/pull/524. The goal is to set the Node distribution standard to Stable once all discussion topics have been debated and decided, and the changes derived from these discussions have been integrated into the standard and its test.

The following topics need to be discussed:

cah-hbaum commented 5 days ago

Topic 1: How is node distribution handled on installations with shared control-plane nodes?

e.g. Kamaji, Gardener, etc.

This question was answered in Container Call 2024-06-27:

For example, regiocloud supports the Node Failure Tolerance case but not the Zone Failure Tolerance.
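To make the two cases concrete, here is a minimal sketch of how one could tell them apart from node labels. It is not the standard's conformance test. It assumes that "Node Failure Tolerance" means the control-plane nodes sit on more than one host and "Zone Failure Tolerance" means they sit in more than one zone, and that the well-known Kubernetes labels `kubernetes.io/hostname` and `topology.kubernetes.io/zone` are set. The function name and the sample data are hypothetical.

```python
from typing import Dict, List

# Well-known Kubernetes labels; whether the standard relies on exactly
# these labels is an assumption of this sketch.
HOST_LABEL = "kubernetes.io/hostname"
ZONE_LABEL = "topology.kubernetes.io/zone"


def tolerated_failures(node_labels: List[Dict[str, str]]) -> Dict[str, bool]:
    """Report which failure cases a set of control-plane nodes survives.

    node_labels holds one label dictionary per control-plane node.
    Nodes that lack a label are ignored for that failure case.
    """
    hosts = {labels[HOST_LABEL] for labels in node_labels if HOST_LABEL in labels}
    zones = {labels[ZONE_LABEL] for labels in node_labels if ZONE_LABEL in labels}
    return {
        "node_failure": len(hosts) > 1,  # spread over several hosts
        "zone_failure": len(zones) > 1,  # spread over several zones
    }


# Hypothetical example: three hosts, all in the same zone.
example = [
    {HOST_LABEL: "host-a", ZONE_LABEL: "zone-1"},
    {HOST_LABEL: "host-b", ZONE_LABEL: "zone-1"},
    {HOST_LABEL: "host-c", ZONE_LABEL: "zone-1"},
]
print(tolerated_failures(example))
```

With this sample data, node failure is tolerated and zone failure is not, which matches the situation described for regiocloud.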

cah-hbaum commented 3 days ago

Topic 2: Differentiation between Node distribution and things like High Availability, Redundancy, etc.

To discuss this topic properly, I think most of the terms and concepts need to be established first. I'm going to collect multiple sources for each term (where they differ) and link them here.


High Availability

The main goal of HA is to avoid downtime, which is the period of time when a system, service, application, cloud service, or feature is either unavailable or not functioning properly. (https://www.f5.com/glossary/high-availability)

High availability means that an IT system, component, or application can operate at a high level, continuously, without intervention, for a given time period. ... (https://www.cisco.com/c/en/us/solutions/hybrid-work/what-is-high-availability.html)

High availability means that we eliminate single points of failure so that should one of those components go down, the application or system can continue running as intended. In other words, there will be minimal system downtime — or, in a perfect world, zero downtime — as a result of that failure. (https://www.mongodb.com/resources/basics/high-availability)

So things described as highly available generally try to avoid downtime of their services, with the goal of zero downtime, which is usually not achievable. This can also be seen in this section:

... In fact, this concept is often expressed using a standard known as "five nines," meaning that 99.999% of the time, systems work as expected. This is the (ambitious) desired availability standard that most of us are aiming for. ... (https://www.mongodb.com/resources/basics/high-availability)

To achieve these goals, services, hardware or networks are usually provided in a redundant setup, which allows automatic failover if instances go down.
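To put the "five nines" figure into perspective, this small sketch converts an availability percentage into the downtime it allows per year. It assumes a 365-day year. The function name is made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Return the yearly downtime budget for a given availability."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_seconds(target) / 60
    print(f"{target}% -> {minutes:.2f} minutes of downtime per year")
```

For 99.999% this comes out at roughly 5.26 minutes per year, which shows why the quoted source calls the target ambitious.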


Redundancy

In engineering and systems theory, redundancy is the intentional duplication of critical components or functions of a system with the goal of increasing reliability of the system... (https://en.wikipedia.org/wiki/Redundancy_(engineering))

In cloud computing, redundancy refers to the duplication of certain components or functions of a system with the intention of increasing its reliability and availability. (https://www.economize.cloud/glossary/redundancy)
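As a rough illustration of why duplication increases availability: if replicas fail independently (a simplifying assumption that rarely holds exactly, for example when replicas share a host or a zone), the service is only down when all replicas are down at once. The function name below is made up for illustration.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of a service that is up while at least one replica is up.

    single is the availability of one replica as a fraction (0.99 = 99%).
    Assumes replicas fail independently of each other.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    # The service is down only if every replica is down at the same time.
    return 1 - (1 - single) ** replicas


for count in (1, 2, 3):
    print(count, combined_availability(0.99, count))
```

Under this assumption, two replicas at 99% each give about 99.99% combined. Replicas placed on the same node or in the same zone are not independent, which is where node distribution comes in.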

HINT: WILL BE CONTINUED LATER

martinmo commented 1 day ago

I brought this issue up in today's Team Container Call and edited the above sections accordingly. As part of #649 we will also get access to Gardener and soon Kamaji clusters.

One thing I want to make you aware of, @cah-hbaum: in the call, it was pointed out that the term "shared control-plane" isn't correct. The control plane itself isn't shared; rather, the control-plane nodes are shared, so we should always say "shared control-plane nodes".

(I also edited the texts above accordingly to refer to shared control-plane nodes.)