SovereignCloudStack / standards

SCS standards in a machine readable format
https://scs.community/
Creative Commons Attribution Share Alike 4.0 International

Availability Zones: standardized levels of independence. #539

Open josephineSei opened 3 months ago

josephineSei commented 3 months ago

Availability Zones are a concept in OpenStack.

As a user of an SCS-compliant cloud, I want to know what I can expect from AZs overall and what depends on the CSP.

Definition of Done:

Please refer to scs-0001-v1 for details.

josephineSei commented 3 months ago

Why are we talking about AZs? AZs focus on redundancy and failure safety at the IaaS level.

Redundancy at the lowest level could be as simple as replication in the storage backend, so that no data is lost when hardware fails. At the other end, the requirements can be as strict as having a remote mirror of all data.

  1. AI: We should at least document what the different levels of redundancy mean and what failure safety different deployments can provide (@anjastrunk, maybe the latter would be something for Gaia-X self-descriptions?)

Pre-Requirements

To also allow small deployments or edge deployments, which usually have only a single AZ, we must not require a certain number of AZs. Redundancy and failure safety in that case should be handled at the next higher level (PaaS, CaaS, workload...) by the user.

We should rather define and check when AZs can be defined and used.

What can AZs be defined of / What can they separate

AZs are logical separations with a chance of physical separation. AZs can be defined:

Problems:

Restrictions:

A good, if slightly outdated, overview was presented at the Summit in 2018 ( https://www.youtube.com/watch?v=a5332_Ew9JA )

Proposal:

josephineSei commented 3 months ago

I created a hedgedoc for CSPs to talk about their AZ usage: https://input.scs.community/Availability-Zone-Usage#

josephineSei commented 2 months ago

So far there has not been much input, so I put it on the agenda again for the next IaaS call.

josephineSei commented 1 month ago

A few CSPs answered the questions in the hedgedoc, so we can go on with the work on AZs. The hedgedoc also contains a proposal for what to use.

The problem is that deployments which use AZs differently might not be able to change, because changing the AZ architecture is quite fundamental. All such deployments would automatically be rendered SCS-incompatible.

Another option is to use the failsafe levels that will be defined in https://github.com/SovereignCloudStack/standards/issues/527. This would be more vague, so we should discuss whether we want it or not.

josephineSei commented 3 weeks ago

We should base our standard on the Taxonomy DR (https://github.com/SovereignCloudStack/standards/pull/579) and start a requirement analysis from those levels.

Additionally the answers from CSPs in the hedgedoc are quite helpful, as they already proposed some minimum setup for AZs:

AZ definition

    AZs must be in separate fire protection zones
    AZs must have independent power supplies
    AZs must have independent cooling
    AZs should have independent uplinks to the internet
        plusserver: must
    AZs must not depend on a single core router
    AZs must have high bandwidth, low-latency (<3ms RTT) interconnection
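
The proposed minimum setup could be turned into a machine-checkable description. The following is only a sketch in Python: `AZFacts`, its field names and `shared_resources` are hypothetical and not taken from any SCS standard. It reports every resource that more than one AZ depends on, together with the requirement level from the list above. The interconnection requirement (bandwidth, RTT) is not covered here.

```python
from dataclasses import dataclass

# Requirement level per resource, following the CSP proposal above.
REQUIREMENT_LEVEL = {
    "fire_zone": "MUST",
    "power_supply": "MUST",
    "cooling": "MUST",
    "uplink": "SHOULD",
    "core_router": "MUST",
}


@dataclass
class AZFacts:
    """Hypothetical self-description of one AZ; field names are illustrative."""
    name: str
    fire_zone: str
    power_supply: str
    cooling: str
    uplink: str
    core_router: str


def shared_resources(azs):
    """Return (level, field, value, AZ names) for each resource used by more than one AZ."""
    findings = []
    for field, level in REQUIREMENT_LEVEL.items():
        users = {}
        for az in azs:
            users.setdefault(getattr(az, field), []).append(az.name)
        for value, names in users.items():
            if len(names) > 1:
                findings.append((level, field, value, names))
    return findings
```

An empty result means that no two AZs share any of the listed resources. A `SHOULD` finding would be a warning rather than a failure.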

We also have to consider that AZs exist for Compute, Network and Storage independently. Some failure cases can easily be mitigated by configuration (storage), while others are not easily manageable in OpenStack:

    Compute hosts are always per AZ (in multi-AZ setups)
    Block storage may be global (=per-region) and set up such that it survives AZ failure (preferred option) OR may use the same AZs as compute
    Network service should NOT be per AZ
        To be discussed
        it is a major inconvenience for users to have per-AZ networks
        As network AZ hints are not ignored (despite the name “hint”) in a single-AZ setup, there is no reasonable way to define IaC setups (e.g. with opentofu HCL) that work on both setups with and without network AZs

Requirements

  1. AZs should represent parts of the same deployment that are independent of each other
  2. AZs should be able to take workload from another AZ in a Failure Case of Level 3 (in other words: the destruction of one AZ will not automatically include destruction of the other AZs)
    • Compute: resources are bound to one AZ, replication cannot be guaranteed, downtime or loss of resources is most likely
    • Storage: highly dependent on the storage configuration; replication, even across different AZs, is part of some storage backends
    • Network: network resources are also stored as configuration pattern in the DB and could be materialized in other parts of a deployment easily as long as the DB is still available.
  3. We should not require AZs to be present (== allow small deployments and edge use cases)
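
Requirement 2 can be read as a capacity condition: after the loss of any one AZ, the remaining AZs need enough free capacity to take its workload. A rough sketch, assuming capacity and usage are given as plain numbers per AZ (for example vCPUs). The function name and data layout are illustrative only, and the sketch ignores that compute resources are bound to their AZ and have to be recreated by the user.

```python
def survives_single_az_loss(capacity, used):
    """Return True if, for every AZ, the other AZs have enough free capacity for its load.

    capacity and used map AZ names to numbers in the same unit.
    """
    if len(capacity) < 2:
        return False  # a single-AZ deployment has nowhere to fail over to
    for failed in capacity:
        free_elsewhere = sum(
            capacity[az] - used.get(az, 0) for az in capacity if az != failed
        )
        if used.get(failed, 0) > free_elsewhere:
            return False
    return True
```

For a single-AZ deployment the function returns `False`, which matches requirement 3: such deployments are allowed, but failover has to happen at a higher level.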

Decisions

  1. AZs should only occur within the same deployment and have an interconnection that represents that (we should not require specific numbers in bandwidth and latency.)
  2. We should separate between AZs for different resources (Compute, Storage, Network)
    • Compute needs AZs (because VMs may be single point of failure) if failure case 3 may occur (part of the deployment is destroyed, if the deployment is small there will be no failure case three, as the whole deployment will be destroyed)
    • Storage should either be replicated over different zones (e.g. fire zones) that are equivalent to compute AZs or also use AZs
    • Network does not need AZs
  3. Power supply may be confused with power line in. Maybe a PDU is what we should talk about - those need to exist for each AZ independently.
  4. When we define fire zone == compute AZ, then every AZ of course has to fulfill the guidelines for a single fire zone. Maybe this should be stated implicitly rather than explicitly.
  5. Internet uplinks: after the destruction of one AZ, uplink to the internet must still be possible (this can be done without requiring separate uplinks for each AZ.)
  6. Each AZ should be designed with as few single points of failure as possible (e.g. a single core router) to avoid a situation where a failure of class 2 disables a whole AZ and so leads to a failure of class 3.
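
Decision 5 can be checked without demanding one uplink per AZ. A minimal sketch, assuming each uplink is described by the set of AZs it physically depends on (an empty set meaning it sits outside all AZs); the function name and data model are hypothetical.

```python
def uplink_survives_any_az_loss(azs, uplinks):
    """Return True if at least one uplink remains after the loss of any single AZ.

    uplinks maps an uplink name to the set of AZs it physically depends on.
    """
    for failed in azs:
        if not any(failed not in depends_on for depends_on in uplinks.values()):
            return False
    return True
```

Two uplinks that each depend on a different AZ pass the check, and so does a single uplink located outside all AZs. A single uplink housed inside one AZ does not.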

josephineSei commented 6 days ago

In today's IaaS call, we discussed a few open questions:

Network AZ

In the standard I noted that it is possible to have Network AZs, but that this has downsides for users. Thus I did not make any recommendations. We discussed whether we even want to discourage CSPs from using them ("SHOULD NOT"):

Cross-Attach AZ

The question was whether we want to encourage, allow, discourage or disallow this.

Overall