Open josephineSei opened 3 months ago
Why are we talking about AZs: AZs focus on redundancy and failure safety at the IaaS level.
At the lowest level, redundancy can simply mean replication in the storage backend, so that a hardware failure causes no data loss; at the other end, the requirements can be as strict as keeping a remote mirror of all data.
To also allow small deployments or edge deployments, which usually have only a single AZ, we must not require a certain number of AZs. In that case, redundancy and failure safety should be handled by the user at the next higher level (PaaS, CaaS, workload...).
We should rather define and check under which conditions AZs can be defined and used.
AZs are logical separations with a chance of physical separation. AZs can be defined:
Problems:
Restrictions:
cross_az_attach (nova) allows or disallows attachment of volumes from other AZs.
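To make the restriction concrete, here is a minimal sketch of the decision that cross_az_attach implies. It is a simplified model and not nova's actual implementation; the function name attach_allowed and the AZ names in the usage note are hypothetical.

```python
def attach_allowed(server_az: str, volume_az: str, cross_az_attach: bool) -> bool:
    """Simplified model: may a volume in volume_az be attached to a server in server_az?"""
    if cross_az_attach:
        # Attachment across AZs is permitted, so the AZs need not match.
        return True
    # Otherwise the volume has to live in the same AZ as the server.
    return server_az == volume_az
```

With cross_az_attach disabled, attach_allowed("az1", "az2", False) is rejected, which is why users on such a cloud have to create volumes in the AZ of the server that will use them.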
A good, if somewhat outdated, overview was presented at the Summit in 2018 ( https://www.youtube.com/watch?v=a5332_Ew9JA )
I created a hedgedoc for CSPs to talk about their AZ usage: https://input.scs.community/Availability-Zone-Usage#
Up until now there has not been much input, so I put it on the agenda again for the next IaaS call.
A few CSPs answered the questions in the hedgedoc, so we can go on with the work on AZs. The hedgedoc also contains a proposal for what to use.
The problem here is that deployments which use AZs differently might not be able to change, because changing the AZ architecture is quite fundamental. All those other deployments would automatically be rendered scs-incompatible.
Another option is to use the failsafe levels that will be defined in https://github.com/SovereignCloudStack/standards/issues/527. This would be more vague, so we should discuss whether we want this or not.
We should base our standard on the Taxonomy DR (https://github.com/SovereignCloudStack/standards/pull/579) and start a requirement analysis from those levels.
Additionally, the answers from CSPs in the hedgedoc are quite helpful, as they already propose a minimum setup for AZs:
AZ definition
AZs must be in separate fire protection zones
AZs must have independent power supplies
AZs must have independent cooling
AZs should have independent uplinks to the internet (plusserer: must)
AZs must not depend on a single core router
AZs must have high bandwidth, low-latency (<3ms RTT) interconnection
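The proposed rules could be checked pairwise between AZs. The following is only a sketch of such a check: the AZ class, its field names and the identifiers used to describe the infrastructure are all hypothetical, and how a CSP would actually obtain this data (or measure the RTT) is left open.

```python
from dataclasses import dataclass

MAX_RTT_MS = 3.0  # "<3ms RTT" from the proposal above


@dataclass(frozen=True)
class AZ:
    """Hypothetical description of the infrastructure behind one AZ."""
    name: str
    fire_zone: str
    power_supply: str
    cooling: str
    uplink: str
    core_router: str


def check_pair(a: AZ, b: AZ, rtt_ms: float) -> tuple[list[str], list[str]]:
    """Return (violated MUST rules, violated SHOULD rules) for two AZs."""
    must, should = [], []
    for field, label in [
        ("fire_zone", "fire protection zone"),
        ("power_supply", "power supply"),
        ("cooling", "cooling"),
        ("core_router", "core router"),
    ]:
        # Two AZs pointing at the same resource are not independent of each other.
        if getattr(a, field) == getattr(b, field):
            must.append(f"{a.name} and {b.name} share the same {label}")
    if a.uplink == b.uplink:
        should.append(f"{a.name} and {b.name} share the same internet uplink")
    if rtt_ms >= MAX_RTT_MS:
        must.append(f"RTT between {a.name} and {b.name} is {rtt_ms} ms, not below {MAX_RTT_MS} ms")
    return must, should
```

Keeping MUST and SHOULD findings in separate lists means the uplink rule can later be moved from one list to the other, depending on the outcome of the should/must discussion.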
We also have to consider that AZs exist for Compute, Network and Storage independently. Some of them can easily be mitigated by configuration (storage), while others are not easily manageable in OpenStack:
Compute hosts are always per AZ (in multi-AZ setups)
Block storage may be global (=per-region) and setup such that it survives AZ failure (preferred option) OR may use the same AZs as compute
Network service should NOT be per AZ
To be discussed
it is a major inconvenience for users to have per-AZ networks
As network AZ hints are not ignored (despite the name “hint”) in a single-AZ setup, there is no reasonable way to define IaC setups (e.g. with OpenTofu HCL) that work both on setups with and on setups without network AZs
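With an imperative client, a possible workaround is to look up the network AZs first and only then decide whether to send a hint; declarative IaC definitions have no equally simple way to do this. The sketch below shows that decision only. The helper network_create_args is hypothetical, and the list of network AZs is assumed to have been fetched from the cloud beforehand; availability_zone_hints is the attribute name used for networks in the Neutron API.

```python
from typing import Optional


def network_create_args(name: str, wanted_az: Optional[str], network_azs: list[str]) -> dict:
    """Build keyword arguments for a network-create call (hypothetical helper)."""
    args = {"name": name}
    # Only pass a hint if the cloud actually reports that network AZ;
    # on a cloud without network AZs the hint is left out entirely.
    if wanted_az is not None and wanted_az in network_azs:
        args["availability_zone_hints"] = [wanted_az]
    return args
```

On a cloud that reports no network AZs, network_create_args("net", "az1", []) yields just the name, so the same script can run against both kinds of setups.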
In today's IaaS call, we discussed a few open questions:
In the standard I discussed that it is possible to have Network AZs, but that this has downsides for users. Thus I did not make any recommendations. We discussed whether we even want to discourage CSPs from using them ("SHOULD NOT"):
The question was whether we want to encourage, allow, discourage or disallow this.
Availability Zones are a concept in OpenStack.
As a user of an scs-conformant cloud I want to know what I can expect from AZs overall and what depends on the CSP.
Definition of Done:
Please refer to scs-0001-v1 for details.
scs-xxxx-v1-slug.md (only substitute slug); status, type, track set
Draft, file renamed: xxxx replaced by document number (Draft)