SovereignCloudStack / issues

This repository is used for issues that are cross-repository or not bound to a specific repository.
https://github.com/orgs/SovereignCloudStack/projects/6

SCS K8s cluster standardization #181

Closed garloff closed 3 months ago

garloff commented 2 years ago

As a DevOps team (=SCS user), I want to be able to create and use clusters on many different SCS-compliant container providers, where all relevant properties are either predefined by the SCS standard or can be controlled by a provider-independent cluster-settings.yaml file. Relevant properties are those that tend to create trouble for application deployment, e.g. k8s versions, CNI features, persistent volumes, ingress/load-balancers, anti-affinity rules (to avoid placing k8s nodes on the same host) ...

These properties should either be fixed by SCS (and then of course only evolve slowly over time) or be controllable by the customer (via a standardized, provider-independent cluster-params.yaml). For the controllable properties, we mandate existence and syntax and we may mandate all or some of the supported options. In any case, the supported options need to be discoverable (and the mechanism for discoverability should include the fixed properties as well).

Note that there is value in standardizing things that are not mandatory, in order for providers to use the same name/semantics for same things. (Obviously optional features may become mandatory for providers in the future if we decide so.)

Hints:

Extensibility: We allow for extensions, but they must be clearly distinguishable from standardized properties.
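To illustrate how extensions could stay clearly distinguishable from standardized properties, here is a minimal sketch. Everything in it is an assumption made for illustration: the key names, the `x-` prefix convention and the example values are hypothetical and not defined by any SCS standard.

```python
# Hypothetical sketch: neither the key names nor the "x-" prefix are defined
# by SCS; they only illustrate separating standardized from extension keys.
STANDARD_KEYS = {
    "kubernetes_version",
    "control_plane_count",
    "worker_count",
    "worker_flavor",
}
EXTENSION_PREFIX = "x-"  # assumed marker for provider-specific extensions


def classify_settings(settings):
    """Split a settings mapping into standardized, extension and unknown keys."""
    standardized, extensions, unknown = {}, {}, {}
    for key, value in settings.items():
        if key in STANDARD_KEYS:
            standardized[key] = value
        elif key.startswith(EXTENSION_PREFIX):
            extensions[key] = value
        else:
            # Neither standardized nor marked as an extension: likely a typo.
            unknown[key] = value
    return standardized, extensions, unknown


if __name__ == "__main__":
    example = {
        "kubernetes_version": "1.28",  # example value only
        "worker_count": 3,
        "x-acme-gpu-pool": True,       # provider extension, clearly marked
        "workr_flavor": "SCS-2V-4",    # misspelled key, ends up in "unknown"
    }
    std, ext, unk = classify_settings(example)
    print("standardized:", sorted(std))
    print("extensions:  ", sorted(ext))
    print("unknown:     ", sorted(unk))
```

With a convention like this, a provider could reject or warn about unknown keys while still accepting its own marked extensions.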

This epic should list the standardization proposals / ADRs as issues that we as the SCS community want to define as relevant for SCS compliance. Some of the proposals might not make it into a v1 of the SCS standard (because they are not ready, are deemed not important enough, or are downgraded to recommendations). The individual proposed properties / ADRs should come with a rationale and with (ideally comprehensive) conformance tests. We want to evolve the reference implementation(s) in parallel to the standardization, but intellectually keep a clear distinction between standards and implementation.

We need to create conformance tests for these properties; it is useful to define standards in terms of tests that must pass. (Test-driven standardization!) Obviously, using existing test suites (such as CNCF/sonobuoy or aqua/kube-bench) and possibly contributing to them is a good way to do this.

Inspiration for the list below:

Individual topics for standardization:

Networking

Container Registry

Meta

Automation

Identity Management

Logging & Metrics

Security & Robustness

Storage

Tests

Definition of Done:

mbuechse commented 1 year ago

@jschoone @garloff It seems that this existing epic does for the CaaS track what I intended the new epic https://github.com/SovereignCloudStack/standards/issues/285 to do for the IaaS track. I guess it remains to compare the description here with the table https://input.scs.community/tqKlv1Z_Srmi5e5o76CxhQ?view#KaaS-Layer I took from Kurt's slides and maybe update accordingly? For instance, two standards have already been ticked off, even though we still need to implement the conformance tests -- @cah-hbaum will write the corresponding issues, and so I could add those to this epic. Please tell me if you disagree with anything I just wrote.

mbuechse commented 1 year ago

Comparison between this epic and the table from Kurt's ALASCA talk slides

Please check what should be added here or what I did wrong @garloff @jschoone.

garloff commented 1 year ago

TL;DR: I want them all to be considered and discussed. Not all of them necessarily become a mandatory standard. Maybe some of them don't even become a recommendation.

Comparison between this epic and the table from Kurt's ALASCA talk slides

* Present in this epic, but missing in the slides (really? or did I just fail to align them?)

  * LBs don't require special annotations (upstream nginx deployment works out of the box): Service type LoadBalancer with externalTrafficPolicy: Local needs to work out of the box [Service type LoadBalancer with externalTrafficPolicy: Local needs to work out of the box SovereignCloudStack/issues#212](https://github.com/SovereignCloudStack/issues/issues/212)

The thing here is that nginx upstream uses externalTrafficPolicy: Local and assumes that

(1) Traffic is only routed to the nodes that run the nginx container -- which requires a health monitor to be configured, which on many LBs (including the Octavia one) requires a special annotation or a changed default

(2) The original client IP is visible and not obscured by the LB -- L2/L3 LB instead of L4. Yet, the occm tends to prefer HTTP L7 health checks ... Discussion here is on SovereignCloudStack/issues#212 and numerous subsequent issues, indeed.
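For reference, the kind of Service this is about can be sketched as follows. The Service name, selector label and port are illustrative assumptions rather than a copy of the upstream nginx manifest; the point is only that `type: LoadBalancer` with `externalTrafficPolicy: Local` appears without any provider-specific annotations.

```python
import json


def loadbalancer_service(name, selector, port=80):
    """Build a Service manifest without any provider-specific annotations."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},  # deliberately no "annotations" key
        "spec": {
            "type": "LoadBalancer",
            # Only nodes running a backing pod should receive traffic,
            # and the client source IP should be preserved.
            "externalTrafficPolicy": "Local",
            "selector": selector,
            "ports": [
                {"name": "http", "protocol": "TCP", "port": port, "targetPort": port}
            ],
        },
    }


if __name__ == "__main__":
    # Name and label are hypothetical examples.
    manifest = loadbalancer_service(
        "ingress-nginx-controller", {"app.kubernetes.io/name": "ingress-nginx"}
    )
    # kubectl accepts JSON as well as YAML, e.g. piped into `kubectl apply -f -`.
    print(json.dumps(manifest, indent=2))
```

A conformance test along these lines would apply such a manifest unchanged and then check that traffic reaches the pods and that they see the original client IP.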

  * ControlPlane and Worker machine flavors and counts (translation from SCS flavors needed for non-SCS IaaS?)

For both ControlPlane and Worker nodes, the number of nodes and the flavors need to be configurable. The mandatory SCS flavors need to be accepted for the latter. (Sidenote: This is a cluster-management feature, not a cluster property -- the latter being something you can rely on once a cluster exists.)

* Present in the slides, but missing in this epic:

  * CNCF conformance tests (not linked to any issue so far)

We have sonobuoy binary installed on the management cluster and run it to test the workload clusters for CNCF conformance. So we have tooling to test CNCF conformance and we want to require CNCF conformance for all clusters.
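A rough sketch of how such a run could be scripted is shown below. It only assembles the command lines; the kubeconfig path is a hypothetical placeholder, and the flags reflect my understanding of the Sonobuoy CLI (`run --mode certified-conformance --wait`, `retrieve`, `delete`), so they should be checked against `sonobuoy run --help` for the installed version.

```python
import shlex


def sonobuoy_commands(kubeconfig):
    """Return the command lines for a conformance run against one cluster."""
    return [
        # Start the certified conformance run and block until it finishes.
        ["sonobuoy", "run", "--kubeconfig", kubeconfig,
         "--mode", "certified-conformance", "--wait"],
        # Download the results tarball into the current directory.
        ["sonobuoy", "retrieve", "--kubeconfig", kubeconfig],
        # Clean up the sonobuoy namespace in the workload cluster.
        ["sonobuoy", "delete", "--kubeconfig", kubeconfig, "--wait"],
    ]


if __name__ == "__main__":
    # Hypothetical path to the workload cluster's kubeconfig.
    for cmd in sonobuoy_commands("workload-cluster.yaml"):
        # Printed only; pass each list to subprocess.run(cmd, check=True)
        # on the management cluster to actually execute it.
        print(shlex.join(cmd))
```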

  * K8s version support period (not linked to any issue so far)
    * note: "Offered K8s version recency" is present as [Supported k8s versions SovereignCloudStack/issues#219](https://github.com/SovereignCloudStack/issues/issues/219)

We have a standard on this: scs-0210-v1. Maybe we need to amend it so that providers must not drop support for a minor k8s version before upstream ends security support (~14 months after a release). And maybe we should recommend that, for managed clusters, the provider sends a warning to the users when they have a cluster entering the extended support period (after ~12 months) and aligns the needed upgrades with them?
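The two thresholds can be sketched as simple date arithmetic. This is only an illustration of the idea: the 12 and 14 month values are the approximate figures from the paragraph above, the function names are made up, and the release date in the example is a placeholder rather than a statement about any particular k8s version.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Add calendar months to a date, clamping the day to the month's end."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def support_status(release_date, today, warn_after=12, end_after=14):
    """Classify a minor version as 'supported', 'warn' or 'unsupported'."""
    if today >= add_months(release_date, end_after):
        return "unsupported"  # upstream security support has ended
    if today >= add_months(release_date, warn_after):
        return "warn"         # extended support period: notify users
    return "supported"


if __name__ == "__main__":
    release = date(2023, 8, 15)  # placeholder release date
    for today in (date(2024, 6, 1), date(2024, 9, 1), date(2024, 11, 1)):
        print(today, support_status(release, today))
```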

  * Identity federation via OIDC, [Understand the requirements towards the IdP Broker to support the container layer SovereignCloudStack/issues#194](https://github.com/SovereignCloudStack/issues/issues/194)
  * Machine identities, [Implement Machine Identities SovereignCloudStack/issues#163](https://github.com/SovereignCloudStack/issues/issues/163)
  * Control plane backup/maintenance, [etcd maintenance k8s-cluster-api-provider#258](https://github.com/SovereignCloudStack/k8s-cluster-api-provider/issues/258)
  * Kube API access controls, [Add ability to limit access to k8s API k8s-cluster-api-provider#246](https://github.com/SovereignCloudStack/k8s-cluster-api-provider/issues/246)
  * Container registry (opt-in), [Container registry: Create overview of needed and desirable features and map OSS solutions against it. SovereignCloudStack/issues#263](https://github.com/SovereignCloudStack/issues/issues/263)
  * Cluster management API, [SCS K8s cluster standardization SovereignCloudStack/issues#181](https://github.com/SovereignCloudStack/issues/issues/181)
  * GitOps controller for Cluster Mgmt (not linked to any issue so far)

We had some concepts written down for this -- and determined that this should be optional (for the customer). This should become a requirement for the to-be-developed cluster stacks: have the ability for the cluster parameters to be pulled from a git repo (using tooling like Flux or Argo).

Please check what should be added here or what I did wrong @garloff @jschoone.

I did not check these for completeness, but everything above looks desirable to me.

Note: I believe we have two kinds of standards here: (1) What are the properties of the created clusters?

(2) What is the standardized parameter format and API to create, modify and delete clusters?

mbuechse commented 1 year ago

@garloff I amended the description of this issue by everything that hadn't been in there. Maybe we can now go ahead and group the items a bit, like I did in https://github.com/SovereignCloudStack/standards/issues/285.

cah-hbaum commented 1 year ago

I updated the epic and grouped everything a bit more together. But I think in the long run, something like a table would be better, since the "pre"-work for the standard issues is done in other issues or over multiple ones. I can make a table here, so that the whole thing gets grouped better, if that is desired.

cah-hbaum commented 1 year ago

I created individual issues for nearly all points not yet covered by previous issues. I left a few open, since they seemed way too general and broad.

mbuechse commented 1 year ago

@cah-hbaum That sounds great! I also like the new structure in the description above. 👍👍👍

cah-hbaum commented 1 year ago

Short term

Medium term

Long term

Not enough information

Blocked

Already working on

martinmo commented 3 months ago

Closing in favor of https://github.com/SovereignCloudStack/standards/issues/615.