gardener / gardener

Homogeneous Kubernetes clusters at scale on any infrastructure using hosted control planes.
https://gardener.cloud
Apache License 2.0

Autonomous Shoot Clusters #2906

Open vlerenc opened 3 years ago

vlerenc commented 3 years ago

Autonomous Shoot Clusters

Gardener has become a great tool to create and manage clusters with very low TCO. Part of this success is its design of running control planes in seed clusters under the joint supervision of Gardener and Kubernetes. However, there is sufficient pull to also find a way to create autonomous shoot clusters using Gardener (e.g. for the edge or for air-gapped scenarios), where the control plane must run side-by-side with the worker nodes and cannot run in a seed cluster. This BLI caters to this demand.

Definition

Autonomous shoot clusters do not have their control plane in a seed cluster, but operate it on dedicated control plane nodes in the same network, alongside the worker nodes, which makes these clusters autonomous (hence the name).

Related or Similar Terms, Term Disambiguation

Autonomous shoot clusters are sometimes confused with untethered shoot clusters, but there is no direct relationship. Tethering and untethering are terms that were introduced to describe whether or not a shoot cluster is managed (at the moment) by Gardener. If it's tethered, life cycle management operations, such as updating its spec/Kubernetes or operating system versions, are possible. A tethered shoot cluster would appear under Gardener's "single pane of glass" (so to speak). An untethered shoot cluster, sometimes also called an "air-gapped" shoot cluster, is temporarily not managed by Gardener. Today, management cannot be "turned off", but in the future this could become a deliberate decision to avoid any interference (e.g. while a mission-critical operation is ongoing on the customer's side/site). In that sense, regular shoot clusters can also be untethered, but today that usually happens only in an emergency, e.g. if the seed cluster lost network connectivity to the garden cluster. The term, or rather the state, is therefore not unknown to Gardener (it is part of our resilience tests). Tethering and untethering now become dedicated terms for the actions that lead to these states. While untethered shoot clusters will be more commonplace with autonomous shoot clusters, those clusters do not have to be untethered, certainly not at all times, and so these terms are not synonymous.

Autonomous shoot clusters are sometimes confused with bare metal/VM shoot clusters, but there is no direct relationship either. You could have bare metal/VM nodes also joining a regular shoot cluster control plane (exclusively or in combination with managed worker nodes; see Gardener Slack channel). So these two things, autonomous shoot clusters and bare metal/VM shoot clusters, are not synonymous. Of course, the means to let bare metal/VM nodes join a cluster will probably become a valuable building block, making autonomous shoot clusters more valuable and opening the use case up further.

Autonomous shoot clusters are sometimes confused with on-prem shoot clusters, but there is no direct relationship either. You could run an on-prem Gardener managing regular shoot clusters on OpenStack or vSphere, while on the other hand you could have autonomous shoot clusters on fully managed cloud providers like AWS, Azure, or GCP to be independent of a seed cluster (e.g. to avoid the network traffic, firewall the entire cluster off, avoid runtime dependencies, fulfill compliance obligations, etc.). So these two things, autonomous shoot clusters and on-prem shoot clusters, are not synonymous. Of course, for those customers who do not want to host their own Gardener but have clusters on-prem, autonomous shoot clusters will become an option and appear synonymous to them, if they don't want to run the control planes managed by us in the cloud (for security and compliance reasons) or simply cannot run them on a remote seed cluster (for technical reasons such as network connectivity) and need them side-by-side with their worker nodes.

Autonomous shoot clusters were also called masterful shoot clusters (in contrast to today's masterless shoot clusters), but this term is no longer politically correct.

Why (do we do this)

We want to establish Gardener in (new) environments where clusters cannot run their control plane "somewhere else", but need it side-by-side with their worker nodes, e.g. to avoid the network traffic, firewall the entire cluster off, avoid runtime dependencies, fulfill compliance obligations, etc.

We do not plan to make this a drop-in replacement for k3s or kubeadm (that would be too far away from Gardener's mission statement, inception goals, and current implementation), but want to offer this new type of shoot cluster as a separate flavor to reach on-prem use cases that cannot be served with Gardener today.

On-Premise Use Cases

All companies have their own "Enterprise IT" that leverages the cloud, their own DCs/partner DCs, or both. Sometimes, companies also have a less experienced "Plant IT" helping with the shop floor IT (anything from small racks to NUCs, mostly carrying out orders or installing pre-fabricated packages issued by "Enterprise IT"). These companies use managed Kubernetes services in the cloud, but also Kubernetes distros such as Rancher and OpenShift on-premise, managed by their "Enterprise/Plant IT".

Gardener has supported the cloud use case since its inception (zero touch). With autonomous shoot clusters, Gardener would like to establish itself as an option for on-premise use cases as well, such as:

Note: IronCore is not considered bare metal/VM in the context of autonomous shoot clusters (even if IronCore runs on bare metal/VM nodes, so do AWS, Azure, and GCP). From Gardener's point of view, IronCore is indistinguishable from any other cloud provider. It offers programmable infrastructure components such as networks and virtual machines just like any other cloud provider for Gardener and therefore doesn't need any special handling. Also, we cannot assume to find IronCore at all companies. Bare metal/VM nodes, on the other hand, are ubiquitous.

Gardener Runtime/Soil Cluster Replacement

Replacing our runtime and soil clusters with our own technology has been the "original goal", to avoid the dependency on another Kubernetes service or distro, and indeed it would make Gardener truly independent of others. However, on a more rational level, this by itself is not a sufficient reason for this (complex) undertaking. Only a few parties run Gardener installations, since these serve the purpose of provisioning thousands of clusters, and those who do can either use an existing managed Kubernetes service, a distro like k3s, or basic tools such as kubeadm (they have the expertise). Also, Gardener will never become a drop-in replacement for k3s or kubeadm (that would be too far away from Gardener's mission statement, inception goals, and current implementation).

The bottom line is that autonomous shoot clusters are not strictly needed to replace the runtime/soil clusters, because we do not really need to replace them. That means we should probably avoid implementing a form of autonomous shoot cluster that is totally independent of any running Gardener anywhere, as that is probably a lot more difficult to achieve than relying on an existing Gardener. If autonomous shoot clusters can be created from an existing Gardener (in the cloud, on premise, hosted from a notebook kind cluster, or whatever), this can and will help us later to either pivot an existing Gardener to this cluster or deploy a new one there and then, in both cases, tether (=claim) said cluster from the Gardener that runs on it, so that it becomes self-hosted. Soil clusters are an even easier goal and will be straightforward, if we can create (and manage) autonomous shoot clusters from an existing Gardener.

How (do we do this)

In order to compete with already established tools or solutions, it must be easy to set up autonomous shoot clusters or we will never be considered.

Prerequisites, Assumptions, Principles

After having discussed the why, we can now define the boundary conditions and scope. For instance, it will probably be a lot harder to bring up a Gardener autonomous shoot cluster without a Gardener. If that becomes a goal (later), we can imagine mapping this problem to a local Gardener, e.g. running on a notebook kind cluster. Of course, that's not as slim as a single-binary installer, but such an installer would probably take us a lot farther away from our current Gardener code and thereby become a separate piece of code that bears only little resemblance to the Gardener we know today (we'd like to avoid a situation that others have been facing, e.g. GKE and GKE-on-prem or Kubermatic and KubeOne being totally different things).
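For illustration only, such a bootstrap environment could be as small as a single-node kind cluster; the config below is a plain kind example and does not show any Gardener-specific setup:

```yaml
# Minimal kind cluster that a bootstrap Gardener could run on (illustrative sketch).
# Create it with: kind create cluster --config kind-bootstrap.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane   # a single node is enough for a throwaway bootstrap environment
```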

That said, these are our prerequisites, assumptions, and principles (trying to avoid a complete rewrite of Gardener, playing nicely together with Gardener and its architecture):

Idea

The idea is to transform the shoot cluster into its own seed cluster and host the gardenlet, the required extensions, and the control plane pods on dedicated control plane nodes managed by MCM. To survive a complete shutdown/reboot of all nodes, the critical components needed to reestablish the control plane (such as ETCD, KAPI, KCM, KSCH, etc.) would be brought up as static pods, so that the control plane always comes back up and, with it, everything else that is scheduled as regular pods. As long as 2 control plane nodes are up or can come up, even extension/MCM-based self-healing will be possible (e.g. provisioning replacement control plane nodes). If more or all nodes are lost, it should be possible to repair the cluster in a similar way as it was created initially, but with an ETCD backup to boot.

Alternatively, we bring up the entire control plane as static pods, if using the shoot cluster control plane for some of the control plane pods itself doesn't really make life simpler (more changes for the gardenlet, but maybe only in the flow; still, it would definitely be double the effort).
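As a rough sketch of what "brought up as static pods" could mean in practice (all images, flags, and host paths below are illustrative assumptions, not an actual Gardener layout), the kubelet on a control plane node would pick up manifests like the following from its static pod directory and recreate them on every boot, independent of any running control plane:

```yaml
# Illustrative static pod manifest, e.g. /etc/kubernetes/manifests/kube-apiserver.yaml
# on an autonomous control plane node. The kubelet (re)starts this pod on every boot,
# which is what lets the control plane come back after a complete shutdown.
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  hostNetwork: true
  priorityClassName: system-node-critical
  containers:
  - name: kube-apiserver
    image: registry.k8s.io/kube-apiserver:v1.28.4    # assumed image/version
    command:
    - kube-apiserver
    - --etcd-servers=https://127.0.0.1:2379          # assumed: ETCD runs as a static pod on the same nodes
    - --secure-port=443
    # further flags, certificates, and admission configuration omitted
    volumeMounts:
    - name: pki
      mountPath: /etc/kubernetes/pki
      readOnly: true
  volumes:
  - name: pki
    hostPath:
      path: /etc/kubernetes/pki                      # assumed certificate location on the host
```

Similar manifests would be needed for ETCD, KCM, KSCH, etc.; how they get rendered, rotated, and updated is the part the gardenlet/extensions would have to take over here.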

Flow

Proof of Concept

To test the waters/experiment with the idea, we should "build" a proof of concept, short-cutting our way to the first autonomous shoot cluster in the following way:

What

The execution plan is not yet defined. We are still in early discussions.

However, once those are concluded, we'd like to "build" such an autonomous shoot cluster manually to have a proof of concept that can help us understand the complexities and thereby inform our next steps, but also to encourage/motivate us, if successful... or stop all discussions around this topic for good, if not. See the proof of concept section above.

So far, we see the following (sub) topics that will have to be addressed for autonomous shoot clusters, e.g. starting with them in a hackathon:

In addition, we need to have an early discussion with the ETCD team about the autonomous shoot cluster scenario and how to solve it with the druid, its upcoming operations, the member resource (replacing the lease resources), the steward, and in general today's dependency on a running seed cluster, which the druid cannot rely upon anymore (and neither can it assume the shoot cluster control plane to be up, because that is based on ETCD). This may be a problem in principle. Possibly, we need to invent some other synchronization solution via the static ETCD druid/steward pods (without Kubernetes as persistence), e.g. via peer-to-peer communication (much like the ETCDs have their own peer-to-peer communication to get themselves going). Besides this question, there is also the general question of how to deal with backups and what to offer in case no blob store is available (internally) or desired (externally), e.g. should we offer to write the backups also to other sinks like a (network) file share, so that the customer can automate the backup process further and we provide instructions for restore (resp. restore from that file share)?
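To make the last question a bit more concrete, one purely hypothetical option would be to point the backup store at a locally mounted (network) file share instead of a blob store; the field names below follow the druid.gardener.cloud/v1alpha1 Etcd resource as we understand it today, and the provider value and mount path are assumptions that would need to be validated with the ETCD team:

```yaml
# Hypothetical sketch only: ETCD backups written to a mounted (network) file share
# instead of a blob store. Whether a file-based/"Local" provider is suitable here,
# and the exact field semantics, are open questions for the ETCD team discussion.
apiVersion: druid.gardener.cloud/v1alpha1
kind: Etcd
metadata:
  name: etcd-main
  namespace: kube-system
spec:
  replicas: 3
  backup:
    store:
      provider: Local                  # assumed: file-based provider instead of S3/GCS/ABS/...
      container: /mnt/etcd-backups     # assumed: an NFS/SMB share mounted on the control plane nodes
      prefix: autonomous-shoot--etcd-main
```

Restore would then essentially mean pointing the restorer at the same path; this only illustrates the question raised above, not a decided design.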

gardener-ci-robot commented 6 days ago

The Gardener project currently lacks enough active contributors to adequately respond to all issues. This bot triages issues according to the following rules:

You can:

/lifecycle stale