Story

As a user, I want to update my node operating system in-place with only a restart, without requiring a rolling update.

Motivation

We were approached by parties that, for performance reasons, use locally attached disks with data. While the data is replicated, rolling a node and re-building/syncing the local data may take hours. Doing that during a cluster rolling update may therefore take many days, which is difficult for them.

This feature is also useful in cases where the machine type in use is scarce (very special machine types) and getting new machines is not easy or guaranteed (no reserved instances).

GardenLinux is currently developing this capability. Reminiscent of CoreOS' FastPatch updates, it will have two partitions: run on one, prepare the other, then reboot into the prepared one. Persistent data is stored on yet another partition and preserved. This may not work for every update, but for many. The GardenLinux developers expect that a full rolling update will be needed only every 1-2 years, and that all other updates could be handled in-place once their work and ours is done.

This ticket is about Gardener's part: we do not support in-place OS updates as of now, so we need to think this through and, if feasible, implement it. For historic reference, see one of our very first Gardener tickets (no. 14, from when we implemented fully automated cluster updates for K8s v1.5 -> v1.6; time flies), where we at first decided against FastPatch (https://github.com/gardener/gardener/issues/14).

Labels

/area os /kind enhancement /os garden-linux /topology shoot

Acceptance Criteria

[ ] Node OS updates (probably of something like patch versions, to also fit our Kubernetes versioning concept) are done without rolling the nodes
[ ] Ideally, the "dead time" between the kubelet's last status post and its first post after the reboot (99th percentile) is (a) shorter than the default machineHealthTimeout of 10m or, even better, (b) shorter than the default KCM nodeMonitorGracePeriod of 40s. If that is not sufficient, cluster admins can tweak these settings (including pod tolerations). Still, it would be great to achieve (a), if not (b), since "it was said" that rebooting shall take seconds
[ ] ...

Enhancement/Implementation Proposal (optional)

This will require a GEP (https://github.com/gardener/gardener/tree/master/docs/proposals), as conceptual and core changes will be necessary, along with everything else up to and including an update of the versioning guide/docs. Another question is who the main actor is: will we handle this use case like we handle Kubernetes patch updates, i.e. carried out by the maintenance controller? That is probably preferable to the OS doing it itself, for multiple reasons (means to opt out, shoot spec lists the exact version, time scattering/jitter or coordinated updates, etc.).

Further Considerations

Rolling updates have useful side effects: they help with some security obligations (regular fresh start), encourage robust solutions (avoiding pet VMs), and act as a safety net, because the old node is drained and subsequently terminated only once the new node is registered and ready. In-place updates obviously do not offer this.
Because in-place updates are not generally desirable (only in certain cases, e.g. for nodes with local disks or of scarce machine types), it would be best to make the update policy (rolling or in-place) configurable per worker pool, which would require more changes. The maintenance section as of today applies to the entire cluster.
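
As a rough illustration of the per-worker-pool policy mentioned above, the following Go sketch shows one possible shape. All names (`UpdateStrategy`, `WorkerPool`, `planUpdate`) are hypothetical and not part of the existing Gardener API. The sketch assumes that rolling stays the default and that an in-place update falls back to rolling whenever the OS reports that a given update cannot be applied in-place.

```go
package main

import "fmt"

// UpdateStrategy is a hypothetical per-pool setting, not an existing Gardener type.
type UpdateStrategy string

const (
	Rolling UpdateStrategy = "Rolling"
	InPlace UpdateStrategy = "InPlace"
)

// WorkerPool is a simplified, hypothetical stand-in for a shoot worker pool.
type WorkerPool struct {
	Name           string
	UpdateStrategy UpdateStrategy // empty means "not set"
}

// planUpdate picks the strategy for one OS update of one pool.
// inPlacePossible models the statement that not every update can be
// handled in-place; how this would be detected is left open here.
func planUpdate(pool WorkerPool, inPlacePossible bool) UpdateStrategy {
	if pool.UpdateStrategy == InPlace && inPlacePossible {
		return InPlace
	}
	// Default and fallback: today's behavior, a rolling update.
	return Rolling
}

func main() {
	pools := []WorkerPool{
		{Name: "general"},                                // no policy set
		{Name: "local-disks", UpdateStrategy: InPlace},   // opted in
	}
	for _, p := range pools {
		fmt.Printf("%s: %s\n", p.Name, planUpdate(p, true))
	}
}
```

Keeping rolling as the default means existing worker pools would behave as they do today unless they explicitly opt in.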

Resources (optional)

Contacts: @MalteJ, @gehoern, @MrBatschner, @danielfoehrKn

Definition of Done

[ ] Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
[ ] Unit tests are provided: Have you written automated unit tests?
[ ] Integration tests are provided: Have you written automated integration tests?
[ ] Minimum API exposure: If you have added/changed public API, was it really necessary/is it minimal?
[ ] Operations guide: Have you updated the operations guide about ops-relevant changes?
[ ] User documentation: Have you updated the READMEs/docs/how-tos about user-relevant changes?

gardener / gardener-extension-os-gardenlinux

In-Place Node OS Updates #120