bottlerocket-os / bottlerocket-admin-container

A container for admin access to Bottlerocket

Run on cgroup v2 host #76

Closed · markusboehme closed this 1 year ago

markusboehme commented 1 year ago

Issue number: #75

Description of changes: This PR aims to make the admin container work on Bottlerocket hosts with cgroup v2/the unified cgroup hierarchy enabled (rather than the current cgroup v1 default). The current version of the admin container is hindered by systemd v219 inside its image not supporting cgroup v2, and no more recent version of systemd being available for the Amazon Linux 2 base image.

Description of changes: This PR aims to make the admin container work on Bottlerocket hosts with cgroup v2 (the unified cgroup hierarchy) enabled, rather than the current cgroup v1 default. The current version of the admin container is held back by systemd v219 inside its image, which does not support cgroup v2, and no more recent version of systemd is available for the Amazon Linux 2 base image.

The PR introduces a dedicated but unused cgroup v1 hierarchy just for the admin container and instructs the user systemd process inside the container to use this separate hierarchy via downstream patches. In order to apply those patches, systemd needs to be recompiled during the image build.

The core of this goes back to @bcressey noting that the cgroup v1 filesystem can be mounted without any controllers. I take responsibility for actually running with this and deciding to patch systemd inside the container. The benefit compared to doing it on the host is that all code required for the admin container stays inside the admin container, and no changes at all can be observed on hosts that are not running the admin container.
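
For context, a minimal sketch of a controller-less cgroup v1 mount looks like this (the mount point and hierarchy name are illustrative, not necessarily what the patched container uses):

# Mount a named cgroup v1 hierarchy with no controllers attached;
# "none" disables all controllers, "name=" labels the hierarchy.
mkdir -p /sys/fs/cgroup/systemd
mount -t cgroup -o none,name=systemd cgroup /sys/fs/cgroup/systemd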

Testing done: Locally with cgroup v1 and cgroup v2 VMs.

To test with cgroup v2 enabled, add the following user data snippet at boot:

[settings.boot]
reboot-to-reconcile = true

[settings.boot.init]
"systemd.unified_cgroup_hierarchy" = ["1"]

or apply the corresponding setting at runtime and reboot:

apiclient set -j '{"settings": {"boot": {"init": {"systemd.unified_cgroup_hierarchy": ["1"]}}}}'

The container image with the changes from this PR can be used on both cgroup v1 and cgroup v2 hosts. Host access using sudo sheltie works just the same, and the admin container can be stopped/disabled and restarted/reenabled.

I launched a few pods into a Kubernetes cluster with a node that used the cgroup v2 hierarchy. I then verified on the host that the containers were running in the cgroup v2 hierarchy.
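
For reference, a couple of standard commands (not specific to this PR) can confirm which hierarchy is in effect on the host:

# cgroup2fs indicates the unified (v2) hierarchy; tmpfs indicates v1.
stat -fc %T /sys/fs/cgroup
# On a pure cgroup v2 host each process has a single "0::/..." entry.
cat /proc/self/cgroup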

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

markusboehme commented 1 year ago

Elaborated on how I tested the change.

markusboehme commented 1 year ago

In my build on aarch64, the new container image weighs in at 417 MB vs. 324 MB before, as reported by docker image ls. A quick tally over the respective images' root directories comes out roughly equal, so I suspect there are some whiteouts from upper image layers, potentially covering the old files of the original systemd RPMs.
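
To see where the extra weight sits, the per-layer breakdown is a good starting point; a sketch using standard Docker commands (the image name is a placeholder):

# Overall size as reported by Docker.
docker image ls admin-container
# Per-layer sizes; a large layer that later layers partially replace
# points at whiteouts carrying dead weight.
docker history --no-trunc admin-container:latest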

markusboehme commented 1 year ago

There is an (experimental) --squash option to docker build that merges image layers and removes redundant files. This does have the desired effect of squeezing the overall image size down to 316 MB (8 MB less than the image before this PR). Naturally, this also prevents sharing of image layers, which would at least be helpful for the Amazon Linux 2 base image shared between the standard admin and control container images.

Another quick win can be had by bind-mounting the newly built RPMs from a previous build stage rather than temporarily copying them in. This gets the image down to 372 MB (45 MB saved, but still an unnecessary 48 MB heavier than before this PR) without the drawbacks of squashing the layers.
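
A sketch of the bind-mount approach using BuildKit, assuming a build stage named rpmbuild that leaves the patched RPMs under /rpms (the stage name and paths are illustrative):

# syntax=docker/dockerfile:1
FROM amazonlinux:2 AS rpmbuild
# ... build the patched systemd RPMs into /rpms ...

FROM amazonlinux:2
# The RPMs are visible only while this RUN executes; nothing is copied
# into a layer, so no whiteout is left behind afterwards.
RUN --mount=type=bind,from=rpmbuild,source=/rpms,target=/tmp/rpms \
    yum install -y /tmp/rpms/*.rpm && yum clean all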

I'll update the PR with the bind-mount approach. Perhaps somebody has another idea to squeeze the image some more.

bcressey commented 1 year ago

> I'll update the PR with the bind-mount approach. Perhaps somebody has another idea to squeeze the image some more.

systemd isn't in the base container image; it comes in when the "Development Tools" group is installed. If the layers are structured so that we don't install that group until the end - or potentially not at all, if we replace it with a more targeted package install command - then we could avoid the whiteout of the previous systemd install.

bcressey commented 1 year ago

Oh, there's already a more targeted yum install - I think we just need to install the RPMs from the bind mount before rather than after that point.

markusboehme commented 1 year ago

> > I'll update the PR with the bind-mount approach. Perhaps somebody has another idea to squeeze the image some more.
>
> systemd isn't in the base container image; it comes in when the "Development Tools" group is installed. If the layers are structured so that we don't install that group until the end - or potentially not at all, if we replace it with a more targeted package install command - then we could avoid the whiteout of the previous systemd install.

True, I jumped to conclusions here, thanks for noticing! systemd is not in the base image, but is actually pulled in as a dependency of the openssh-server package. This means we can cut the dead weight by making sure the downstream systemd build is installed alongside the other packages baked into the admin container image.
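
Roughly, that amounts to one combined install step (package list abbreviated; stage and path names are illustrative, as above):

# One transaction: the patched systemd satisfies openssh-server's
# dependency, so the stock systemd is never installed and whiteouted.
RUN --mount=type=bind,from=rpmbuild,source=/rpms,target=/tmp/rpms \
    yum install -y /tmp/rpms/*.rpm openssh-server && \
    yum clean all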

markusboehme commented 1 year ago

The earlier force push addresses feedback by @bcressey. Leaving the PR in draft mode until I figure out why the cgroup root lingers on when the admin container is disabled.

markusboehme commented 1 year ago

What I had completely overlooked so far is that the admin container may be run without superpowers (superpowered = false in the user data). It then does not have the necessary permissions to mount a cgroup v1 root by itself in start_admin.sh. While there are other problems with running without superpowers at the moment (#78), this probably tanks the approach of containing all the compatibility work in the admin container.
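
For reference, this is roughly what the relevant user data looks like (a sketch of the host-containers settings; see the Bottlerocket settings documentation for the full schema):

[settings.host-containers.admin]
# Without superpowers, start_admin.sh cannot mount the cgroup v1 root.
superpowered = false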

markusboehme commented 1 year ago

I will merge this tomorrow unless someone speaks up against it. Consensus on #78 is not to worry about the "admin container without superpowers" use case, since it has been broken for a while already without anyone noticing, and it would not be particularly useful even if it weren't.