SURFscz / SRAM-deploy

Deploy scripts for the SCZ
Apache License 2.0

Fix docker cgroupv2 compatibility #357

Closed mrvanes closed 6 months ago

mrvanes commented 1 year ago

This PR finally resolves cgroupv2 compatibility for our deploy. I can now do without systemd.unified_cgroup_hierarchy=false in my GRUB_CMDLINE_LINUX boot param.

It requires a docker daemon configuration file, /etc/docker/daemon.json:

{
  "features": { "buildkit": true },
  "experimental": true,
  "exec-opts": ["native.cgroupdriver=systemd"],
  "cgroup-parent": "docker.slice"
}

But it also requires the containers to be privileged, which is a sad side-effect of docker mounting /sys/fs/cgroup ro when unprivileged, even in a private cgroup slice. And because of that, we need to disable the ttys in the docker containers, because otherwise the host explodes.

https://serverfault.com/questions/1053187/systemd-fails-to-run-in-a-docker-container-when-using-cgroupv2-cgroupns-priva
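
A minimal sketch for checking the two preconditions described above before trying this setup: that the host actually runs the cgroup v2 unified hierarchy, and that the daemon configuration selects the systemd cgroup driver. The function names are hypothetical and not part of this PR. The check relies on the fact that the root of a cgroup v2 hierarchy contains a cgroup.controllers file.

```python
import json
from pathlib import Path


def uses_cgroup_v2(cgroup_root="/sys/fs/cgroup"):
    """Return True if cgroup_root looks like a cgroup v2 unified hierarchy.

    Hypothetical helper: the root of a cgroup v2 hierarchy contains a
    cgroup.controllers file, which a cgroup v1 layout does not have.
    """
    return (Path(cgroup_root) / "cgroup.controllers").is_file()


def daemon_uses_systemd_driver(config_text):
    """Return True if the daemon.json text selects the systemd cgroup driver.

    Hypothetical helper: only inspects the "exec-opts" list, as used in
    the configuration shown in this PR.
    """
    config = json.loads(config_text)
    return "native.cgroupdriver=systemd" in config.get("exec-opts", [])
```

Calling uses_cgroup_v2() with no arguments inspects the live host; passing the contents of /etc/docker/daemon.json to daemon_uses_systemd_driver() checks the configuration. Neither function verifies that the containers themselves start correctly.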

baszoetekouw commented 1 year ago

I'm not keen on running the containers as privileged, as it essentially removes separation between host and containers.

Systemd is also quite explicit on how this should work:

Either pre-mount all cgroup hierarchies in full into the container, or leave that to systemd which will do so if they are missing. Note that it is explicitly not OK to just mount a sub-hierarchy into the container as that is incompatible with /proc/$PID/cgroup (which lists full paths). Also the root-level cgroup directories tend to be quite different from inner directories, and that distinction matters. It is OK however, to mount the "upper" parts read-only of the hierarchies, and only allow write-access to the cgroup subtree the container runs in. It's also a good idea to mount all controller hierarchies with exception of "name=systemd" fully read-only, to protect the controllers from alteration from inside the containers. Or to turn this around: only the cgroup subtree of the container itself in the name=systemd hierarchy must be writable to the container.

(https://www.freedesktop.org/wiki/Software/systemd/ContainerInterface/)

The nicer solution seems to be to explicitly use user namespacing (https://docs.docker.com/engine/security/userns-remap/). If that's enabled, using privileged containers is probably fine.
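
As a rough sketch of what enabling user namespacing could look like, the helper below adds the userns-remap key to an existing daemon.json text. The function name is hypothetical. The value "default" is the one described in the linked Docker documentation, which makes the daemon create and use a dockremap user.

```python
import json


def with_userns_remap(config_text, remap_user="default"):
    """Return daemon.json text with user-namespace remapping enabled.

    Hypothetical helper: parses the existing configuration (an empty
    string is treated as an empty config) and sets "userns-remap".
    """
    config = json.loads(config_text) if config_text.strip() else {}
    config["userns-remap"] = remap_user
    return json.dumps(config, indent=2)
```

One caveat worth verifying before relying on this: the same Docker documentation lists --privileged without --userns=host among the features that are incompatible with a remapped daemon, so the combination suggested here may not work as-is.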

baszoetekouw commented 1 year ago

As discussed, the real fix is to move to properly dockerized apps instead of containers-as-VMs.

baszoetekouw commented 6 months ago

I'm going to close this now; our new docker solution should fix this, I think?