expose /sys/fs/cgroup as writable (for memory and other pressure measures from kernel)

grooverdan commented 2 months ago

Feature request description

PSI, pressure stall information, of the kernel should be available to user containers by default.

Suggest potential solution

cgroupv2 as rw by default.

diff --git a/pkg/specgen/generate/oci_linux.go b/pkg/specgen/generate/oci_linux.go
index d6247bbf6..48cc60275 100644
--- a/pkg/specgen/generate/oci_linux.go
+++ b/pkg/specgen/generate/oci_linux.go
@@ -70,16 +70,7 @@ func getCgroupPermissions(unmask []string) string {
                return ro
        }

-       if len(unmask) != 0 && unmask[0] == "ALL" {
-               return rw
-       }
-
-       for _, p := range unmask {
-               if path.Clean(p) == cgroup {
-                       return rw
-               }
-       }
-       return ro
+       return rw
 }

Have you considered any alternatives?

Running --security-opt unmask=/sys/fs/cgroup exposes it rw is an obvious workaround, but if it can't work around pre-runtime setup of the container, is it acceptable to not make it RO?

If the cpu/memory/device command line options impose limits maybe the mounted cgroup needs to be a submount from there to enforce those limits?

Additional context

/sys/fs/cgroup was originally added as a ro bind mount - cf364703fc3f94cd759cc683e3ab9083e8ecc324. Because rbind exposed host info this became a fresh mount (8d3010d06be33a5c51e1c1c860342209c9a97562). Because this is a fresh mount of a cgroup, the information pertains to the current container and should be writeable (as the permissions on the /sys/fs/cgroup/ files indicate)?

Negative case:

There are a number of cpu/memory/device cgroupv2 limits exposed on the command line and a rw mount might make these able to be exceeded by a container.

github-actions[bot] commented 1 month ago

A friendly reminder that this issue had no activity for 30 days.

Luap99 commented 1 week ago

@giuseppe PTAL, I Assume the cgroup should be read only for security so this is not something we can do by default?

giuseppe commented 1 week ago

A container could create a lot of sub-cgroups, potentially slowing down the kernel. So for the default case, I think it is still safer to mount it read-only and tweak it only when necessary

containers / podman