containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

Rootless Podman exposes whole /sys/fs/cgroup/ to container while in "partial" isolation #20073

Closed rockdrilla closed 1 year ago

rockdrilla commented 1 year ago

Issue Description

Rootless Podman exposes whole /sys/fs/cgroup/ to container while in "partial" isolation.

$ podman run --rm --network=host docker.io/library/debian ls -l /sys/fs/cgroup
total 0
-r--r--r--  1 nobody nogroup 0 Sep 19 15:15 cgroup.controllers
-rw-r--r--  1 nobody nogroup 0 Sep 20 21:01 cgroup.max.depth
-rw-r--r--  1 nobody nogroup 0 Sep 20 21:01 cgroup.max.descendants
-rw-r--r--  1 nobody nogroup 0 Sep 20 21:01 cgroup.pressure
-rw-r--r--  1 nobody nogroup 0 Sep 19 15:15 cgroup.procs
-r--r--r--  1 nobody nogroup 0 Sep 20 21:01 cgroup.stat
-rw-r--r--  1 nobody nogroup 0 Sep 20 22:26 cgroup.subtree_control
-rw-r--r--  1 nobody nogroup 0 Sep 20 21:01 cgroup.threads
-rw-r--r--  1 nobody nogroup 0 Sep 20 21:01 cpu.pressure
-r--r--r--  1 nobody nogroup 0 Sep 20 21:01 cpu.stat
-r--r--r--  1 nobody nogroup 0 Sep 19 15:19 cpuset.cpus.effective
-r--r--r--  1 nobody nogroup 0 Sep 20 21:01 cpuset.mems.effective
drwxr-xr-x  2 nobody nogroup 0 Sep 20 22:14 dev-hugepages.mount
drwxr-xr-x  2 nobody nogroup 0 Sep 20 22:14 dev-mqueue.mount
drwxr-xr-x  2 nobody nogroup 0 Sep 20 22:14 init.scope
-rw-r--r--  1 nobody nogroup 0 Sep 20 21:01 io.cost.model
-rw-r--r--  1 nobody nogroup 0 Sep 20 21:01 io.cost.qos
-rw-r--r--  1 nobody nogroup 0 Sep 20 21:01 io.pressure
-rw-r--r--  1 nobody nogroup 0 Sep 20 21:01 io.prio.class
-r--r--r--  1 nobody nogroup 0 Sep 20 21:01 io.stat
drwxr-xr-x  2 nobody nogroup 0 Sep 20 22:26 machine.slice
-r--r--r--  1 nobody nogroup 0 Sep 20 21:01 memory.numa_stat
-rw-r--r--  1 nobody nogroup 0 Sep 20 21:01 memory.pressure
--w-------  1 nobody nogroup 0 Sep 20 21:01 memory.reclaim
-r--r--r--  1 nobody nogroup 0 Sep 20 21:01 memory.stat
-r--r--r--  1 nobody nogroup 0 Sep 20 21:01 misc.capacity
-r--r--r--  1 nobody nogroup 0 Sep 20 21:01 misc.current
drwxr-xr-x  2 nobody nogroup 0 Sep 20 22:14 proc-sys-fs-binfmt_misc.mount
drwxr-xr-x  2 nobody nogroup 0 Sep 20 22:14 sys-fs-fuse-connections.mount
drwxr-xr-x  2 nobody nogroup 0 Sep 20 22:14 sys-kernel-config.mount
drwxr-xr-x  2 nobody nogroup 0 Sep 20 22:14 sys-kernel-debug.mount
drwxr-xr-x  2 nobody nogroup 0 Sep 20 22:14 sys-kernel-tracing.mount
drwxr-xr-x 37 nobody nogroup 0 Sep 20 22:33 system.slice
drwxr-xr-x  3 nobody nogroup 0 Sep 20 22:14 user.slice

Correct behavior (achieved with --systemd=always):

$ podman run --rm --network=host --systemd=always docker.io/library/debian ls -l /sys/fs/cgroup
total 0
-r--r--r-- 1 root root 0 Sep 20 22:34 cgroup.controllers
-r--r--r-- 1 root root 0 Sep 20 22:34 cgroup.events
-rw-r--r-- 1 root root 0 Sep 20 22:34 cgroup.freeze
--w------- 1 root root 0 Sep 20 22:34 cgroup.kill
-rw-r--r-- 1 root root 0 Sep 20 22:34 cgroup.max.depth
-rw-r--r-- 1 root root 0 Sep 20 22:34 cgroup.max.descendants
-rw-r--r-- 1 root root 0 Sep 20 22:34 cgroup.pressure
-rw-r--r-- 1 root root 0 Sep 20 22:34 cgroup.procs
-r--r--r-- 1 root root 0 Sep 20 22:34 cgroup.stat
-rw-r--r-- 1 root root 0 Sep 20 22:34 cgroup.subtree_control
-rw-r--r-- 1 root root 0 Sep 20 22:34 cgroup.threads
-rw-r--r-- 1 root root 0 Sep 20 22:34 cgroup.type
-rw-r--r-- 1 root root 0 Sep 20 22:34 cpu.idle
-rw-r--r-- 1 root root 0 Sep 20 22:34 cpu.max
-rw-r--r-- 1 root root 0 Sep 20 22:34 cpu.max.burst
-rw-r--r-- 1 root root 0 Sep 20 22:34 cpu.pressure
-r--r--r-- 1 root root 0 Sep 20 22:34 cpu.stat
-rw-r--r-- 1 root root 0 Sep 20 22:34 cpu.uclamp.max
-rw-r--r-- 1 root root 0 Sep 20 22:34 cpu.uclamp.min
-rw-r--r-- 1 root root 0 Sep 20 22:34 cpu.weight
-rw-r--r-- 1 root root 0 Sep 20 22:34 cpu.weight.nice
-rw-r--r-- 1 root root 0 Sep 20 22:34 cpuset.cpus
-r--r--r-- 1 root root 0 Sep 20 22:34 cpuset.cpus.effective
-rw-r--r-- 1 root root 0 Sep 20 22:34 cpuset.cpus.partition
-rw-r--r-- 1 root root 0 Sep 20 22:34 cpuset.mems
-r--r--r-- 1 root root 0 Sep 20 22:34 cpuset.mems.effective
-rw-r--r-- 1 root root 0 Sep 20 22:34 io.bfq.weight
-rw-r--r-- 1 root root 0 Sep 20 22:34 io.latency
-rw-r--r-- 1 root root 0 Sep 20 22:34 io.max
-rw-r--r-- 1 root root 0 Sep 20 22:34 io.pressure
-rw-r--r-- 1 root root 0 Sep 20 22:34 io.prio.class
-r--r--r-- 1 root root 0 Sep 20 22:34 io.stat
-rw-r--r-- 1 root root 0 Sep 20 22:34 io.weight
-r--r--r-- 1 root root 0 Sep 20 22:34 memory.current
-r--r--r-- 1 root root 0 Sep 20 22:34 memory.events
-r--r--r-- 1 root root 0 Sep 20 22:34 memory.events.local
-rw-r--r-- 1 root root 0 Sep 20 22:34 memory.high
-rw-r--r-- 1 root root 0 Sep 20 22:34 memory.low
-rw-r--r-- 1 root root 0 Sep 20 22:34 memory.max
-rw-r--r-- 1 root root 0 Sep 20 22:34 memory.min
-r--r--r-- 1 root root 0 Sep 20 22:34 memory.numa_stat
-rw-r--r-- 1 root root 0 Sep 20 22:34 memory.oom.group
-r--r--r-- 1 root root 0 Sep 20 22:34 memory.peak
-rw-r--r-- 1 root root 0 Sep 20 22:34 memory.pressure
--w------- 1 root root 0 Sep 20 22:34 memory.reclaim
-r--r--r-- 1 root root 0 Sep 20 22:34 memory.stat
-r--r--r-- 1 root root 0 Sep 20 22:34 memory.swap.current
-r--r--r-- 1 root root 0 Sep 20 22:34 memory.swap.events
-rw-r--r-- 1 root root 0 Sep 20 22:34 memory.swap.high
-rw-r--r-- 1 root root 0 Sep 20 22:34 memory.swap.max
-r--r--r-- 1 root root 0 Sep 20 22:34 memory.swap.peak
-r--r--r-- 1 root root 0 Sep 20 22:34 memory.zswap.current
-rw-r--r-- 1 root root 0 Sep 20 22:34 memory.zswap.max
-r--r--r-- 1 root root 0 Sep 20 22:34 pids.current
-r--r--r-- 1 root root 0 Sep 20 22:34 pids.events
-rw-r--r-- 1 root root 0 Sep 20 22:34 pids.max
-r--r--r-- 1 root root 0 Sep 20 22:34 pids.peak

However, /proc/self/mountinfo and /proc/self/cgroup look "sane" (but they are not).

$ podman run --rm --network=host docker.io/library/debian sh -ec 'cat /proc/self/cgroup ; echo ; grep cgroup /proc/self/mountinfo'
0::/

582 580 0:26 /../../../../../.. /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw,nsdelegate,memory_recursiveprot
597 582 0:26 /../../../../../.. /sys/fs/cgroup ro,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw,nsdelegate,memory_recursiveprot

Correct behavior:

$ podman run --rm --network=host --systemd=always docker.io/library/debian sh -ec 'cat /proc/self/cgroup ; echo ; grep cgroup /proc/self/mountinfo'
0::/

584 582 0:26 /../../../../../.. /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw,nsdelegate,memory_recursiveprot
601 584 0:79 / /sys/fs/cgroup rw,relatime - tmpfs tmpfs rw,size=4k,nr_inodes=1,uid=1000,gid=1000,inode64
602 601 0:26 / /sys/fs/cgroup rw,relatime - cgroup2 cgroup2 rw,nsdelegate,memory_recursiveprot
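
A quick way to tell the two cases apart from inside the container (a sketch of my own based on the mountinfo entries above, not a command from this report) is to print the root field (fourth column) of the last /sys/fs/cgroup entry in /proc/self/mountinfo, assuming the last entry listed for a mount point is the one actually visible:

$ podman run --rm --network=host docker.io/library/debian \
    awk '$5 == "/sys/fs/cgroup" { root = $4 } END { print root }' /proc/self/mountinfo

Given the listings above, this should print /../../../../../.. in the "partial isolation" case (host cgroup root leaked into the container) and / when run with --systemd=always (the container sees only its own subtree).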

Steps to reproduce the issue

  1. Run a container with partial isolation (e.g. --network=host) and with systemd detection in "auto" mode (i.e. without specifying --systemd=always).
  2. Inspect /sys/fs/cgroup/.

Example:

command='find /sys/fs/cgroup/ -name memory.max -type f -print0 | sort -zuV | xargs -0r grep -FHxv -e max'

podman run --rm -m 2G --network=host docker.io/library/debian sh -ec "${command}"

Describe the results you received

/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/user.slice/libpod-dbb91aaa6460164db847500a44c847d27e03f34cd88d61ea0a6b36c318a5a17c.scope/container/memory.max:2147483648
/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/user.slice/libpod-dbb91aaa6460164db847500a44c847d27e03f34cd88d61ea0a6b36c318a5a17c.scope/memory.max:2147483648

Describe the results you expected

/sys/fs/cgroup/memory.max:2147483648

podman info output

host:
  arch: amd64
  buildahVersion: 1.31.2
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon_2.1.6+ds1-1_amd64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.6, commit: unknown'
  cpuUtilization:
    idlePercent: 98.05
    systemPercent: 0.4
    userPercent: 1.55
  cpus: 12
  databaseBackend: boltdb
  distribution:
    codename: trixie
    distribution: debian
    version: unknown
  eventLogger: file
  freeLocks: 2029
  hostname: lenovatio
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 6.5.4-1-mobile
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 13487517696
  memTotal: 31386062848
  networkBackend: cni
  networkBackendInfo:
    backend: cni
    dns:
      package: golang-github-containernetworking-plugin-dnsname_1.3.1+ds1-2+b8_amd64
      path: /usr/lib/cni/dnsname
      version: |-
        CNI dnsname plugin
        version: 1.3.1
        commit: unknown
        CNI protocol versions supported: 0.1.0, 0.2.0, 0.3.0, 0.3.1, 0.4.0, 1.0.0
    package: 'golang-github-containernetworking-plugin-dnsname, containernetworking-plugins:
      /usr/lib/cni'
    path: /usr/lib/cni
  ociRuntime:
    name: crun
    package: crun_1.9-1_amd64
    path: /usr/bin/crun
    version: |-
      crun version 1.9
      commit: a538ac4ea1ff319bcfe2bf81cb5c6f687e2dc9d3
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: ""
    package: ""
    version: ""
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns_1.2.1-1_amd64
    version: |-
      slirp4netns version 1.2.1
      commit: 09e31e92fa3d2a1d3ca261adaeb012c8d75a8194
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.4
  swapFree: 0
  swapTotal: 0
  uptime: 30h 55m 46.00s (Approximately 1.25 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  127.0.0.1:8080:
    Blocked: false
    Insecure: true
    Location: 127.0.0.1:8080
    MirrorByDigestOnly: false
    Mirrors: []
    Prefix: 127.0.0.1:8080
    PullFromMirror: ""
  127.0.0.1:8082:
    Blocked: false
    Insecure: true
    Location: 127.0.0.1:8082
    MirrorByDigestOnly: false
    Mirrors: []
    Prefix: 127.0.0.1:8082
    PullFromMirror: ""
store:
  configFile: /home/krd/.config/containers/storage.conf
  containerStore:
    number: 1
    paused: 0
    running: 1
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: fuse-overlayfs_1.10-1_amd64
      Version: |-
        fusermount3 version: 3.14.0
        fuse-overlayfs: version 1.10
        FUSE library version 3.14.0
        using FUSE kernel interface version 7.31
  graphRoot: /home/krd/.local/share/containers/storage
  graphRootAllocated: 485560172544
  graphRootUsed: 370000158720
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /tmp/user/1000
  imageStore:
    number: 154
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /home/krd/.local/share/containers/storage/volumes
version:
  APIVersion: 4.6.2
  Built: 0
  BuiltTime: Thu Jan  1 03:00:00 1970
  GitCommit: ""
  GoVersion: go1.21.1
  Os: linux
  OsArch: linux/amd64
  Version: 4.6.2

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

Yes

Additional environment details

Additional information

Running rootless Podman:

$ command='find /sys/fs/cgroup/ -name memory.max -type f -print0 | sort -zuV | xargs -0r grep -FHxv -e max'

$ podman run --rm -m 2G docker.io/library/debian sh -ec "${command}"
/sys/fs/cgroup/memory.max:2147483648

$ podman run --rm -m 2G --network=host docker.io/library/debian sh -ec "${command}"
/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/user.slice/libpod-dbb91aaa6460164db847500a44c847d27e03f34cd88d61ea0a6b36c318a5a17c.scope/container/memory.max:2147483648
/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/user.slice/libpod-dbb91aaa6460164db847500a44c847d27e03f34cd88d61ea0a6b36c318a5a17c.scope/memory.max:2147483648

$ podman run --rm -m 2G --network=host --systemd=always docker.io/library/debian sh -ec "${command}"
/sys/fs/cgroup/memory.max:2147483648

Running rootful Podman:

# command='find /sys/fs/cgroup/ -name memory.max -type f -print0 | sort -zuV | xargs -0r grep -FHxv -e max'

# podman run --rm -m 2G docker.io/library/debian sh -ec "${command}"
/sys/fs/cgroup/memory.max:2147483648

# podman run --rm -m 2G --network=host docker.io/library/debian sh -ec "${command}"
/sys/fs/cgroup/memory.max:2147483648

# podman run --rm -m 2G --network=host --systemd=always docker.io/library/debian sh -ec "${command}"
/sys/fs/cgroup/memory.max:2147483648
rockdrilla commented 1 year ago

We've already been hit by this issue; see e.g. https://github.com/nginxinc/docker-nginx/pull/701.

Luap99 commented 1 year ago

@giuseppe PTAL

giuseppe commented 1 year ago

thanks, opened a PR: https://github.com/containers/podman/pull/20086

Please be aware that it fixes only the cgroup mounted on top of /sys/fs/cgroup. The previous /sys/fs/cgroup coming from the host will still be visible in /proc/self/mountinfo. There is no way to address that: without a netns we cannot mount a fresh sysfs and are forced to bind mount it from the host, and since unprivileged users can only use recursive bind mounts, we grab /sys/fs/cgroup from the host as well.
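
To illustrate the constraint described above (my own sketch with util-linux unshare, not part of the fix): the kernel only allows mounting a fresh sysfs from an unprivileged user namespace when that namespace also owns the network namespace, so a rootless container that shares the host network has to fall back to a recursive bind mount of the host's /sys, including /sys/fs/cgroup.

$ unshare -Urm  sh -c 'mount -t sysfs sysfs /mnt'                        # no new netns: expected to fail with EPERM
$ unshare -Urmn sh -c 'mount -t sysfs sysfs /mnt && ls /mnt/class/net'   # with a new netns: expected to succeed, listing only lo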