Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

`podman run --cgroups=disabled` still changes cgroups #20910

Closed: hvenev-insait closed this issue 3 months ago

hvenev-insait commented 9 months ago

Issue Description

Running a rootless container with --cgroups=disabled still creates a new cgroup and runs the container in it.

The issue can be reproduced in Podman 4.7.2 and 4.8.0 on Debian and Fedora.

Steps to reproduce the issue

  1. Invoke podman run --rm --network=none --cgroups=disabled --cgroupns=host docker.io/library/debian:testing-20231120 sh -c 'cat /proc/$$/cgroup'
  2. Inspect the output of the command

Describe the results you received

The cgroup is different from that of the process that invoked podman run:

- /user.slice/user-10000.slice/user@10000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-f22f14d2-3d44-4304-9879-3fdd0df7b08e.scope
+ /user.slice/user-10000.slice/user@10000.service/user.slice/podman-239302.scope

Describe the results you expected

The cgroup is the same as that of the process that invoked podman run.

podman info output

host:
  arch: amd64
  buildahVersion: 1.33.2
  cgroupControllers:
  - cpu
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.8-2.fc39.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.8, commit: '
  cpuUtilization:
    idlePercent: 97.92
    systemPercent: 0.37
    userPercent: 1.7
  cpus: 20
  databaseBackend: boltdb
  distribution:
    distribution: fedora
    version: "39"
  eventLogger: journald
  freeLocks: 2030
  hostname: iws.m.venev.name
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 10000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 10000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 6.6.3-1-hvenev-baremetal.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 7955566592
  memTotal: 67138625536
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: Unknown
    package: netavark-1.9.0-1.fc39.x86_64
    path: /usr/libexec/podman/netavark
    version: netavark 1.9.0
  ociRuntime:
    name: crun
    package: crun-1.12-1.fc39.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.12
      commit: ce429cb2e277d001c2179df1ac66a470f00802ae
      rundir: /run/user/10000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: ""
    package: ""
    version: ""
  remoteSocket:
    exists: false
    path: /run/user/10000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: ""
    package: ""
    version: ""
  swapFree: 840691712
  swapTotal: 4691324928
  uptime: 143h 56m 34.00s (Approximately 5.96 days)
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /home/hristo/.config/containers/storage.conf
  containerStore:
    number: 11
    paused: 0
    running: 0
    stopped: 11
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/hristo/.local/share/containers/storage
  graphRootAllocated: 1915555418112
  graphRootUsed: 699607498752
  graphStatus:
    Backing Filesystem: btrfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 99
  runRoot: /run/user/10000/containers
  transientStore: false
  volumePath: /home/hristo/.local/share/containers/storage/volumes
version:
  APIVersion: 4.8.0
  Built: 1701165512
  BuiltTime: Tue Nov 28 11:58:32 2023
  GitCommit: ""
  GoVersion: go1.21.4
  Os: linux
  OsArch: linux/amd64
  Version: 4.8.0

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

No

Additional environment details

No response

Additional information

No response

rhatdan commented 9 months ago

@giuseppe PTAL

giuseppe commented 9 months ago

can you show the owner of the current cgroup?

$ id -u
1000
$ cat /proc/self/cgroup 
0::/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-94220382-1f1f-4715-ba8e-dc972a17a057.scope
$ ls -lnd /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-94220382-1f1f-4715-ba8e-dc972a17a057.scope
drwxr-xr-x. 2 1000 1000 0 Dec  5 22:27 /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-94220382-1f1f-4715-ba8e-dc972a17a057.scope

$ podman run --rm --network=none --cgroups=disabled --cgroupns=host fedora sh -c 'cat /proc/$$/cgroup'
0::/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-94220382-1f1f-4715-ba8e-dc972a17a057.scope

hvenev-insait commented 9 months ago

The issue happens both when the owner is the current user:

$ id -u
10000
$ cat /proc/self/cgroup
1:name=systemd:/
0::/user.slice/user-10000.slice/user@10000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-3f83f4ee-c1d2-4189-aacf-4a1329475b7e.scope
$ ls -lnd /sys/fs/cgroup/user.slice/user-10000.slice/user@10000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-3f83f4ee-c1d2-4189-aacf-4a1329475b7e.scope
drwxr-xr-x 2 10000 10000 0 Dec  5 17:19 /sys/fs/cgroup/user.slice/user-10000.slice/user@10000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-3f83f4ee-c1d2-4189-aacf-4a1329475b7e.scope
$ podman run --rm --network=none --cgroups=disabled --cgroupns=host fedora sh -c 'cat /proc/$$/cgroup'
1:name=systemd:/
0::/user.slice/user-10000.slice/user@10000.service/user.slice/podman-246968.scope

and when it's root:

$ machinectl shell root@
# su - user
$ cat /proc/self/cgroup
1:name=systemd:/
0::/user.slice/user-0.slice/session-50.scope
$ ls -lnd /sys/fs/cgroup/user.slice/user-0.slice/session-50.scope
drwxr-xr-x 2 0 0 0 Dec  6 08:18 /sys/fs/cgroup/user.slice/user-0.slice/session-50.scope
$ podman run --rm --network=none --cgroups=disabled --cgroupns=host fedora sh -c 'cat /proc/$$/cgroup'
1:name=systemd:/
0::/user.slice/user-10000.slice/user@10000.service/user.slice/podman-247173.scope

Hm, is that `1:name=systemd:/` cgroup causing issues? It looks like `UserOwnsCurrentSystemdCgroup` might be getting confused.
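
For reference, the kind of check being discussed works roughly like the sketch below: walk /proc/self/cgroup, resolve each systemd-related entry to a directory under /sys/fs/cgroup, and compare its owner with the current user. This is only an illustration, not the actual `UserOwnsCurrentSystemdCgroup` implementation (that lives in containers/common); it just shows how a leftover `1:name=systemd:/` v1 entry can shift the check onto a root-owned directory.

```go
// cgroup-owner.go: print the owner of each systemd-relevant cgroup entry for
// the current process. Illustrative only; not the containers/common code.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"syscall"
)

func main() {
	f, err := os.Open("/proc/self/cgroup")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Each line is "hierarchy-ID:controller-list:path".
		parts := strings.SplitN(sc.Text(), ":", 3)
		if len(parts) != 3 {
			continue
		}
		var dir string
		switch {
		case parts[0] == "0" && parts[1] == "":
			// cgroup v2 unified hierarchy.
			dir = "/sys/fs/cgroup" + parts[2]
		case parts[1] == "name=systemd":
			// Legacy v1 systemd hierarchy; for the "1:name=systemd:/" entry in
			// the report above this resolves to a root-owned directory.
			dir = "/sys/fs/cgroup/systemd" + parts[2]
		default:
			continue
		}
		var st syscall.Stat_t
		if err := syscall.Stat(dir, &st); err != nil {
			fmt.Printf("%s: stat failed: %v\n", dir, err)
			continue
		}
		fmt.Printf("%s: owner uid %d, current euid %d, owned=%v\n",
			dir, st.Uid, os.Geteuid(), int(st.Uid) == os.Geteuid())
	}
}
```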

hvenev-insait commented 9 months ago

Yes, it appears that cleaning up the legacy name=systemd hierarchy fixes the issue when the owner of the cgroup is the current user:

$ cat /proc/self/cgroup 
0::/user.slice/user-10000.slice/user@10000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-a121e3d9-a08f-4c31-a469-487c7a44111f.scope
$ podman run --rm --network=none --cgroups=disabled --cgroupns=host fedora sh -c 'cat /proc/$$/cgroup'
0::/user.slice/user-10000.slice/user@10000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-a121e3d9-a08f-4c31-a469-487c7a44111f.scope

However, it still doesn't work if the owner is root:

$ machinectl shell root@
# su - user
$ cat /proc/self/cgroup 
0::/user.slice/user-0.slice/session-59.scope
$ podman run --rm --network=none --cgroups=disabled --cgroupns=host fedora sh -c 'cat /proc/$$/cgroup'
0::/user.slice/user-10000.slice/user@10000.service/user.slice/podman-249730.scope

hvenev-insait commented 6 months ago

The issue persists in podman 4.9.3.

$ podman --version
podman version 4.9.3
$ cat /proc/self/cgroup
0::/user.slice/user-10000.slice/session-c30.scope
$ ls -lnd /sys/fs/cgroup/user.slice/user-10000.slice/session-c30.scope
drwxr-xr-x 2 0 0 0 Mar  9 13:32 /sys/fs/cgroup/user.slice/user-10000.slice/session-c30.scope
$ podman run --rm --network=none --cgroups=disabled --cgroupns=host fedora sh -c 'cat /proc/$$/cgroup'
0::/user.slice/user-10000.slice/user@10000.service/user.slice/podman-1072248.scope

giuseppe commented 6 months ago

That is kind of expected: --cgroups=disabled is passed down to the OCI runtime, but Podman still creates its own cgroup if the user doesn't own the current one. One reason is that these cgroups might be destroyed when the session exits, even if lingering is set for the rootless user.
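
For context, the move described here corresponds roughly to the logic visible in the patch later in this thread: if the invoking user does not own the current systemd cgroup, Podman asks systemd for a transient podman-<pid>.scope under user.slice and migrates itself into it. The sketch below is only an illustration built on that patch context; the import paths are assumptions, and it is not the actual SetupRootless code.

```go
// scope-move.go: a condensed sketch of the cgroup move discussed above, based
// on the helpers that appear in the patch further down
// (UserOwnsCurrentSystemdCgroup, RunUnderSystemdScope). Import paths are
// assumptions for illustration.
package main

import (
	"fmt"
	"os"

	"github.com/containers/common/pkg/cgroups"
	"github.com/containers/podman/v4/pkg/systemd"
)

func maybeMoveToScope() error {
	ownsCgroup, err := cgroups.UserOwnsCurrentSystemdCgroup()
	if err != nil {
		return err
	}
	if ownsCgroup {
		// The current cgroup is already owned by this user; the existing
		// logic leaves the process where it is.
		return nil
	}
	// Otherwise the current cgroup (e.g. a root-owned session scope) may be
	// torn down when the session exits, so move into a transient scope that
	// the user's systemd instance owns.
	unitName := fmt.Sprintf("podman-%d.scope", os.Getpid())
	return systemd.RunUnderSystemdScope(os.Getpid(), "user.slice", unitName)
}

func main() {
	if err := maybeMoveToScope(); err != nil {
		fmt.Fprintln(os.Stderr, "failed to move into a new scope:", err)
	}
}
```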

hvenev-insait commented 6 months ago

> That is kind of expected: --cgroups=disabled is passed down to the OCI runtime, but Podman still creates its own cgroup if the user doesn't own the current one. One reason is that these cgroups might be destroyed when the session exits, even if lingering is set for the rootless user.

That would be even better -- making sure that the container dies when its cgroup is killed would be great.

giuseppe commented 6 months ago

> That is kind of expected: --cgroups=disabled is passed down to the OCI runtime, but Podman still creates its own cgroup if the user doesn't own the current one. One reason is that these cgroups might be destroyed when the session exits, even if lingering is set for the rootless user.

> That would be even better -- making sure that the container dies when its cgroup is killed would be great.

That can leak resources; we need to make sure `podman container cleanup` runs.

hvenev-insait commented 6 months ago

Then maybe podman/conmon could move itself to another cgroup after spawning the container, not before?

giuseppe commented 6 months ago

> Then maybe podman/conmon could move itself to another cgroup after spawning the container, not before?

What would be the advantage? We would still need to migrate the process to a new cgroup.

hvenev-insait commented 6 months ago

The advantage is that the processes in the container would be running in the original cgroup from which podman was called, not in the new cgroup conmon was moved to. When the original cgroup gets killed, all processes inside the container will die, and conmon will survive and can invoke podman to do cleanup.

Furthermore, existing resource limits on the original cgroup would apply to most of the container, with the exception of conmon, which I assume uses very little resources.
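
As an illustration of the ordering being proposed here (and not of anything Podman or conmon actually does today), a monitor process can start its payload first and only afterwards migrate itself into a different cgroup v2 directory by writing its own PID into that cgroup's cgroup.procs. The destination path and the moveSelf helper below are made up for the example and would need to live inside a cgroup subtree delegated to the rootless user, such as the user@UID.service tree.

```go
// move-after-spawn.go: sketch of "spawn first, move the monitor afterwards".
// Purely illustrative; the destination path is hypothetical and must be inside
// a cgroup v2 subtree the current user may write to.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// moveSelf creates dir as a new cgroup and migrates the current process into
// it by writing our PID into its cgroup.procs file.
func moveSelf(dir string) error {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	pid := []byte(fmt.Sprintf("%d", os.Getpid()))
	return os.WriteFile(filepath.Join(dir, "cgroup.procs"), pid, 0o644)
}

func main() {
	// 1. Start the payload first; it inherits the caller's original cgroup,
	//    so killing that cgroup later also kills the payload.
	payload := exec.Command("sleep", "600")
	if err := payload.Start(); err != nil {
		panic(err)
	}

	// 2. Only now move this process out, so the monitor survives the cgroup
	//    being killed and can still run its cleanup.
	dest := "/sys/fs/cgroup/user.slice/user-10000.slice/user@10000.service/monitor-demo"
	if err := moveSelf(dest); err != nil {
		fmt.Fprintln(os.Stderr, "could not move self:", err)
	}

	_ = payload.Wait()
}
```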

giuseppe commented 6 months ago

Moving conmon later introduces a race condition: the current cgroup could be destroyed before conmon has been moved.

I'm not sure what the cleanest way to solve this would be. Maybe a podman --cgroup-manager=disabled option that disables the mechanism of moving podman, leaving it to the user to enable that explicitly. On the other hand, not moving podman when --cgroups=disabled is specified could be the less surprising behavior. I am currently playing with the patch below.

diff --git a/cmd/podman/common/completion.go b/cmd/podman/common/completion.go
index 7b95287a5..2de4151e8 100644
--- a/cmd/podman/common/completion.go
+++ b/cmd/podman/common/completion.go
@@ -58,8 +58,9 @@ func setupContainerEngine(cmd *cobra.Command) (entities.ContainerEngine, error)
    }
    if !registry.IsRemote() {
        _, noMoveProcess := cmd.Annotations[registry.NoMoveProcess]
+       cgroupMode := cmd.LocalFlags().Lookup("cgroups").Value.String()

-       err := containerEngine.SetupRootless(registry.Context(), noMoveProcess)
+       err := containerEngine.SetupRootless(registry.Context(), noMoveProcess, cgroupMode)
        if err != nil {
            return nil, err
        }
diff --git a/cmd/podman/root.go b/cmd/podman/root.go
index aaa992b67..4cdfdb019 100644
--- a/cmd/podman/root.go
+++ b/cmd/podman/root.go
@@ -363,7 +363,8 @@ func persistentPreRunE(cmd *cobra.Command, args []string) error {
    _, found := cmd.Annotations[registry.ParentNSRequired]
    if !registry.IsRemote() && !found {
        _, noMoveProcess := cmd.Annotations[registry.NoMoveProcess]
-       err := registry.ContainerEngine().SetupRootless(registry.Context(), noMoveProcess)
+       cgroupMode := cmd.LocalFlags().Lookup("cgroups").Value.String()
+       err := registry.ContainerEngine().SetupRootless(registry.Context(), noMoveProcess, cgroupMode)
        if err != nil {
            return err
        }
diff --git a/pkg/domain/entities/engine_container.go b/pkg/domain/entities/engine_container.go
index 15cf309bf..712fdd7cd 100644
--- a/pkg/domain/entities/engine_container.go
+++ b/pkg/domain/entities/engine_container.go
@@ -95,7 +95,7 @@ type ContainerEngine interface { //nolint:interfacebloat
    PodUnpause(ctx context.Context, namesOrIds []string, options PodunpauseOptions) ([]*PodUnpauseReport, error)
    Renumber(ctx context.Context) error
    Reset(ctx context.Context) error
-   SetupRootless(ctx context.Context, noMoveProcess bool) error
+   SetupRootless(ctx context.Context, noMoveProcess bool, cgroupMode string) error
    SecretCreate(ctx context.Context, name string, reader io.Reader, options SecretCreateOptions) (*SecretCreateReport, error)
    SecretInspect(ctx context.Context, nameOrIDs []string, options SecretInspectOptions) ([]*SecretInfoReport, []error, error)
    SecretList(ctx context.Context, opts SecretListRequest) ([]*SecretInfoReport, error)
diff --git a/pkg/domain/infra/abi/system_freebsd.go b/pkg/domain/infra/abi/system_freebsd.go
index c6ec91943..1521a7e1a 100644
--- a/pkg/domain/infra/abi/system_freebsd.go
+++ b/pkg/domain/infra/abi/system_freebsd.go
@@ -8,6 +8,6 @@ import (
 const defaultRunPath = "/var/run"

 // SetupRootless in a NOP for freebsd as it only configures the rootless userns on linux.
-func (ic *ContainerEngine) SetupRootless(_ context.Context, noMoveProcess bool) error {
+func (ic *ContainerEngine) SetupRootless(_ context.Context, noMoveProcess bool, cgroupMode string) error {
    return nil
 }
diff --git a/pkg/domain/infra/abi/system_linux.go b/pkg/domain/infra/abi/system_linux.go
index abe00d89a..baaef396b 100644
--- a/pkg/domain/infra/abi/system_linux.go
+++ b/pkg/domain/infra/abi/system_linux.go
@@ -17,7 +17,7 @@ import (
 // Default path for system runtime state
 const defaultRunPath = "/run"

-func (ic *ContainerEngine) SetupRootless(_ context.Context, noMoveProcess bool) error {
+func (ic *ContainerEngine) SetupRootless(_ context.Context, noMoveProcess bool, cgroupMode string) error {
    runsUnderSystemd := systemd.RunsOnSystemd()
    if !runsUnderSystemd {
        isPid1 := os.Getpid() == 1
@@ -30,30 +30,33 @@ func (ic *ContainerEngine) SetupRootless(_ context.Context, noMoveProcess bool)
        }
    }

-   // do it only after podman has already re-execed and running with uid==0.
-   hasCapSysAdmin, err := unshare.HasCapSysAdmin()
-   if err != nil {
-       return err
-   }
-   // check for both euid == 0 and CAP_SYS_ADMIN because we may be running in a container with CAP_SYS_ADMIN set.
-   if os.Geteuid() == 0 && hasCapSysAdmin {
-       ownsCgroup, err := cgroups.UserOwnsCurrentSystemdCgroup()
+   configureCgroup := cgroupMode != "disabled" && cgroupMode != "split"
+   if configureCgroup {
+       // do it only after podman has already re-execed and running with uid==0.
+       hasCapSysAdmin, err := unshare.HasCapSysAdmin()
        if err != nil {
-           logrus.Infof("Failed to detect the owner for the current cgroup: %v", err)
+           return err
        }
-       if !ownsCgroup {
-           conf, err := ic.Config(context.Background())
+       // check for both euid == 0 and CAP_SYS_ADMIN because we may be running in a container with CAP_SYS_ADMIN set.
+       if os.Geteuid() == 0 && hasCapSysAdmin {
+           ownsCgroup, err := cgroups.UserOwnsCurrentSystemdCgroup()
            if err != nil {
-               return err
+               logrus.Infof("Failed to detect the owner for the current cgroup: %v", err)
            }
-           unitName := fmt.Sprintf("podman-%d.scope", os.Getpid())
-           if runsUnderSystemd || conf.Engine.CgroupManager == config.SystemdCgroupsManager {
-               if err := systemd.RunUnderSystemdScope(os.Getpid(), "user.slice", unitName); err != nil {
-                   logrus.Debugf("Failed to add podman to systemd sandbox cgroup: %v", err)
+           if !ownsCgroup {
+               conf, err := ic.Config(context.Background())
+               if err != nil {
+                   return err
+               }
+               unitName := fmt.Sprintf("podman-%d.scope", os.Getpid())
+               if runsUnderSystemd || conf.Engine.CgroupManager == config.SystemdCgroupsManager {
+                   if err := systemd.RunUnderSystemdScope(os.Getpid(), "user.slice", unitName); err != nil {
+                       logrus.Debugf("Failed to add podman to systemd sandbox cgroup: %v", err)
+                   }
                }
            }
+           return nil
        }
-       return nil
    }

    pausePidPath, err := util.GetRootlessPauseProcessPidPath()
diff --git a/pkg/domain/infra/tunnel/system.go b/pkg/domain/infra/tunnel/system.go
index 492fd0a89..f091fc79c 100644
--- a/pkg/domain/infra/tunnel/system.go
+++ b/pkg/domain/infra/tunnel/system.go
@@ -13,7 +13,7 @@ func (ic *ContainerEngine) Info(ctx context.Context) (*define.Info, error) {
    return system.Info(ic.ClientCtx, nil)
 }

-func (ic *ContainerEngine) SetupRootless(_ context.Context, noMoveProcess bool) error {
+func (ic *ContainerEngine) SetupRootless(_ context.Context, noMoveProcess bool, cgroupMode string) error {
    panic(errors.New("rootless engine mode is not supported when tunneling"))
 }

hvenev-insait commented 3 months ago

This just leads to crashes:

$ podman ps 
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x1766a2a]

goroutine 1 [running]:
main.persistentPreRunE(0x2d89da0, {0x2eb50c0, 0x0, 0x0})
    github.com/containers/podman/cmd/podman/root.go:366 +0xaaa
github.com/containers/podman/vendor/github.com/spf13/cobra.(*Command).execute(0x2d89da0, {0xc0000400b0, 0x0, 0x0})
    github.com/containers/podman/vendor/github.com/spf13/cobra/command.go:954 +0x951
github.com/containers/podman/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0x2d68ca0)
    github.com/containers/podman/vendor/github.com/spf13/cobra/command.go:1115 +0x3ff
github.com/containers/podman/vendor/github.com/spf13/cobra.(*Command).Execute(...)
    github.com/containers/podman/vendor/github.com/spf13/cobra/command.go:1039
github.com/containers/podman/vendor/github.com/spf13/cobra.(*Command).ExecuteContext(...)
    github.com/containers/podman/vendor/github.com/spf13/cobra/command.go:1032
main.Execute()
    github.com/containers/podman/cmd/podman/root.go:115 +0xb4
main.main()
    github.com/containers/podman/cmd/podman/main.go:60 +0x452

(cmd/podman/root.go:366 is `cgroupMode := cmd.LocalFlags().Lookup("cgroups").Value.String()`)
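
The nil dereference is consistent with commands that never register a --cgroups flag, such as podman ps: pflag's Lookup returns nil for an unknown flag, and calling .Value.String() on that nil pointer panics. A guarded variant of the lookup could look like the sketch below; this is only an illustration, with a hypothetical helper name and placement, not the fix that ended up in the PR mentioned next.

```go
// A guarded version of the flag lookup from the patch above. Sketch only; the
// package/file placement is illustrative.
package common

import "github.com/spf13/cobra"

// cgroupsFlagValue returns the value of --cgroups when the command defines it,
// and an empty string otherwise, so commands like "podman ps" that have no
// such flag no longer hit a nil-pointer dereference.
func cgroupsFlagValue(cmd *cobra.Command) string {
	if f := cmd.LocalFlags().Lookup("cgroups"); f != nil {
		return f.Value.String()
	}
	return ""
}
```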

giuseppe commented 3 months ago

thanks for trying it.

Fixed that issue and opened a PR: