@giuseppe PTAL
Can you show the owner of the current cgroup?
$ id -u
1000
$ cat /proc/self/cgroup
0::/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-94220382-1f1f-4715-ba8e-dc972a17a057.scope
$ ls -lnd /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-94220382-1f1f-4715-ba8e-dc972a17a057.scope
drwxr-xr-x. 2 1000 1000 0 Dec 5 22:27 /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-94220382-1f1f-4715-ba8e-dc972a17a057.scope
$ podman run --rm --network=none --cgroups=disabled --cgroupns=host fedora sh -c 'cat /proc/$$/cgroup'
0::/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-94220382-1f1f-4715-ba8e-dc972a17a057.scope
The issue happens both when the owner is the current user:
$ id -u
10000
$ cat /proc/self/cgroup
1:name=systemd:/
0::/user.slice/user-10000.slice/user@10000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-3f83f4ee-c1d2-4189-aacf-4a1329475b7e.scope
$ ls -lnd /sys/fs/cgroup/user.slice/user-10000.slice/user@10000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-3f83f4ee-c1d2-4189-aacf-4a1329475b7e.scope
drwxr-xr-x 2 10000 10000 0 Dec 5 17:19 /sys/fs/cgroup/user.slice/user-10000.slice/user@10000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-3f83f4ee-c1d2-4189-aacf-4a1329475b7e.scope
$ podman run --rm --network=none --cgroups=disabled --cgroupns=host fedora sh -c 'cat /proc/$$/cgroup'
1:name=systemd:/
0::/user.slice/user-10000.slice/user@10000.service/user.slice/podman-246968.scope
and when it's root:
$ machinectl shell root@
# su - user
$ cat /proc/self/cgroup
1:name=systemd:/
0::/user.slice/user-0.slice/session-50.scope
$ ls -lnd /sys/fs/cgroup/user.slice/user-0.slice/session-50.scope
drwxr-xr-x 2 0 0 0 Dec 6 08:18 /sys/fs/cgroup/user.slice/user-0.slice/session-50.scope
$ podman run --rm --network=none --cgroups=disabled --cgroupns=host fedora sh -c 'cat /proc/$$/cgroup'
1:name=systemd:/
0::/user.slice/user-10000.slice/user@10000.service/user.slice/podman-247173.scope
Hm, is that 1:name=systemd:/ cgroup causing issues? It looks like UserOwnsCurrentSystemdCgroup might be getting confused.
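For context, a rough sketch of how an ownership check like UserOwnsCurrentSystemdCgroup can work: walk /proc/self/cgroup, stat the corresponding cgroupfs directory for each hierarchy, and require that every one is owned by the current user. The paths and logic below are illustrative, not the actual containers/common implementation; the point is that a leftover 1:name=systemd:/ entry resolves to a root-owned directory and makes a naive check fail.

package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"syscall"
)

// userOwnsCurrentCgroup is a sketch: it returns true only if every cgroup
// hierarchy the process belongs to is owned by the current effective UID.
func userOwnsCurrentCgroup() (bool, error) {
	f, err := os.Open("/proc/self/cgroup")
	if err != nil {
		return false, err
	}
	defer f.Close()

	uid := os.Geteuid()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Each line looks like "0::/user.slice/..." or "1:name=systemd:/".
		parts := strings.SplitN(scanner.Text(), ":", 3)
		if len(parts) != 3 {
			continue
		}
		var dir string
		switch parts[1] {
		case "": // cgroup v2 unified hierarchy
			dir = filepath.Join("/sys/fs/cgroup", parts[2])
		case "name=systemd": // legacy named hierarchy
			// A leftover "1:name=systemd:/" entry points at the root of the
			// named hierarchy, which is owned by root, so a naive check
			// fails even when the unified cgroup is owned by the user.
			dir = filepath.Join("/sys/fs/cgroup/systemd", parts[2])
		default:
			continue
		}
		st, err := os.Stat(dir)
		if err != nil {
			return false, err
		}
		if int(st.Sys().(*syscall.Stat_t).Uid) != uid {
			return false, nil
		}
	}
	return true, scanner.Err()
}

func main() {
	owns, err := userOwnsCurrentCgroup()
	fmt.Println(owns, err)
}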
Yes, it appears that cleaning up the legacy name=systemd hierarchy fixes the issue when the owner of the cgroup is the current user:
$ cat /proc/self/cgroup
0::/user.slice/user-10000.slice/user@10000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-a121e3d9-a08f-4c31-a469-487c7a44111f.scope
$ podman run --rm --network=none --cgroups=disabled --cgroupns=host fedora sh -c 'cat /proc/$$/cgroup'
0::/user.slice/user-10000.slice/user@10000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-a121e3d9-a08f-4c31-a469-487c7a44111f.scope
However, it still doesn't work if the owner is root:
$ machinectl shell root@
# su - user
$ cat /proc/self/cgroup
0::/user.slice/user-0.slice/session-59.scope
$ podman run --rm --network=none --cgroups=disabled --cgroupns=host fedora sh -c 'cat /proc/$$/cgroup'
0::/user.slice/user-10000.slice/user@10000.service/user.slice/podman-249730.scope
The issue persists in podman 4.9.3.
$ podman --version
podman version 4.9.3
$ cat /proc/self/cgroup
0::/user.slice/user-10000.slice/session-c30.scope
$ ls -lnd /sys/fs/cgroup/user.slice/user-10000.slice/session-c30.scope
drwxr-xr-x 2 0 0 0 Mar 9 13:32 /sys/fs/cgroup/user.slice/user-10000.slice/session-c30.scope
$ podman run --rm --network=none --cgroups=disabled --cgroupns=host fedora sh -c 'cat /proc/$$/cgroup'
0::/user.slice/user-10000.slice/user@10000.service/user.slice/podman-1072248.scope
That is kind of expected: the cgroups=disabled setting is passed down to the OCI runtime, but Podman still creates its own cgroup if the user doesn't own the current one. One reason is that these cgroups might be destroyed when the session exits, even if lingering is set for the rootless user.
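For reference, a minimal sketch of the mechanism behind the podman-<pid>.scope cgroups seen above: a transient systemd scope is requested over D-Bus and the current process is placed into it. This is only an illustration using the go-systemd v22 API, not Podman's actual code (Podman goes through systemd.RunUnderSystemdScope, as seen in the patch further below).

package main

import (
	"context"
	"fmt"
	"os"

	systemdDbus "github.com/coreos/go-systemd/v22/dbus"
	godbus "github.com/godbus/dbus/v5"
)

// moveSelfToTransientScope asks the user's systemd instance to create a
// transient scope under user.slice and put the current process into it.
func moveSelfToTransientScope(ctx context.Context) error {
	conn, err := systemdDbus.NewUserConnectionContext(ctx)
	if err != nil {
		return err
	}
	defer conn.Close()

	pid := os.Getpid()
	unitName := fmt.Sprintf("podman-%d.scope", pid)
	props := []systemdDbus.Property{
		systemdDbus.PropSlice("user.slice"),
		// Put this process into the new scope.
		{Name: "PIDs", Value: godbus.MakeVariant([]uint32{uint32(pid)})},
		// Delegate the cgroup so the process can manage sub-cgroups itself.
		{Name: "Delegate", Value: godbus.MakeVariant(true)},
	}

	ch := make(chan string, 1)
	if _, err := conn.StartTransientUnitContext(ctx, unitName, "fail", props, ch); err != nil {
		return err
	}
	<-ch // wait for systemd to finish the job
	return nil
}

func main() {
	if err := moveSelfToTransientScope(context.Background()); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}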
That would be even better -- making sure that the container dies when its cgroup is killed would be great.
That can leak resources; we need to make sure podman container cleanup runs.
Then maybe podman/conmon could move itself to another cgroup after spawning the container, not before?
What would be the advantage? We would still need to migrate the process to a new cgroup.
The advantage is that the processes in the container would be running in the original cgroup from which podman was called, not in the new cgroup conmon was moved to. When the original cgroup gets killed, all processes inside the container will die, and conmon will survive and can invoke podman to do cleanup. Furthermore, existing resource limits on the original cgroup would apply to most of the container, with the exception of conmon, which I assume uses very little resources.
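To make the proposal concrete, here is a minimal sketch of the raw cgroup v2 mechanism such a move would use: create a target directory under the unified hierarchy and write the PID into its cgroup.procs file. The target path and function name are illustrative, not conmon's actual code, and the caller needs write access to a delegated subtree.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// moveToCgroup migrates a process into a (possibly new) cgroup by writing
// its PID to the target's cgroup.procs file.
func moveToCgroup(target string, pid int) error {
	dir := filepath.Join("/sys/fs/cgroup", target)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	// Writing a PID to cgroup.procs moves that process into the cgroup.
	return os.WriteFile(filepath.Join(dir, "cgroup.procs"),
		[]byte(fmt.Sprintf("%d", pid)), 0o644)
}

func main() {
	// Illustrative only: move the current process into a scratch cgroup.
	if err := moveToCgroup("user.slice/example.scope", os.Getpid()); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}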
Moving conmon later introduces a race condition if the current cgroup is destroyed before conmon has been moved.
Not sure what the cleanest way to solve this would be. Maybe a podman --cgroup-manager=disabled option that disables the mechanism of moving podman; then it would be the user's responsibility to enable that. On the other hand, not moving podman when disabled is specified could be the less surprising behavior. I am currently playing with the patch below.
diff --git a/cmd/podman/common/completion.go b/cmd/podman/common/completion.go
index 7b95287a5..2de4151e8 100644
--- a/cmd/podman/common/completion.go
+++ b/cmd/podman/common/completion.go
@@ -58,8 +58,9 @@ func setupContainerEngine(cmd *cobra.Command) (entities.ContainerEngine, error)
}
if !registry.IsRemote() {
_, noMoveProcess := cmd.Annotations[registry.NoMoveProcess]
- err := containerEngine.SetupRootless(registry.Context(), noMoveProcess)
+ cgroupMode := cmd.LocalFlags().Lookup("cgroups").Value.String()
+ err := containerEngine.SetupRootless(registry.Context(), noMoveProcess, cgroupMode)
if err != nil {
return nil, err
}
diff --git a/cmd/podman/root.go b/cmd/podman/root.go
index aaa992b67..4cdfdb019 100644
--- a/cmd/podman/root.go
+++ b/cmd/podman/root.go
@@ -363,7 +363,8 @@ func persistentPreRunE(cmd *cobra.Command, args []string) error {
_, found := cmd.Annotations[registry.ParentNSRequired]
if !registry.IsRemote() && !found {
_, noMoveProcess := cmd.Annotations[registry.NoMoveProcess]
- err := registry.ContainerEngine().SetupRootless(registry.Context(), noMoveProcess)
+ cgroupMode := cmd.LocalFlags().Lookup("cgroups").Value.String()
+ err := registry.ContainerEngine().SetupRootless(registry.Context(), noMoveProcess, cgroupMode)
if err != nil {
return err
}
diff --git a/pkg/domain/entities/engine_container.go b/pkg/domain/entities/engine_container.go
index 15cf309bf..712fdd7cd 100644
--- a/pkg/domain/entities/engine_container.go
+++ b/pkg/domain/entities/engine_container.go
@@ -95,7 +95,7 @@ type ContainerEngine interface { //nolint:interfacebloat
PodUnpause(ctx context.Context, namesOrIds []string, options PodunpauseOptions) ([]*PodUnpauseReport, error)
Renumber(ctx context.Context) error
Reset(ctx context.Context) error
- SetupRootless(ctx context.Context, noMoveProcess bool) error
+ SetupRootless(ctx context.Context, noMoveProcess bool, cgroupMode string) error
SecretCreate(ctx context.Context, name string, reader io.Reader, options SecretCreateOptions) (*SecretCreateReport, error)
SecretInspect(ctx context.Context, nameOrIDs []string, options SecretInspectOptions) ([]*SecretInfoReport, []error, error)
SecretList(ctx context.Context, opts SecretListRequest) ([]*SecretInfoReport, error)
diff --git a/pkg/domain/infra/abi/system_freebsd.go b/pkg/domain/infra/abi/system_freebsd.go
index c6ec91943..1521a7e1a 100644
--- a/pkg/domain/infra/abi/system_freebsd.go
+++ b/pkg/domain/infra/abi/system_freebsd.go
@@ -8,6 +8,6 @@ import (
const defaultRunPath = "/var/run"
// SetupRootless in a NOP for freebsd as it only configures the rootless userns on linux.
-func (ic *ContainerEngine) SetupRootless(_ context.Context, noMoveProcess bool) error {
+func (ic *ContainerEngine) SetupRootless(_ context.Context, noMoveProcess bool, cgroupMode string) error {
return nil
}
diff --git a/pkg/domain/infra/abi/system_linux.go b/pkg/domain/infra/abi/system_linux.go
index abe00d89a..baaef396b 100644
--- a/pkg/domain/infra/abi/system_linux.go
+++ b/pkg/domain/infra/abi/system_linux.go
@@ -17,7 +17,7 @@ import (
// Default path for system runtime state
const defaultRunPath = "/run"
-func (ic *ContainerEngine) SetupRootless(_ context.Context, noMoveProcess bool) error {
+func (ic *ContainerEngine) SetupRootless(_ context.Context, noMoveProcess bool, cgroupMode string) error {
runsUnderSystemd := systemd.RunsOnSystemd()
if !runsUnderSystemd {
isPid1 := os.Getpid() == 1
@@ -30,30 +30,33 @@ func (ic *ContainerEngine) SetupRootless(_ context.Context, noMoveProcess bool)
}
}
- // do it only after podman has already re-execed and running with uid==0.
- hasCapSysAdmin, err := unshare.HasCapSysAdmin()
- if err != nil {
- return err
- }
- // check for both euid == 0 and CAP_SYS_ADMIN because we may be running in a container with CAP_SYS_ADMIN set.
- if os.Geteuid() == 0 && hasCapSysAdmin {
- ownsCgroup, err := cgroups.UserOwnsCurrentSystemdCgroup()
+ configureCgroup := cgroupMode != "disabled" && cgroupMode != "split"
+ if configureCgroup {
+ // do it only after podman has already re-execed and running with uid==0.
+ hasCapSysAdmin, err := unshare.HasCapSysAdmin()
if err != nil {
- logrus.Infof("Failed to detect the owner for the current cgroup: %v", err)
+ return err
}
- if !ownsCgroup {
- conf, err := ic.Config(context.Background())
+ // check for both euid == 0 and CAP_SYS_ADMIN because we may be running in a container with CAP_SYS_ADMIN set.
+ if os.Geteuid() == 0 && hasCapSysAdmin {
+ ownsCgroup, err := cgroups.UserOwnsCurrentSystemdCgroup()
if err != nil {
- return err
+ logrus.Infof("Failed to detect the owner for the current cgroup: %v", err)
}
- unitName := fmt.Sprintf("podman-%d.scope", os.Getpid())
- if runsUnderSystemd || conf.Engine.CgroupManager == config.SystemdCgroupsManager {
- if err := systemd.RunUnderSystemdScope(os.Getpid(), "user.slice", unitName); err != nil {
- logrus.Debugf("Failed to add podman to systemd sandbox cgroup: %v", err)
+ if !ownsCgroup {
+ conf, err := ic.Config(context.Background())
+ if err != nil {
+ return err
+ }
+ unitName := fmt.Sprintf("podman-%d.scope", os.Getpid())
+ if runsUnderSystemd || conf.Engine.CgroupManager == config.SystemdCgroupsManager {
+ if err := systemd.RunUnderSystemdScope(os.Getpid(), "user.slice", unitName); err != nil {
+ logrus.Debugf("Failed to add podman to systemd sandbox cgroup: %v", err)
+ }
}
}
+ return nil
}
- return nil
}
pausePidPath, err := util.GetRootlessPauseProcessPidPath()
diff --git a/pkg/domain/infra/tunnel/system.go b/pkg/domain/infra/tunnel/system.go
index 492fd0a89..f091fc79c 100644
--- a/pkg/domain/infra/tunnel/system.go
+++ b/pkg/domain/infra/tunnel/system.go
@@ -13,7 +13,7 @@ func (ic *ContainerEngine) Info(ctx context.Context) (*define.Info, error) {
return system.Info(ic.ClientCtx, nil)
}
-func (ic *ContainerEngine) SetupRootless(_ context.Context, noMoveProcess bool) error {
+func (ic *ContainerEngine) SetupRootless(_ context.Context, noMoveProcess bool, cgroupMode string) error {
panic(errors.New("rootless engine mode is not supported when tunneling"))
}
This just leads to crashes:
$ podman ps
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x1766a2a]
goroutine 1 [running]:
main.persistentPreRunE(0x2d89da0, {0x2eb50c0, 0x0, 0x0})
github.com/containers/podman/cmd/podman/root.go:366 +0xaaa
github.com/containers/podman/vendor/github.com/spf13/cobra.(*Command).execute(0x2d89da0, {0xc0000400b0, 0x0, 0x0})
github.com/containers/podman/vendor/github.com/spf13/cobra/command.go:954 +0x951
github.com/containers/podman/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0x2d68ca0)
github.com/containers/podman/vendor/github.com/spf13/cobra/command.go:1115 +0x3ff
github.com/containers/podman/vendor/github.com/spf13/cobra.(*Command).Execute(...)
github.com/containers/podman/vendor/github.com/spf13/cobra/command.go:1039
github.com/containers/podman/vendor/github.com/spf13/cobra.(*Command).ExecuteContext(...)
github.com/containers/podman/vendor/github.com/spf13/cobra/command.go:1032
main.Execute()
github.com/containers/podman/cmd/podman/root.go:115 +0xb4
main.main()
github.com/containers/podman/cmd/podman/main.go:60 +0x452
(cmd/podman/root.go:366 is cgroupMode := cmd.LocalFlags().Lookup("cgroups").Value.String())
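The nil dereference is consistent with cmd.LocalFlags().Lookup("cgroups") returning nil for subcommands such as podman ps that do not define a --cgroups flag; calling .Value on the nil *pflag.Flag then panics. A minimal guard along these lines would avoid it (illustrative only, not necessarily what the PR does):

// Hypothetical guard: fall back to an empty mode when the flag is absent.
cgroupMode := ""
if f := cmd.LocalFlags().Lookup("cgroups"); f != nil {
	cgroupMode = f.Value.String()
}
err := registry.ContainerEngine().SetupRootless(registry.Context(), noMoveProcess, cgroupMode)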
Thanks for trying it.
Fixed that issue and opened a PR:
Issue Description
Running a rootless container with --cgroups=disabled still creates a new cgroup and runs the container in it. The issue can be reproduced in Podman 4.7.2 and 4.8.0 on Debian and Fedora.
Steps to reproduce the issue
podman run --rm --network=none --cgroups=disabled --cgroupns=host docker.io/library/debian:testing-20231120 sh -c 'cat /proc/$$/cgroup'
Describe the results you received
The cgroup is different from that of the process that invoked podman run:
Describe the results you expected
The cgroup is the same as that of the process that invoked podman run.
podman info output
Podman in a container
No
Privileged Or Rootless
Rootless
Upstream Latest Release
No
Additional environment details
No response
Additional information
No response