containers / common

Location for shared common files in github.com/containers repos.
Apache License 2.0

seccomp rules need to be adjusted for systemd 253 #1337

Closed felixfontein closed 1 year ago

felixfontein commented 1 year ago

Right now, podman cannot start a container with systemd 253's init as the main process. This has been reported in https://bugzilla.redhat.com/show_bug.cgi?id=2165004 and https://github.com/systemd/systemd/issues/26474, and is apparently caused by podman's seccomp rules together with systemd 253 not having a graceful fallback. (https://github.com/systemd/systemd/pull/26478/commits/40389cf0804f77ec9b56847bc016eff9473459ae fixes the problem, which will probably help to figure out the adjustments that are needed.)

I hope this is the correct place to file this issue, since the default rules seem to be in this repository: https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json

Luap99 commented 1 year ago

@giuseppe PTAL

Luap99 commented 1 year ago

I don't think it is blocked by seccomp; your reproducer still fails with --security-opt seccomp=unconfined.

It does work, however, with --cap-add cap_sys_admin, and I don't think this is something we should give containers by default.

bluca commented 1 year ago

This is doing a clone() with CLONE_NEWNS; that shouldn't require CAP_SYS_ADMIN, no? Assuming it's running in a user namespace.

giuseppe commented 1 year ago

the seccomp profile is not blocking clone() as @Luap99 already pointed out.

I think the failure happens because CLONE_NEWNS needs CAP_SYS_ADMIN, either granted from the host or gained in a user namespace.

giuseppe commented 1 year ago

closing the issue since I think there is nothing to change in the seccomp profile

rhatdan commented 1 year ago

Yes, I believe this should be opened as a bug with systemd (or the kernel). I was able to reproduce this.

bluca commented 1 year ago

A workaround was already merged, but I'm more interested in finding a working solution: if we fork and create a user ns together with a mount ns, would that be allowed to work then? This is run by the same uid as pid1 in the container, which I assume is uid 0 inside the container, which I assume is already in a user ns?

giuseppe commented 1 year ago

I think that will work; I've done a quick test.

this is the problem you're seeing in systemd:

$ podman run --rm fedora sh -c 'mount -t tmpfs tmpfs /tmp; grep ^Cap /proc/self/status'
mount: /tmp: permission denied.
       dmesg(1) may have more information after failed mount system call.
CapInh: 0000000000000000
CapPrm: 00000000800405fb
CapEff: 00000000800405fb
CapBnd: 00000000800405fb
CapAmb: 0000000000000000

with a new user+mount namespace:

$ podman run --rm fedora unshare -rm sh -c 'mount -t tmpfs tmpfs /tmp; grep ^Cap /proc/self/status'
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000

bluca commented 1 year ago

Thanks, will give that a shot
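
The two capability masks in giuseppe's test output can be decoded to confirm the diagnosis. The following is a minimal sketch, assuming CAP_SYS_ADMIN is bit 21 as defined in linux/capability.h; the helper name has_cap is hypothetical and only used for illustration. It checks whether that bit is set in the CapEff values printed by the two podman runs.

```python
# Bit index of CAP_SYS_ADMIN, taken from <linux/capability.h>.
CAP_SYS_ADMIN = 21


def has_cap(mask_hex: str, cap: int) -> bool:
    """Return True if capability bit `cap` is set in a hex mask
    as printed in the Cap* lines of /proc/<pid>/status."""
    return bool((int(mask_hex, 16) >> cap) & 1)


# CapEff from the plain `podman run` (mount failed).
default_mask = "00000000800405fb"
# CapEff after `unshare -rm` inside the container (mount succeeded).
userns_mask = "000001ffffffffff"

print(has_cap(default_mask, CAP_SYS_ADMIN))  # False
print(has_cap(userns_mask, CAP_SYS_ADMIN))   # True
```

This matches the behaviour reported in the thread: the default container mask lacks CAP_SYS_ADMIN, while the process in the new user and mount namespace holds it within that namespace.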