Closed felixfontein closed 1 year ago
@giuseppe PTAL
I don't think it is blocked by seccomp; your reproducer still fails with --security-opt seccomp=unconfined.
It does work with --cap-add cap_sys_admin, but I don't think this is something we should give containers by default.
This is doing a clone() with CLONE_NEWNS; that shouldn't require CAP_SYS_ADMIN, no? Assuming it's running in a user namespace.
The seccomp profile is not blocking clone(), as @Luap99 already pointed out.
I think the failure happens because CLONE_NEWNS needs CAP_SYS_ADMIN, either granted from the host or gained in a user namespace.
Closing the issue, since I think there is nothing to change in the seccomp profile.
Yes I believe this should be opened as a bug with systemd (or the kernel). I was able to reproduce this.
A workaround was already merged, but I'm more interested in finding a working solution: if we fork and create a user ns together with a mount ns, would that be allowed to work? This is run by the same uid as pid1 in the container, which I assume is uid 0 inside the container, which I assume is already in a user ns?
I think that will work; I've done a quick test.
This is the problem you're seeing in systemd:
$ podman run --rm fedora sh -c 'mount -t tmpfs tmpfs /tmp; grep ^Cap /proc/self/status'
mount: /tmp: permission denied.
dmesg(1) may have more information after failed mount system call.
CapInh: 0000000000000000
CapPrm: 00000000800405fb
CapEff: 00000000800405fb
CapBnd: 00000000800405fb
CapAmb: 0000000000000000
with a new user+mount namespace:
$ podman run --rm fedora unshare -rm sh -c 'mount -t tmpfs tmpfs /tmp; grep ^Cap /proc/self/status'
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
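For anyone reading the masks above: CAP_SYS_ADMIN is capability number 21 in <linux/capability.h>, so the difference between the two runs can be checked by testing that bit in the CapEff value. A minimal sketch in plain sh follows; has_cap is a hypothetical helper written for this comment, not something shipped with podman or util-linux.

```shell
# CAP_SYS_ADMIN is capability number 21 in <linux/capability.h>.
CAP_SYS_ADMIN=21

# has_cap MASK BIT: print "yes" if bit BIT is set in the hex MASK, else "no".
# (Hypothetical helper for illustration only.)
has_cap() {
    if [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]; then
        echo yes
    else
        echo no
    fi
}

has_cap 00000000800405fb "$CAP_SYS_ADMIN"   # default container mask: prints "no"
has_cap 000001ffffffffff "$CAP_SYS_ADMIN"   # mask after unshare -rm: prints "yes"
```

If libcap's capsh is installed, capsh --decode=00000000800405fb should list the same set by name, which is probably the easier route for a one-off check.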
Thanks, will give that a shot
Right now, podman cannot start a container with systemd 253's init as the main process. This has been reported in https://bugzilla.redhat.com/show_bug.cgi?id=2165004 and https://github.com/systemd/systemd/issues/26474, and is apparently caused by podman's seccomp rules together with systemd 253 not having a graceful fallback. (https://github.com/systemd/systemd/pull/26478/commits/40389cf0804f77ec9b56847bc016eff9473459ae fixes the problem, which will probably help to figure out the adjustments that are needed.)
I hope this is the correct place to file this issue, since the default rules seem to be in this repository: https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json