Open ethan2-0 opened 2 years ago
Hmm, a quick look at the clone(2) man page shows this:
EPERM (since Linux 3.9) CLONE_NEWUSER was specified in the flags mask and the caller is in a chroot environment (i.e., the caller's root directory does not match the root directory of the mount namespace in which it resides).
This appears to be a security feature, likely to prevent unprivileged users from using user namespaces to escape chroots. Since Bedrock uses chroots to run non-init strata, this prevents creating user namespaces outside of the init stratum.
This check also doesn't appear to have any exceptions. It might be possible to work around it by creating a new mount namespace when switching strata, but 1) I'm not 100% sure that would work and 2) if Bedrock did that by default, it would probably break things.
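A quick way to see whether this restriction applies in the current context is to try creating a user namespace and look at the result. The sketch below assumes unshare(1) from util-linux is installed; the function name can_create_userns is made up for illustration. It also reports "denied" when unshare is missing or when the kernel disables unprivileged user namespaces, so a "denied" result alone does not prove the chroot check is the cause.

```shell
#!/bin/sh
# Probe whether the current process may create a user namespace.
# Assumption: unshare(1) from util-linux is available on PATH.
# Prints "allowed" if `unshare --user true` succeeds, "denied" otherwise
# (including when unshare itself is not installed).
can_create_userns() {
    if unshare --user true 2>/dev/null; then
        echo allowed
    else
        echo denied
    fi
}
```

Running can_create_userns once in the init stratum and once after strat <other-stratum> should show whether the two contexts differ.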
I agree that the clone(2) EPERM item is likely the culprit. I also agree that per-stratum mount namespaces would likely fix this issue.
In the immediate future, work-arounds include:
- chmod u+s /path/to/bwrap as root. AFAIK bwrap is designed to be run as setuid in case the kernel has non-privileged user namespaces disabled. However, I certainly understand unnecessary setuid being undesirable.
- [...] bwrap. This constraint is undesirable as well.
Bedrock 0.7.x relies on the common mount namespace pervasively. A ready example is brl which, which compares PID 1's mount table against another PID's to determine which stratum provides the second PID. I don't think a quick fix via a point update is viable.
The design of the upcoming 0.8.x is still somewhat fluid. I can try to incorporate per-stratum mount namespaces into it, although I can't make any promises. Off the top of my head it may introduce some design regressions:
- strat currently only requires CAP_SYS_CHROOT. This will likely require strat be full blown setuid.
- strat performance hit? I think it would have to do at least one more system call: setns().
- [...] (bedrock.conf's share = lines) will probably require a reboot. AFAIK it's not possible to create new bind mounts across mount namespaces outside of a shared subtree mount.
- ~[...] brl which -p that don't require either setuid or querying some root process, neither of which are desirable. Unprivileged brl which -p support may be dropped.~ If we go with a per-stratum brld thread, then both that thread and the PID in question will have the same publicly-readable mountinfo specific to the namespace. We could do something like compare awk '$5 == "/" {print$1;exit}' /proc/.../mountinfo output.
While in principle having bwrap from any stratum just work is certainly desirable, it's not obvious to me if these trade-offs are worthwhile. It's also not obvious to me that they're not. I'll need to think about it.
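The mountinfo comparison mentioned in the list above could be wrapped up roughly as follows. This is only a sketch: the helper names root_mount_id and same_root_mount are made up here, and it assumes that two processes whose "/" entry carries the same mount ID share a mount namespace, which is the idea floated above and not something Bedrock currently does.

```shell
#!/bin/sh
# Print the mount ID (field 1) of the entry whose mount point (field 5)
# is "/" in the given mountinfo-format file.
root_mount_id() {
    awk '$5 == "/" {print $1; exit}' "$1"
}

# Succeed if both mountinfo files report the same, non-empty root mount ID.
# Hypothetical helper: illustrates the comparison only.
same_root_mount() {
    a=$(root_mount_id "$1") || return 2
    b=$(root_mount_id "$2") || return 2
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

A caller would pass something like /proc/<brld-thread>/mountinfo and /proc/<pid>/mountinfo. Since mountinfo is world-readable, this needs neither setuid nor a query to a root process.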
While not the main focus of this issue, I should point out that neither querying for the current shell and then running a command like so:
$ brl which
debian
[...]
$ bwrap --dev-bind / / echo hi
hi
nor specifying a shell and then running a command like so:
$ strat test-strat
$ bwrap --dev-bind / / echo hi
bwrap: No permissions to create new namespace, likely because the kernel does not allow non-privileged user namespaces. See <https://deb.li/bubblewrap> or <file:///usr/share/doc/bubblewrap/README.Debian.gz>.
is guaranteed to get the command from the shell's stratum. Consider what happens in both those cases if bwrap is installed in a third stratum and not in either of those strata. Another example which may be easier to think about is:
$ strat debian
$ brl which
debian
$ grep "^NAME" /etc/os-release
NAME="Debian GNU/Linux"
$ pacman --help | head -n1
usage: pacman <operation> [...]
Keep in mind that, despite the discussion around namespaces and chroot, Bedrock is not containers.
Rather, I recommend either querying specifically about the command in question (rather than the shell):
$ brl which bwrap # if I run `bwrap` in this context, which stratum provides it?
debian
$ bwrap | head -n1
usage: bwrap [OPTIONS...] [--] COMMAND [ARGS...]
$ strat test-strat
$ brl which bwrap # if I run `bwrap` in this context, which stratum provides it?
test-strat
$ bwrap | head -n1
usage: bwrap [OPTIONS...] [--] COMMAND [ARGS...]
or just explicitly specifying which stratum's instance is desired:
$ strat debian bwrap | head -n1
usage: bwrap [OPTIONS...] [--] COMMAND [ARGS...]
$ strat test-strat bwrap | head -n1
usage: bwrap [OPTIONS...] [--] COMMAND [ARGS...]
I've spent some time exploring the possibility of per-stratum mount namespaces. I think I've confirmed that this fixes the issue in a local hacky test. I also think I've found a way forward.
Enabling a stratum should:
In some hacky tests, I think I've confirmed we can do this by:
- [...] /etc, which is implemented via shared mounts, will require restarting all strata. Since we can't restart the init stratum on-the-fly without crashing the system, this will effectively require a reboot.
- [...] /bedrock/strata will be one of these shared/global mounts by default.
- [...] /bedrock/strata/<stratum-being-enabled>
- [...] /bedrock/strata is shared makes me think this should work fine at this point.
- [...] /bedrock/strata/<stratum-being-enabled> to some <new-root> and ensure both the bind-mount and its parent mount are private mounts.
- [...] pivot_root
- mount --move all mounts of interest to <new-root>/<path>.
- man mount says, regarding --move, that one cannot move a mount residing under a shared mount. However, in my testing with the above setup it does appear to work. If I confused a step and it doesn't actually work, we might be able to tower-of-hanoi things into place.
- pivot_root <new-root> <old-root>
- [...] <old-root>
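To make the order of the steps above easier to follow, here is a dry-run sketch that only prints a command sequence and runs nothing. Several things are assumptions, not taken from the thread: the function name plan_stratum_namespace, the example paths, that the commands would run as root inside an already-unshared mount namespace, and that the final step is a lazy unmount of the old root. Making the parent mount private is left out because its path depends on where <new-root> lives.

```shell
#!/bin/sh
# Print (do not execute) a possible command sequence for giving a stratum
# its own root inside a fresh mount namespace.
# Usage: plan_stratum_namespace <stratum> <new-root> <old-root> [mount...]
#   <old-root> must be a path at or below <new-root>, as pivot_root expects.
#   Each extra argument is a mount point to carry over via mount --move.
plan_stratum_namespace() {
    stratum=$1
    new_root=$2
    put_old=$3
    shift 3
    # Bind-mount the stratum's root and make the bind mount private.
    printf 'mount --bind /bedrock/strata/%s %s\n' "$stratum" "$new_root"
    printf 'mount --make-private %s\n' "$new_root"
    # Move each mount of interest underneath the new root.
    for m in "$@"; do
        printf 'mount --move %s %s%s\n' "$m" "$new_root" "$m"
    done
    # Swap roots, then detach the old root (assumed final step).
    printf 'pivot_root %s %s\n' "$new_root" "$put_old"
    printf 'umount -l %s\n' "${put_old#"$new_root"}"
}
```

For example, plan_stratum_namespace test-strat /mnt/new-root /mnt/new-root/old-root /proc /dev prints one line per step, which can be reviewed before trying anything by hand.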
We also need some system to track the mount namespaces, associate them with strata, and a way for strat to setns the correct namespace. I came up with three possibilities to pursue here:
- [...]
  - strat could then open and setns the /proc/<daemon-pid>/task/<stratum-thread-tid>/ns/mnt paths.
  - brl enable/brl disable could manage symlinks in some global location to associate the /proc/.../mnt paths with stratum names.
  - [...] htop, even if they utilize zero CPU cycles and very little memory.
- brl enable creates the namespace and bind-mounts its /proc/.../mnt file to save the namespace.
  - man 7 namespaces makes it seem like this should work, but I couldn't get it to; mount kept giving me errors. Maybe some sort of loop prevention? See https://unix.stackexchange.com/questions/517234/why-can-i-not-bind-a-mount-namespace-to-a-file
- [...] /proc/.../mnt file, and tracks the file descriptor.
  - ~strat could then communicate with the daemon via a socket to get the file descriptor to setns.~ strat can open/setns straight from /proc/<daemon-pid>/fd/<fd>. The daemon can surface which of its file descriptors corresponds to which stratum mount namespace via a symlink to it through FUSE. FUSE can cache symlinks on the kernel side such that repeated rapid access is very fast.
  - [...] /proc/<daemon-pid>/fd/<fd>
  - ~strat performance regression from Poki.~ IPC overhead concern was with sockets; reading /proc/<daemon-pid>/fd/<fd> resolves this.
Sadly this retains most of the regressions I was worried about earlier; I couldn't find ways around them. However, as a bonus, it might improve the system shutdown experience. I think the kernel automatically handles unmounting mount points in namespaces with no processes/tracking-mounts/file-descriptors (https://unix.stackexchange.com/questions/212172/what-happens-if-the-last-process-in-a-namespace-exits). This will likely both slightly improve shutdown time and resolve this issue.
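For the /proc/<daemon-pid>/fd/<fd> option, a rough shell sketch of the two pieces involved: telling mount namespaces apart, and composing the command that would join one. The helper names mnt_ns_of, same_mnt_ns, and enter_cmd are hypothetical. The last one only prints an nsenter(1) invocation, using its --mount=<file> form, and does not run it; whether that works against a daemon-held file descriptor is the assumption being explored in this thread, not something verified here.

```shell
#!/bin/sh
# Print the mount-namespace identifier of a PID (a string of the form
# "mnt:[<inode>]"), read from the /proc/<pid>/ns/mnt symlink.
mnt_ns_of() {
    readlink "/proc/$1/ns/mnt"
}

# Succeed if two PIDs are in the same mount namespace.
same_mnt_ns() {
    a=$(mnt_ns_of "$1") && b=$(mnt_ns_of "$2") && [ "$a" = "$b" ]
}

# Print (do not run) the nsenter command that would run <program> in the
# mount namespace behind a daemon's file descriptor.
# Usage: enter_cmd <daemon-pid> <fd> <program>
enter_cmd() {
    printf 'nsenter --mount=/proc/%s/fd/%s %s\n' "$1" "$2" "$3"
}
```

same_mnt_ns 1 "$$" would be one way to check, from a shell, whether the current stratum shares PID 1's mount namespace once per-stratum namespaces exist.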
After thinking about this even more, I think per-stratum namespaces would also help with:
- [...] /boot mount well. If we namespace it, it'll only see one.
- java prints a warning about cpuset mounts, which has surprised some users.
The trade-off seems more and more in favor of per-stratum namespaces. I'm going to start planning a big refactor of the 0.8 efforts in this direction.
Bubblewrap errors when run in a stratum other than the stratum that provides init. Error message is at the bottom of the steps to reproduce. It's not clear to me why this is happening, though looking at the output of strace, it seems I'm getting EPERM on a clone syscall with flags=CLONE_NEWNS|CLONE_NEWUSER|SIGCHLD, which makes sense given the error message.
To reproduce:
In my case, my init stratum is named debian, using Debian stable, and I've also created test-strat, also Debian stable. Both have bubblewrap installed.