bedrocklinux / bedrocklinux-userland

Tracks development of things such as scripts and (defaults for) config files for Bedrock Linux
https://bedrocklinux.org
GNU General Public License v2.0

Bubblewrap not working in strata other than init #245

Open ethan2-0 opened 2 years ago

ethan2-0 commented 2 years ago

Bubblewrap errors when run in a stratum other than the stratum that provides init. Error message is at the bottom of the steps to reproduce. It's not clear to me why this is happening, though looking at the output of strace, it seems I'm getting EPERM on a clone syscall with flags=CLONE_NEWNS|CLONE_NEWUSER|SIGCHLD, which makes sense given the error message.

To reproduce:

In my case, my init stratum is named debian and runs Debian stable. I've also created a second stratum, test-strat, which is Debian stable as well. Both have bubblewrap installed.

$ brl which
debian
$ brl deref init
debian
$ bwrap --version
bubblewrap 0.4.1
$ brl version
Bedrock Linux 0.7.24 Poki
$ bwrap --dev-bind / / echo hi
hi
$ strat test-strat
$ bwrap --dev-bind / / echo hi
bwrap: No permissions to create new namespace, likely because the kernel does not allow non-privileged user namespaces. See <https://deb.li/bubblewrap> or <file:///usr/share/doc/bubblewrap/README.Debian.gz>.

cptpcrd commented 2 years ago

Hmm, a quick look at the clone(2) man page shows this:

EPERM (since Linux 3.9) CLONE_NEWUSER was specified in the flags mask and the caller is in a chroot environment (i.e., the caller's root directory does not match the root directory of the mount namespace in which it resides).

This appears to be a security feature, likely to prevent unprivileged users from using user namespaces to escape chroots. Since Bedrock uses chroots to run non-init strata, this prevents creating user namespaces from any stratum other than the init stratum.

This check also doesn't appear to have any exceptions. It might be possible to work around it by creating a new mount namespace when switching strata, but 1) I'm not 100% sure that would work and 2) if Bedrock did that by default, it would probably break things.
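
The condition in that man page excerpt can be approximated from userland by comparing the caller's root directory with PID 1's root. This is only a rough sketch: it assumes GNU `stat`, assumes PID 1 is not itself chrooted and shares the caller's mount namespace, and reading `/proc/1/root` typically requires root. The name `roots_match` is made up for illustration.

```shell
# Hypothetical helper: succeed if two paths refer to the same directory
# (same device and inode numbers). Assumes GNU stat's -c option.
roots_match() {
    a=$(stat -c '%d:%i' "$1/.") || return 2
    b=$(stat -c '%d:%i' "$2/.") || return 2
    [ "$a" = "$b" ]
}

# Example use (as root): compare our root with PID 1's root.
# if roots_match / /proc/1/root; then
#     echo "root matches PID 1's root"
# else
#     echo "probably chrooted"
# fi
```

This is not the kernel's actual check, which compares the caller's root against the root of its mount namespace; it is only a proxy that happens to coincide when PID 1 sits at the namespace root.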

paradigm commented 2 years ago

I agree that the clone(2) EPERM item is likely the culprit. I also agree that per-stratum mount namespaces would likely fix this issue.

In the immediate future, work-arounds include:

  1. Manually running chmod u+s /path/to/bwrap as root. AFAIK bwrap is designed to be run as setuid in case the kernel has non-privileged user namespaces disabled. However, I certainly understand unnecessary setuid being undesirable.
  2. Pairing init with bwrap. This constraint is undesirable as well.
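
For the first work-around, whether a given binary already has the setuid bit set can be checked with the standard `test -u`. A small sketch; `is_setuid` is an illustrative name, the path to `bwrap` depends on the stratum, and the check looks only at the mode bit, not at who owns the file.

```shell
# Hypothetical helper: succeed if the file at $1 has the setuid bit set.
# Note: this does not verify that the file is owned by root.
is_setuid() {
    [ -u "$1" ]
}

# Example use:
# is_setuid "$(command -v bwrap)" && echo "bwrap is setuid"
```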

Bedrock 0.7.x relies on the common mount namespace pervasively. A ready example is `brl which`, which compares PID 1's mount table against another PID's to determine which stratum provides the second PID. I don't think a quick fix via a point update is viable.
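
Whether two processes share a mount namespace can be observed through the `/proc/PID/ns/mnt` symlinks, which read as `mnt:[inode]`. The sketch below is not how `brl which` works (that compares mount tables, as described above); it only illustrates the "common mount namespace" property. `same_mnt_ns` and the `PROC_ROOT` override are hypothetical names introduced here, and reading another user's process entries generally requires privileges.

```shell
# Hypothetical helper: succeed if PIDs $1 and $2 are in the same mount
# namespace, judged by the target of their /proc/PID/ns/mnt symlinks.
# PROC_ROOT is an assumed override, handy for testing; defaults to /proc.
same_mnt_ns() {
    ns_a=$(readlink "${PROC_ROOT:-/proc}/$1/ns/mnt") || return 2
    ns_b=$(readlink "${PROC_ROOT:-/proc}/$2/ns/mnt") || return 2
    [ "$ns_a" = "$ns_b" ]
}

# Example use (as root): is the current shell in PID 1's mount namespace?
# same_mnt_ns 1 "$$" && echo "shared mount namespace"
```

With per-stratum mount namespaces, a check like this would presumably start failing for processes in different strata, which is the property the 0.7.x tooling depends on not happening.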

The design of the upcoming 0.8.x is still somewhat fluid. I can try to incorporate per-stratum mount namespaces into it, although I can't make any promises. Off the top of my head it may introduce some design regressions:

While in principle having bwrap from any stratum just work is certainly desirable, it's not obvious to me that these trade-offs are worthwhile. It's also not obvious to me that they're not. I'll need to think about it.

paradigm commented 2 years ago

While not the main focus of this issue, I should point out that neither querying which stratum provides the current shell and then running a command, like so:

$ brl which
debian
[...]
$ bwrap --dev-bind / / echo hi
hi

nor starting a shell from a specific stratum and then running a command, like so:

$ strat test-strat
$ bwrap --dev-bind / / echo hi
bwrap: No permissions to create new namespace, likely because the kernel does not allow non-privileged user namespaces. See <https://deb.li/bubblewrap> or <file:///usr/share/doc/bubblewrap/README.Debian.gz>.

is guaranteed to get the command from the shell's stratum. Consider what happens in both of those cases if bwrap is installed in a third stratum and in neither of the other two. Another example, which may be easier to think about, is:

$ strat debian
$ brl which
debian
$ grep "^NAME" /etc/os-release
NAME="Debian GNU/Linux"
$ pacman --help | head -n1
usage:  pacman <operation> [...]

Keep in mind that, despite the discussion around namespaces and chroot, Bedrock is not containers.

Instead, I recommend either querying specifically about the command in question (rather than the shell):

$ brl which bwrap # if I run `bwrap` in this context, which stratum provides it?
debian
$ bwrap | head -n1
usage: bwrap [OPTIONS...] [--] COMMAND [ARGS...]
$ strat test-strat
$ brl which bwrap # if I run `bwrap` in this context, which stratum provides it?
test-strat
$ bwrap | head -n1
usage: bwrap [OPTIONS...] [--] COMMAND [ARGS...]

or just explicitly specifying which stratum's instance is desired:

$ strat debian bwrap | head -n1
usage: bwrap [OPTIONS...] [--] COMMAND [ARGS...]
$ strat test-strat bwrap | head -n1
usage: bwrap [OPTIONS...] [--] COMMAND [ARGS...]

paradigm commented 2 years ago

I've spent some time exploring the possibility of per-stratum mount namespaces. I think I've confirmed that this fixes the issue in a local hacky test. I also think I've found a way forward.

Enabling a stratum should:

In some hacky tests, I think I've confirmed we can do this by:

We also need some system to track the mount namespaces, associate them with strata, and a way for strat to setns the correct namespace. I came up with three possibilities to pursue here:

Sadly this retains most of the regressions I was worried about earlier; I couldn't find ways around them. However, as a bonus, it might improve the system shutdown experience. I think the kernel automatically handles unmounting mount points in namespaces with no processes/tracking-mounts/file-descriptors (https://unix.stackexchange.com/questions/212172/what-happens-if-the-last-process-in-a-namespace-exits). This will likely both slightly improve shutdown time and resolve this issue.

paradigm commented 2 years ago

After thinking about this even more, I think per-stratum namespaces would also help with:

The trade-off seems more and more in favor of per-stratum namespaces. I'm going to start planning a big refactor of the 0.8 efforts in this direction.