Open adrelanos opened 7 years ago
Looks like the combination of --unshare-user, --unshare-pid, and --proc /proc is causing this. Test case:
bwrap --ro-bind / / --unshare-user --unshare-pid --proc /proc /bin/bash
If I remove any of those options, /bin/bash is started. Otherwise, it throws an error:
Can't mount proc on /newroot/proc: Operation not permitted
Running with strace doesn't say much more; indeed, the mount syscall fails with EPERM:
mount("proc", "/newroot/proc", "proc", MS_MGC_VAL|MS_NOSUID|MS_NODEV|MS_NOEXEC, NULL) = -1 EPERM (Operation not permitted)
Any idea?
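The "remove any one option and it works" observation can be checked systematically. Below is a minimal sketch that builds the full command from the test case plus the three variants with one suspect option dropped. The flags are the ones from the report; substituting /bin/true for /bin/bash (so the runs are non-interactive) and the helper names `variants` and `run_all` are assumptions made for illustration.

```python
import subprocess

# Options taken from the test case in this report.
BASE = ["bwrap", "--ro-bind", "/", "/"]
SUSPECTS = [["--unshare-user"], ["--unshare-pid"], ["--proc", "/proc"]]
# Assumption: /bin/true stands in for /bin/bash so each run exits immediately.
PAYLOAD = ["/bin/true"]


def variants():
    """Yield (dropped_option, argv): the full command, then one removal each."""
    yield None, BASE + [arg for opt in SUSPECTS for arg in opt] + PAYLOAD
    for i, dropped in enumerate(SUSPECTS):
        kept = [arg for j, opt in enumerate(SUSPECTS) if j != i for arg in opt]
        yield dropped[0], BASE + kept + PAYLOAD


def run_all():
    """Run every variant and print its exit status and stderr (needs bwrap)."""
    for dropped, argv in variants():
        result = subprocess.run(argv, capture_output=True, text=True)
        label = "all options" if dropped is None else f"without {dropped}"
        print(f"{label}: exit {result.returncode} {result.stderr.strip()}")
```

Calling `run_all()` on an affected system should, per the report, show a failure only for the "all options" line.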
So, the kernel disallows mounting proc in the user + pid namespace. That is weird. Clearly it has mount capabilities, because earlier mounts succeeded.
In the upstream kernel, procfs has:
static struct file_system_type proc_fs_type = {
.name = "proc",
.mount = proc_mount,
.kill_sb = proc_kill_sb,
.fs_flags = FS_USERNS_MOUNT,
};
This flag (FS_USERNS_MOUNT) should allow mounting a new proc instance in a user namespace. Does the qubes kernel change this in any way?
And anyway, the debian build of bubblewrap uses setuid, so it should have capabilities in the parent namespace too. Very weird.
Does qubes itself use namespaces?
No, the kernel is very close to upstream one. Also we don't use namespaces... It's really strange.
-- Best Regards, Marek Marczykowski-Górecki, Invisible Things Lab
I wonder if it's related to this: https://lwn.net/Articles/644932/
I.e. maybe your /proc has some mount flag, or some covering mount.
How does your /proc/self/mounts look?
sudo cat /proc/self/mounts
/dev/mapper/dmroot / ext4 rw,noatime,data=ordered 0 0
/dev/xvdd /lib/modules/4.4.31-11.pvops.qubes.x86_64 ext3 ro,relatime,data=ordered 0 0
sysfs /sys sysfs rw,relatime 0 0
proc /proc proc rw,relatime 0 0
devtmpfs /dev devtmpfs rw,nosuid,size=149600k,nr_inodes=37400,mode=755 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,size=1048576k,nr_inodes=39133 0 0
devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,nodev,size=156532k,nr_inodes=39133,mode=755 0 0
tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,nr_inodes=39133 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,size=156532k,nr_inodes=39133,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=22,pgrp=1,timeout=300,minproto=5,maxproto=5,direct 0 0
tmpfs /tmp tmpfs rw,size=1048576k,nr_inodes=39133 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
configfs /sys/kernel/config configfs rw,relatime 0 0
xen /proc/xen xenfs rw,relatime 0 0
/dev/xvdb /rw ext4 rw,relatime,discard,data=ordered 0 0
/dev/xvdb /home ext4 rw,relatime,discard,data=ordered 0 0
/dev/xvdb /var/spool/cron ext4 rw,relatime,discard,data=ordered 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
tmpfs /run/user/1000 tmpfs rw,nosuid,nodev,relatime,size=31308k,nr_inodes=39134,mode=700,uid=1000,gid=1000 0 0
I don't have a xen build, but reading the code it seems this is the problem:
xen /proc/xen xenfs rw,relatime 0 0
This is created if you have the XEN_COMPAT_XENFS config option enabled in the kernel, and it is created by:
proc_mkdir("xen", NULL);
However, as far as I can see, that isn't enough for the kernel to realize that this is an "empty" directory and that the /proc/xen mount therefore doesn't cover anything. It should really call proc_create_mount_point("xen") for this to work.
Can you try disabling that kernel config option? (or fixing the mountpoint as per the above).
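The diagnosis above came from reading the mount table by eye. A minimal sketch of the same step is to list everything mounted beneath /proc, since those are the candidates for a "covering" mount. The function name `mounts_under` and the shortened `SAMPLE` text (four lines copied from the table posted earlier) are illustrative assumptions; the sketch also ignores the octal escaping that /proc/self/mounts uses for spaces in paths.

```python
def mounts_under(mounts_text, parent="/proc"):
    """Return (mountpoint, fstype) pairs mounted strictly below `parent`."""
    prefix = parent.rstrip("/") + "/"
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mountpoint, fstype = fields[1], fields[2]
        if mountpoint.startswith(prefix):
            found.append((mountpoint, fstype))
    return found


# A few lines copied from the mount table posted in this thread.
SAMPLE = """\
proc /proc proc rw,relatime 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=22,pgrp=1,timeout=300,minproto=5,maxproto=5,direct 0 0
xen /proc/xen xenfs rw,relatime 0 0
/dev/xvdb /home ext4 rw,relatime,discard,data=ordered 0 0
"""

for mountpoint, fstype in mounts_under(SAMPLE):
    print(mountpoint, fstype)
```

On a live system the input would be the contents of /proc/self/mounts. The listing alone cannot say which entries are harmless: /proc/sys/fs/binfmt_misc also shows up here, yet only /proc/xen is identified as the blocker in this thread.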
Indeed, after unmounting /proc/xen it does work. I wonder if anything still uses /proc/xen in Qubes... AFAIR it's a legacy location and the new one is /dev/xen. There were more problems with /proc/xen (where "normal files" behave like character devices...). The fact that I could unmount it without killing anything suggests it isn't used anymore :)
If you want to be conservative, it might work to add a patch to bwrap to unmount it?
(Just in the new mount namespace)
That sounds awesome! Please do!
No, we can't unmount it. That's the problem, essentially. If /foo and /foo/bar are mountpoints when we create an unprivileged user namespace, then we get the two inherited as a unit, and we cannot unmount /foo/bar, because that may expose files under it that were not visible in the parent namespace. The same is actually true for mounting a new procfs instance: if /proc/foo was overmounted in the host, then we can't mount a fresh /proc, because we could see into foo where we couldn't before.
Of course, in some cases we know it is safe, because foo is always empty, since the only reason it's there is as a mountpoint. In such cases the kernel marks these directories as "always-empty", and mounts on top of them are not considered to cover anything, thus allowing a fresh proc to be mounted.
Changing proc_mkdir("xen", NULL) to proc_create_mount_point("xen") in the kernel would fix it, as the xen directory is then not considered covered.
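The rule described above can be written down as a toy model. This is not the kernel's code, only a sketch of the reasoning in this comment: a fresh procfs is allowed only if every mount beneath the existing /proc sits on a directory marked "always-empty". The function name and both example sets are assumptions; in particular, treating /proc/sys/fs/binfmt_misc as always-empty is inferred from the report that things work once /proc/xen is unmounted. The real kernel check considers more than this (mount flags, for example, which an earlier comment raises as a possibility).

```python
def fresh_proc_allowed(submounts, always_empty):
    """Toy model of the visibility rule, not the kernel implementation.

    submounts    -- mountpoints currently mounted beneath /proc
    always_empty -- directories the kernel treats as pure mountpoints
    Returns (allowed, covering), where `covering` lists the offending mounts.
    """
    covering = [m for m in submounts if m not in always_empty]
    return (not covering), covering


submounts = ["/proc/sys/fs/binfmt_misc", "/proc/xen"]

# With proc_mkdir("xen", NULL): /proc/xen is an ordinary directory.
print(fresh_proc_allowed(submounts, {"/proc/sys/fs/binfmt_misc"}))

# With proc_create_mount_point("xen"): /proc/xen counts as always-empty.
print(fresh_proc_allowed(submounts, {"/proc/sys/fs/binfmt_misc", "/proc/xen"}))
```

In the model, the first call reports /proc/xen as a covering mount and refuses, and the second call allows the mount, which matches the behaviour described in the thread.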
@alexlarsson Can we take advantage of the fact that we are suid to forcibly unmount /proc/xen in the child? That does mean hardcoding /proc/xen, but I consider that safe.
The suid path isn't the future though. Based on comment https://github.com/projectatomic/bubblewrap/issues/134#issuecomment-271998694 it sounds like Qubes is going to disable the legacy mountpoint which should address this issue, right?
A quick git log -G proc.*mkdir.*xen hits this commit which is in 4.10. So - anyone affected, upgrade your kernel.
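For anyone wondering whether their kernel is new enough, here is a rough sketch that compares a kernel release string against the 4.10 threshold mentioned above. The helper names are assumptions made for illustration, and the result is only a hint: a distribution kernel may carry the fix as a backport, or may mount /proc/xen through some other path.

```python
import re

FIXED_IN = (4, 10)  # upstream version named in the comment above


def release_tuple(release):
    """Extract (major, minor) from a release string such as '4.4.31-11.pvops'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def has_upstream_fix(release):
    """True if the release is at or above the version carrying the fix."""
    return release_tuple(release) >= FIXED_IN


# Release string taken from the /lib/modules mount in the reporter's table.
print(has_upstream_fix("4.4.31-11.pvops.qubes.x86_64"))
```

To check the running kernel, pass `platform.release()` from the standard library instead of the hardcoded string.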
@cgwalters bwrap is suid at least on my system, and it would be nice to use it to solve this problem.
Also, apparently several legacy scripts in Qubes rely on /proc/xen.
Not that many. There is only one thing that is still used from that - /proc/xen/capabilities, to detect dom0. Once replaced, we can get rid of the /proc/xen mount.
This is still broken as of today in Qubes 3.2 with Fedora 27 template. Notably, it breaks video thumbnailing in Nautilus (and presumably other programs, whose video thumbnails do not show up):
[pid 8531] execve("/usr/bin/bwrap", ["bwrap", "--ro-bind", "/usr", "/usr", "--ro-bind", "/lib", "/lib", "--ro-bind", "/lib64", "/lib64", "--proc", "/proc", "--dev", "/dev", "--symlink", "usr/bin", "/bin", "--symlink", "usr/sbin", "/sbin", "--chdir", "/", "--setenv", "GIO_USE_VFS", "local", "--unshare-all", "--die-with-parent", "--bind", "/tmp/gnome-desktop-thumbnailer-0"..., "/tmp", "--ro-bind", "/home/user/sshfs/WhatsApp/Media/"..., ...], 0x58e7594331f0 /* 17 vars */ <unfinished ...>
[pid 3896] write(6, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 3896] write(6, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 3896] write(23, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 3948] write(23, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 3896] write(6, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 3948] write(4, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 8531] <... execve resumed> ) = 0
strace: Process 8532 attached
[pid 8531] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 8532] write(6, "0 1000 1\n", 9) = 9
[pid 8532] write(6, "deny\n", 5) = 5
[pid 8532] write(6, "0 1000 1\n", 9) = 9
[pid 8532] write(2, "bwrap: ", 7) = 7
[pid 8532] write(2, "Can't mount proc on /newroot/pro"..., 33) = 33
[pid 8532] write(2, ": Operation not permitted\n", 26) = 26
[pid 8532] +++ exited with 1 +++
[pid 8531] +++ exited with 1 +++
[pid 3947] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=8531, si_uid=1000, si_status=1, si_utime=0, si_stime=0} ---
Please fix this.
Using a Qubes Debian jessie based AppVM with bubblewrap from jessie-backports (version 0.1.4-2~bpo8+1). (Neither AppArmor nor grsecurity is involved.)
Here are instructions on how to reproduce this in Qubes: https://github.com/QubesOS/qubes-issues/issues/2540
A simple test
bwrap --ro-bind / / --proc /proc --dev /dev /bin/bash
worked for me. Outside of Qubes, i.e. in a non-Qubes Debian jessie (VirtualBox) VM, sandboxed-tor-browser works fine. So I guess "something that Qubes does breaks bubblewrap". Could you please help us make this more specific?
I've been advised to:
Do you know why this is happening? How to fix this? Want any debug output? If you would like me to rebuild bubblewrap with debugging enabled, where can I find build instructions?