Closed flyn-org closed 3 years ago
Somewhat similar to https://github.com/containers/podman/issues/9219.
With the mount proc issue, are you running podman within podman?
You could try adding --security-opt=unmask=/proc/* and see if this helps.
Most of the maintainers here are probably not familiar with OpenWRT (I can say for certain I am not). More details on the environment Podman is being run in will help if you can give them, particularly ways it differs from a standard Linux distro
OpenWrt is a distribution that aims to provide firmware for routers and "small" network devices. One way that it differs from most other distributions is that it does not use systemd. OpenWrt relies on busybox, but things like shadow-utils are available too.
OpenWrt is a bit niche, and I do not expect most people would have a deep understanding of it. This is why I expect to do most of the engineering to get this to work. For example, if there is a privileged agent that is missing, then I would be happy to package it for OpenWrt or write a replacement. As I indicated, I have already made some progress with this type of work.
What I am really looking for is an explanation of how some parts of rootless podman work. (I also welcome suggestions for further reading.) My two questions above highlight what I think are the gaps in my knowledge, namely (1) how does non-root mount arbitrary things when mounting arbitrary things usually requires root access? and (2) how does non-root manipulate /sys/fs/cgroup/* when the permissions on those pseudo-files seem to prohibit write access to non-root?
Focusing on point (1), running strace -f podman run containers produces the following:
[...]
[pid 2524] mount("proc", "/proc/self/fd/7", "proc", MS_NOSUID|MS_NODEV|MS_NOEXEC, NULL <unfinished ...>
[pid 2518] read(6, <unfinished ...>
[pid 2524] <... mount resumed>) = -1 EPERM (Operation not permitted)
[...]
Working through the surrounding evidence, I think PID 2524 is crun as executed by podman. I also think the above system calls take place in src/libcrun/linux.c's do_mount(). To be clear, I am not asking why this fails on my system (OpenWrt), I am asking why it succeeds on yours! What special circumstance allows crun, running as non-root, to mount proc on RHEL, CentOS, Fedora, Ubuntu, and so on?
For the first one - we use a user namespace to allow (limited) access to the mount syscall. The kernel has been patched to allow non-root users the ability to mount a few types of filesystem within a user namespace, most notably tmpfs and fuse, with full overlayfs being available in very recent kernels. Very old kernels (I don't have a precise version, but I think it was 2017-ish that the patch allowing it landed) may not have FUSE in user namespace support, which is a problem for Podman.
For cgroups - we should not be doing anything with cgroups as rootless on cgroupsv1 systems. I believe we attempt to do a few things as holdovers from root, some of those printing warnings, but none of these are essential.
An unprivileged user can mount proc if it is in a user namespace and it owns the mount and the pid namespace. An additional requirement from the kernel is that there is already a procfs file system fully visible and already mounted.
Can you show the list of mounts on your system (cat /proc/self/mountinfo)?
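As a rough aid for the "fully visible" condition mentioned above, the following sketch lists any mount points that sit beneath /proc in mountinfo-format input. The helper name mounts_under_proc is hypothetical, and this is only a heuristic for spotting candidates; the check the kernel actually performs is more involved.

```sh
# List mount points located beneath /proc, reading mountinfo-format text on stdin.
# Field 5 of each mountinfo line is the mount point.
mounts_under_proc() {
    awk '$5 ~ /^\/proc\// { print $5 }'
}

# Example use on a live system:
#   mounts_under_proc < /proc/self/mountinfo
```

No output means nothing is mounted on top of a path inside /proc.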
@mheon, it seems that podman is trying to mount proc without the use of fuse-overlayfs, right? At least this is how I interpret the strace fragment above. Is this expected?
@giuseppe, proc is mounted elsewhere; see below. When you say, "if it is in a user namespace and it owns the mount and the pid namespace," how can I determine if podman has satisfied these conditions?
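One partial way to approach the "is it in a user namespace" half of this question is to read the uid_map of the process. In the initial user namespace that file holds a single identity mapping covering the whole ID range (0 0 4294967295); any other content suggests the process runs in a user namespace of its own. The helper name uid_map_is_initial is hypothetical, and this sketch says nothing about which namespace owns the mount or PID namespace.

```sh
# Succeed if the uid_map text on stdin is the identity mapping of the
# initial user namespace (a single line: 0 0 4294967295).
uid_map_is_initial() {
    awk 'NF == 3 && $1 == 0 && $2 == 0 && $3 == 4294967295 { full = 1 }
         END { exit !(NR == 1 && full) }'
}

# Example use, with $pid set to the process of interest (for instance crun):
#   if uid_map_is_initial < "/proc/$pid/uid_map"; then
#       echo "initial user namespace"
#   else
#       echo "separate user namespace"
#   fi
```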
OpenWrt uses the musl C library. Could that cause a problem?
$ cat /proc/self/mountinfo
13 1 8:2 / / rw,noatime - ext4 /dev/root rw
14 13 0:5 / /proc rw,nosuid,nodev,noexec,noatime - proc proc rw
15 13 0:14 / /sys rw,nosuid,nodev,noexec,noatime - sysfs sysfs rw
16 15 0:15 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw,nsdelegate
19 13 0:18 / /tmp rw,nosuid,nodev,noatime - tmpfs tmpfs rw
20 13 8:1 / /boot rw,noatime - ext4 /dev/sda1 rw
21 20 8:1 /boot /boot rw,noatime - ext4 /dev/sda1 rw
18 13 0:17 / /dev rw,nosuid,relatime - tmpfs tmpfs rw,size=512k,mode=755
22 18 0:19 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,mode=600,ptmxmode=000
17 15 0:6 / /sys/kernel/debug rw,noatime - debugfs debugfs rw
23 15 0:16 / /sys/fs/bpf rw,nosuid,nodev,noexec,noatime - bpf none rw,mode=700
24 19 0:18 /lib/containers/storage/overlay /tmp/lib/containers/storage/overlay rw,nosuid,nodev,noatime - tmpfs tmpfs rw
$ uname -a
Linux aquinas-user 5.4.124 #0 SMP Sat Jun 19 10:17:09 2021 x86_64 GNU/Linux
Perhaps something is missing from the kernel configuration?
$ grep _NS= ./build_dir/target-x86_64_musl/linux-x86_64/linux-5.4.124/.config
CONFIG_IPC_NS=y
CONFIG_NET_NS=y
CONFIG_PID_NS=y
CONFIG_USER_NS=y
CONFIG_UTS_NS=y
$ grep CGROUP ./build_dir/target-x86_64_musl/linux-x86_64/linux-5.4.124/.config
CONFIG_BLK_CGROUP=y
# CONFIG_BLK_CGROUP_IOCOST is not set
# CONFIG_BLK_CGROUP_IOLATENCY is not set
CONFIG_CGROUPS=y
CONFIG_CGROUP_BPF=y
CONFIG_CGROUP_CPUACCT=y
# CONFIG_CGROUP_DEBUG is not set
# CONFIG_CGROUP_DEVICE is not set
# CONFIG_CGROUP_FREEZER is not set
# CONFIG_CGROUP_NET_CLASSID is not set
# CONFIG_CGROUP_NET_PRIO is not set
# CONFIG_CGROUP_PERF is not set
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_RDMA=y
CONFIG_CGROUP_SCHED=y
CONFIG_CGROUP_WRITEBACK=y
# CONFIG_NETFILTER_XT_MATCH_CGROUP is not set
# CONFIG_NET_CLS_CGROUP is not set
CONFIG_SOCK_CGROUP_DATA=y
please try the following command: unshare -r -m -f -p --mount-proc=/proc echo it works
Does it work for you?
The command you suggested fails, even when run as root:
# unshare -r -m -f -p --mount-proc=/proc echo it works
unshare: mount /proc failed: Operation not permitted
Here is a portion of the output from strace -f unshare ...:
mprotect(0x7fe04829d000, 4096, PROT_READ) = 0
mprotect(0x7fe04830f000, 4096, PROT_READ) = 0
mprotect(0x408000, 4096, PROT_READ) = 0
geteuid() = 0
getegid() = 0
unshare(CLONE_NEWNS|CLONE_NEWUSER|CLONE_NEWPID) = 0
rt_sigaction(SIGINT, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fe0482e7286}, {sa_handler=SIG_DFL, sa_mask=[], sa_flag0
rt_sigaction(SIGTERM, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fe0482e7286}, {sa_handler=SIG_DFL, sa_mask=[], sa_fla0
rt_sigprocmask(SIG_BLOCK, ~[], [], 8) = 0
fork() = 2429
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
wait4(2429, strace: Process 2429 attached
<unfinished ...>
[pid 2429] gettid() = 1
[pid 2429] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 2429] open("/proc/self/uid_map", O_WRONLY) = 3
[pid 2429] write(3, "0 0 1", 5) = 5
[pid 2429] close(3) = 0
[pid 2429] open("/proc/self/setgroups", O_WRONLY) = 3
[pid 2429] write(3, "deny", 4) = 4
[pid 2429] close(3) = 0
[pid 2429] open("/proc/self/gid_map", O_WRONLY) = 3
[pid 2429] write(3, "0 0 1", 5) = 5
[pid 2429] close(3) = 0
[pid 2429] mount("none", "/", NULL, MS_REC|MS_PRIVATE, NULL) = 0
[pid 2429] mount("none", "/proc", NULL, MS_REC|MS_PRIVATE, NULL) = 0
[pid 2429] mount("proc", "/proc", "proc", MS_NOSUID|MS_NODEV|MS_NOEXEC, NULL) = -1 EPERM (Operation not permitted)
I am able to mount procfs elsewhere as root:
# mount proc /mnt/ -t proc -o nosuid,nodev,noexec
# mount
/dev/root on / type ext4 (rw,noatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,noatime)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,noatime)
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,noatime)
/dev/sda1 on /boot type ext4 (rw,noatime)
/dev/sda1 on /boot type ext4 (rw,noatime)
tmpfs on /dev type tmpfs (rw,nosuid,relatime,size=512k,mode=755)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,mode=600,ptmxmode=000)
debugfs on /sys/kernel/debug type debugfs (rw,noatime)
none on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,noatime,mode=700)
tmpfs on /tmp/lib/containers/storage/overlay type tmpfs (rw,nosuid,nodev,noatime)
proc on /mnt type proc (rw,nosuid,nodev,noexec,relatime)
My build of OpenWrt is not running SELinux or anything else I can think of that would restrict processes beyond the standard Unix permissions model. Do the kernel configuration fragments in my earlier comment shed any light? Is there anything else that I need to activate when I build my kernel?
I found some differences between a normal Debian system and OpenWrt, listed below:
OpenWrt had no cgroup.procs as Debian did:
buildman@debian:~$ ls /sys/fs/cgroup/unified/cgroup.procs
/sys/fs/cgroup/unified/cgroup.procs
Debian mounts cgroup2 on /sys/fs/cgroup/unified, while OpenWrt mounts cgroup2 on /sys/fs/cgroup. Would this matter?
Debian had xxx.slice and init.scope folders; OpenWrt did not. OpenWrt had only one folder, "services", which contained only dropbear, the SSH service.
Debian:
buildman@debian:~$ ls /sys/fs/cgroup/unified/
cgroup.controllers cgroup.max.descendants cgroup.stat cgroup.threads init.scope/ user.slice/
cgroup.max.depth cgroup.procs cgroup.subtree_control cpu.stat system.slice/
OpenWrt:
root@hp ~# ls /sys/fs/cgroup/
cgroup.controllers cgroup.procs cgroup.threads cpuset.mems.effective
cgroup.max.depth cgroup.stat cpu.stat io.stat
cgroup.max.descendants cgroup.subtree_control cpuset.cpus.effective services/
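To compare where cgroup2 is mounted on two systems without relying on directory listings, the mount point can be read from mountinfo, where the filesystem type is the field right after the "-" separator. This is a minimal sketch; the helper name cgroup2_mountpoints is hypothetical.

```sh
# Print the mount point of every cgroup2 filesystem, reading mountinfo text on stdin.
# Optional fields start at field 7; the filesystem type follows the "-" separator.
cgroup2_mountpoints() {
    awk '{
        for (i = 7; i <= NF; i++)
            if ($i == "-") {
                if ($(i + 1) == "cgroup2") print $5
                break
            }
    }'
}

# Example use on a live system:
#   cgroup2_mountpoints < /proc/self/mountinfo
```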
A friendly reminder that this issue had no activity for 30 days.
@giuseppe Any update?
If unshare -r -m -f -p --mount-proc=/proc echo it works fails as root, then there is something out of our control.
Can you please share the output of stat /proc and findmnt -o OPTIONS /proc?
Does it make any difference if you try to umount /proc first?
# unshare -r -m -f -p bash
# umount /proc
# mount -t proc proc /proc
root@host:~# stat /proc
File: /proc
Size: 0 Blocks: 0 IO Block: 1024 directory
Device: 5h/5d Inode: 1 Links: 115
Access: (0555/dr-xr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2021-07-20 23:01:12.050586116 -0500
Modify: 2021-07-20 23:01:12.050586116 -0500
Change: 2021-07-20 23:01:12.050586116 -0500
Birth: -
root@host:~# findmnt -o OPTIONS /proc
OPTIONS
rw,nosuid,nodev,noexec,noatime
root@host:~# unshare -r -m -f -p ash
BusyBox v1.33.1 (2021-07-19 17:10:18 UTC) built-in shell (ash)
root@host:~# umount /proc
umount: /proc: not mounted.
root@host:~# mount -t proc proc /proc
mount: /proc: permission denied.
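The host /proc above is mounted with noatime, while the procfs mounted by hand on /mnt earlier in this thread shows relatime. A small helper makes such option differences explicit. This is only a sketch: the name opts_missing_from is hypothetical, and nothing in this thread establishes whether any such difference matters to the kernel.

```sh
# Print each option from comma-separated list $1 that does not appear in list $2.
opts_missing_from() {
    for _opt in $(printf '%s' "$1" | tr ',' ' '); do
        case ",$2," in
            *",$_opt,"*) ;;                   # present in both lists
            *) printf '%s\n' "$_opt" ;;       # only in the first list
        esac
    done
}

# Example use, comparing the existing /proc with a procfs mounted on /mnt:
#   opts_missing_from "$(findmnt -n -o OPTIONS /proc)" "$(findmnt -n -o OPTIONS /mnt)"
```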
Thanks, what do you see with sudo findmnt -R -o TARGET,PROPAGATION / on the host?
If /proc is mounted as private, then its mount is not propagated into the new mount namespace and the procfs mount fails since the kernel expects a procfs to be already visible in the mount namespace (when running in a user namespace) before allowing a new procfs mount.
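Where findmnt is unavailable, the propagation state can also be read from the optional fields of /proc/self/mountinfo (tags such as shared:N or master:N between the mount options and the "-" separator). The sketch below prints those tags for one mount point and prints "private" when there are none; the helper name mount_propagation is hypothetical, and the raw kernel tags are printed rather than findmnt's wording.

```sh
# Print the propagation tags of mount point $1, reading mountinfo text on stdin.
# Optional fields start at field 7 and end at the "-" separator.
mount_propagation() {
    awk -v target="$1" '
        $5 == target {
            state = ""
            for (i = 7; i <= NF && $i != "-"; i++) {
                tag = $i
                sub(/:.*/, "", tag)           # "shared:12" becomes "shared"
                state = state (state == "" ? "" : ",") tag
            }
            print (state == "" ? "private" : state)
        }'
}

# Example use on a live system:
#   mount_propagation /proc < /proc/self/mountinfo
```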
It looks like /proc is private. How does one mount /proc in a way that is not private? This is new to me. The mount manpage mentions "Shared subtree operations"; is this relevant? What start-up component ensures the proper mounting of these things on, say, Fedora or Ubuntu?
root@host:~# findmnt -R -o TARGET,PROPAGATION /
TARGET PROPAGATION
/ private
├─/proc private
│ └─/proc/xen private
├─/sys private
│ ├─/sys/fs/cgroup private
│ ├─/sys/kernel/debug private
│ └─/sys/fs/bpf private
├─/dev private
│ └─/dev/pts private
├─/tmp private
│ └─/tmp/lib/containers/storage/overlay private
└─/boot private
└─/boot private
I did try to run mount --make-shared /proc, mount --make-shared /proc/xen, and then unshare -r -m -f -p --mount-proc=/proc echo it works. But the unshare command still said "unshare: mount /proc failed: Operation not permitted."
Can you try unmounting /proc/xen? That makes the /proc mount not fully visible, and then the kernel refuses to mount a fresh procfs if there is not already one fully visible.
root@host:~# umount /proc/xen/
root@host:~# mount --make-shared /proc
root@host:~# findmnt -R -o TARGET,PROPAGATION /
TARGET PROPAGATION
/ private
├─/proc shared
├─/sys private
│ ├─/sys/fs/cgroup private
│ ├─/sys/kernel/debug private
│ └─/sys/fs/bpf private
├─/dev private
│ └─/dev/pts private
├─/tmp private
│ └─/tmp/lib/containers/storage/overlay private
└─/boot private
└─/boot private
root@host:~# unshare -r -m -f -p --mount-proc=/proc echo it works
unshare: mount /proc failed: Operation not permitted
That is strange; the mount of /proc should not fail now, as it is fully visible.
What do you see with unshare -r -m -f -p cat /proc/self/mountinfo?
I ran the command you requested with /proc private and then shared:
root@host:~# findmnt -R -o TARGET,PROPAGATION /
TARGET PROPAGATION
/ private
├─/proc private
├─/sys private
│ ├─/sys/fs/cgroup private
│ ├─/sys/kernel/debug private
│ └─/sys/fs/bpf private
├─/dev private
│ └─/dev/pts private
├─/tmp private
│ └─/tmp/lib/containers/storage/overlay private
└─/boot private
└─/boot private
root@host:~# unshare -r -m -f -p cat /proc/self/mountinfo
26 19 202:2 / / rw,noatime - ext4 /dev/root rw
27 26 0:5 / /proc rw,nosuid,nodev,noexec,noatime - proc proc rw
28 26 0:13 / /sys rw,nosuid,nodev,noexec,noatime - sysfs sysfs rw
29 28 0:14 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw,nsdelegate
30 28 0:6 / /sys/kernel/debug rw,noatime - debugfs debugfs rw
31 28 0:20 / /sys/fs/bpf rw,nosuid,nodev,noexec,noatime - bpf none rw,mode=700
32 26 0:17 / /tmp rw,nosuid,nodev,noatime - tmpfs tmpfs rw
33 32 0:17 /lib/containers/storage/overlay /tmp/lib/containers/storage/overlay rw,nosuid,nodev,noatime - tmpfs tmpfs rw
34 26 202:1 / /boot rw,noatime - ext4 /dev/xvda1 rw
35 34 202:1 /boot /boot rw,noatime - ext4 /dev/xvda1 rw
36 26 0:16 / /dev rw,nosuid,relatime - tmpfs tmpfs rw,size=512k,mode=755
37 36 0:19 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,mode=600,ptmxmode=000
root@host:~# mount --make-shared /proc
root@host:~# findmnt -R -o TARGET,PROPAGATION /
TARGET PROPAGATION
/ private
├─/proc shared
├─/sys private
│ ├─/sys/fs/cgroup private
│ ├─/sys/kernel/debug private
│ └─/sys/fs/bpf private
├─/dev private
│ └─/dev/pts private
├─/tmp private
│ └─/tmp/lib/containers/storage/overlay private
└─/boot private
└─/boot private
root@host:~# unshare -r -m -f -p cat /proc/self/mountinfo
26 19 202:2 / / rw,noatime - ext4 /dev/root rw
27 26 0:5 / /proc rw,nosuid,nodev,noexec,noatime - proc proc rw
28 26 0:13 / /sys rw,nosuid,nodev,noexec,noatime - sysfs sysfs rw
29 28 0:14 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw,nsdelegate
30 28 0:6 / /sys/kernel/debug rw,noatime - debugfs debugfs rw
31 28 0:20 / /sys/fs/bpf rw,nosuid,nodev,noexec,noatime - bpf none rw,mode=700
32 26 0:17 / /tmp rw,nosuid,nodev,noatime - tmpfs tmpfs rw
33 32 0:17 /lib/containers/storage/overlay /tmp/lib/containers/storage/overlay rw,nosuid,nodev,noatime - tmpfs tmpfs rw
34 26 202:1 / /boot rw,noatime - ext4 /dev/xvda1 rw
35 34 202:1 /boot /boot rw,noatime - ext4 /dev/xvda1 rw
36 26 0:16 / /dev rw,nosuid,relatime - tmpfs tmpfs rw,size=512k,mode=755
37 36 0:19 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,mode=600,ptmxmode=000
root@host:~# unshare -r -m -f -p --mount-proc=/proc echo it works
unshare: mount /proc failed: Operation not permitted
Thanks. I can't spot anything wrong in your configuration now. It is something else blocking the mount. Do you see any error message in your system logs? Is the kernel using any custom patch?
@giuseppe, it might be the kernel. I have been trying to figure out the compile-time kernel settings that running podman as non-root user depends on. Can you suggest where I could find this information? For what it is worth, I went through a similar process as a part of the effort to bring SELinux to OpenWrt. (In this case, SELinux is not on.)
I am not aware of such documentation. Would it be possible to use the same source (no additional patches) and configuration that is used on other distros like Fedora?
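Short of rebuilding with another distribution's configuration, the two configurations can at least be compared. The sketch below prints options that are enabled (=y or =m) in a reference config but not in the config under test. The helper name kconfig_missing and both file paths in the usage comment are assumptions; the output is a list of candidates to review, not a list of options Podman is known to require.

```sh
# Print options enabled (=y or =m) in reference config $1 that are not
# enabled in config $2, one per line, sorted.
kconfig_missing() {
    awk -F= '
        /^CONFIG_[A-Za-z0-9_]+=[ym]$/ {
            if (FILENAME == ARGV[1]) ref[$1] = 1    # reference config
            else have[$1] = 1                       # config under test
        }
        END { for (opt in ref) if (!(opt in have)) print opt }
    ' "$1" "$2" | sort
}

# Example use (hypothetical paths):
#   kconfig_missing fedora.config \
#       ./build_dir/target-x86_64_musl/linux-x86_64/linux-5.4.124/.config
```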
A friendly reminder that this issue had no activity for 30 days.
Seems to be some activity on the OpenWrt end.
@rhatdan, we have indeed made progress on some of the surrounding bugs as documented in https://github.com/openwrt/packages/issues/15096. However, we have not yet figured out the (suspected) kernel differences that prevent mounting /proc within containers on OpenWrt as documented above.
Feel free to add more comments, I am closing this issue since there is nothing we can do in Podman.
/kind feature
I am trying to modify OpenWrt and its podman package to allow users other than root to manage containers on that system. @rhatdan suggested I create a GitHub issue after I brought this up on the podman mailing list.
I have made some progress, including working through some "bugs" in podman and the OpenWrt packages:
A summary of my work so far exists at https://github.com/openwrt/packages/issues/15096.
There are two things I do not yet understand, so I am looking for a summary of how these things work or some recommended reading regarding them:
[...] proc to /proc: Operation not permitted: OCI permission denied." Again, I am not sure what performs these privileged operations on other distributions. I did try to package fuse-overlayfs for OpenWrt, and I set mount_program = "/usr/bin/fuse-overlayfs" in /etc/containers/storage.conf, but this did not help.
Steps to reproduce the issue:
podman build --tar containers .
podman run containers
Describe the results you received:
Output of podman version:
Output of podman info --debug:
Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/master/troubleshooting.md)
Yes/Yes