Closed staticfloat closed 2 years ago
@DilumAluthge @keno I just pushed through a new version of UserNSSandbox_jll and found that this commit broke our ability to use unprivileged user namespaces.
Apparently the Linux kernel puts restrictions on who can mount sysfs
. Keno, do you have any magic incantations that we can use to get this to magically work?
Hmm, but that's about mounting the sysfs directly. Here we're just bindmounting, which should always be allowed.
Okay you prompted me to look a bit deeper, strace
shows the following:
[pid 3992451] mount("/sys", "/home/sabae/.julia/artifacts/562768a40e93d27b79fbedf9cfa7883409d494ea//sys", 0x4894c7, MS_BIND|MS_REC, NULL) = 0
[pid 3992451] mount("/sys", "/home/sabae/.julia/artifacts/562768a40e93d27b79fbedf9cfa7883409d494ea//sys", 0x4894c7, MS_RDONLY|MS_NOSUID|MS_NODEV|MS_REMOUNT|MS_BIND, NULL) = -1 EPERM (Operation not permitted)
So it looks like the first mount passes, but when we try to remount, it fails. I guess that's not surprising since we're asking for things like MS_NODEV
, which likely wouldn't work. This is coming from this branch, which we do only when we're asking for a read-only bindmount. Should we mount /sys
without trying to remount, or should we alter the flags somewhat?
I suspect you are correct that the issue is MS_NODEV. Try dropping that: https://github.com/torvalds/linux/blob/952923ddc01120190dcf671e7b354364ce1d1362/fs/namespace.c#L2569-L2571
I suspect you are correct that the issue is MS_NODEV. Try dropping that: https://github.com/torvalds/linux/blob/952923ddc01120190dcf671e7b354364ce1d1362/fs/namespace.c#L2569-L2571
We need MS_NODEV, because that code path prevents us from clearing it on remount. In addition, we need to preserve noexec
, because /sys
here is mounted with nosuid,nodev,noexec
. That's the actual issue here, already noted by a TODO in the sandbox source code.
We need MS_NODEV, because that code path prevents us from clearing it on remount. In addition, we need to preserve
noexec
, because/sys
here is mounted withnosuid,nodev,noexec
. That's the actual issue here, already noted by a TODO in the sandbox source code.
Ah, good catch.
I'll have a look at fixing this, by parsing procfs (fdinfo and mountinfo) according to that TODO.
It turns out that we can't bindmount
sysfs
if we're using the unprivileged executor, which is our favorite executor to use. X-ref: https://github.com/nestybox/sysbox/issues/67#issuecomment-691524489This reverts commit a58ccf0caaaefd21e1b13c1591886f45c3f0bed7.