JuliaContainerization / Sandbox.jl

The cultured host's toolkit for ill-mannered Linux guests.
Other
35 stars 5 forks source link

Revert "Bindmount the sysfs into our root_dir" #97

Closed staticfloat closed 2 years ago

staticfloat commented 2 years ago

It turns out that we can't bindmount sysfs if we're using the unprivileged executor, which is our favorite executor to use. X-ref: https://github.com/nestybox/sysbox/issues/67#issuecomment-691524489

This reverts commit a58ccf0caaaefd21e1b13c1591886f45c3f0bed7.

staticfloat commented 2 years ago

@DilumAluthge @keno I just pushed through a new version of UserNSSandbox_jll and found that this commit broke our ability to use unprivileged user namespaces.

Apparently the Linux kernel puts restrictions on who can mount sysfs. Keno, do you have any magic incantations that we can use to get this to magically work?

Keno commented 2 years ago

Hmm, but that's about mounting the sysfs directly. Here we're just bindmounting, which should always be allowed.

staticfloat commented 2 years ago

Okay you prompted me to look a bit deeper, strace shows the following:

[pid 3992451] mount("/sys", "/home/sabae/.julia/artifacts/562768a40e93d27b79fbedf9cfa7883409d494ea//sys", 0x4894c7, MS_BIND|MS_REC, NULL) = 0
[pid 3992451] mount("/sys", "/home/sabae/.julia/artifacts/562768a40e93d27b79fbedf9cfa7883409d494ea//sys", 0x4894c7, MS_RDONLY|MS_NOSUID|MS_NODEV|MS_REMOUNT|MS_BIND, NULL) = -1 EPERM (Operation not permitted)

So it looks like the first mount passes, but when we try to remount, it fails. I guess that's not surprising since we're asking for things like MS_NODEV, which likely wouldn't work. This is coming from this branch, which we do only when we're asking for a read-only bindmount. Should we mount /sys without trying to remount, or should we alter the flags somewhat?

Keno commented 2 years ago

I suspect you are correct that the issue is MS_NODEV. Try dropping that: https://github.com/torvalds/linux/blob/952923ddc01120190dcf671e7b354364ce1d1362/fs/namespace.c#L2569-L2571

maleadt commented 2 years ago

I suspect you are correct that the issue is MS_NODEV. Try dropping that: https://github.com/torvalds/linux/blob/952923ddc01120190dcf671e7b354364ce1d1362/fs/namespace.c#L2569-L2571

We need MS_NODEV, because that code path prevents us from clearing it on remount. In addition, we need to preserve noexec, because /sys here is mounted with nosuid,nodev,noexec. That's the actual issue here, already noted by a TODO in the sandbox source code.

Keno commented 2 years ago

We need MS_NODEV, because that code path prevents us from clearing it on remount. In addition, we need to preserve noexec, because /sys here is mounted with nosuid,nodev,noexec. That's the actual issue here, already noted by a TODO in the sandbox source code.

Ah, good catch.

maleadt commented 2 years ago

I'll have a look at fixing this, by parsing procfs (fdinfo and mountinfo) according to that TODO.