JuliaContainerization / Sandbox.jl

The cultured host's toolkit for ill-mannered Linux guests.
Other
35 stars 5 forks source link

RFC: Add support for overlay maps #106

Closed maleadt closed 1 year ago

maleadt commented 1 year ago

Adds overlay maps to the usernamespace sandbox so that we can create overlay mounts (e.g. to make read-only mounts mutable) in a controlled manner (i.e. without having to peek into the executor's persistence_dir):

using Sandbox

overlay = mktempdir()
config = SandboxConfig(
    Dict("/" => Sandbox.debian_rootfs(),
         "/root/.julia/packages" => "/home/tim/.julia/packages");
    overlay_maps = Dict("/root/.julia/packages" => overlay),
    stdin, stdout, stderr, persist=false, verbose=true,
)

with_executor(UnprivilegedUserNamespacesExecutor) do exe
    run(exe, config, ignorestatus(`/bin/bash -c "touch /root/.julia/packages/foo"`))
end

# $overlay/upper/foo now exists
verbose sandbox enabled (running in unprivileged container mode)
Parsed --rootfs as "/home/tim/Julia/depot/artifacts/4d66e139e0bcfdfa5ec6a8942a938e754e17860f"
Parsed --cd as "/"
Parsed --map as "/home/tim/.julia/packages" -> "/root/.julia/packages"
Parsed --overlay as "/tmp/jl_XwYhld" -> "/root/.julia/packages"
Parsed --uid as "0"
Parsed --gid as "0"
Child Process PID is 1712994
--> Mapping 1000:1000 to 0:0 within container namespace
--> Creating overlay workdir at /root
--> Mounting overlay of /home/tim/Julia/depot/artifacts/4d66e139e0bcfdfa5ec6a8942a938e754e17860f at /home/tim/Julia/depot/artifacts/4d66e139e0bcfdfa5ec6a8942a938e754e17860f (modifications in /root/upper/rootfs, workspace in /root/work/rootfs)
--> Bind-mounting /home/tim/.julia/packages over /home/tim/Julia/depot/artifacts/4d66e139e0bcfdfa5ec6a8942a938e754e17860f/root/.julia/packages (read-only)
--> Mounting procfs at /home/tim/Julia/depot/artifacts/4d66e139e0bcfdfa5ec6a8942a938e754e17860f/proc
--> Bind-mounting /dev/null over /home/tim/Julia/depot/artifacts/4d66e139e0bcfdfa5ec6a8942a938e754e17860f//dev/null (read-write)
--> Bind-mounting /dev/tty over /home/tim/Julia/depot/artifacts/4d66e139e0bcfdfa5ec6a8942a938e754e17860f//dev/tty (read-write)
--> Bind-mounting /dev/zero over /home/tim/Julia/depot/artifacts/4d66e139e0bcfdfa5ec6a8942a938e754e17860f//dev/zero (read-write)
--> Bind-mounting /dev/random over /home/tim/Julia/depot/artifacts/4d66e139e0bcfdfa5ec6a8942a938e754e17860f//dev/random (read-write)
--> Bind-mounting /dev/urandom over /home/tim/Julia/depot/artifacts/4d66e139e0bcfdfa5ec6a8942a938e754e17860f//dev/urandom (read-write)
--> Bind-mounting /dev/shm over /home/tim/Julia/depot/artifacts/4d66e139e0bcfdfa5ec6a8942a938e754e17860f//dev/shm (read-write)
--> Bind-mounting /sys over /home/tim/Julia/depot/artifacts/4d66e139e0bcfdfa5ec6a8942a938e754e17860f//sys (read-only)
--> Bind-mounting /home/tim/Julia/depot/artifacts/4d66e139e0bcfdfa5ec6a8942a938e754e17860f/dev/pts/ptmx over /home/tim/Julia/depot/artifacts/4d66e139e0bcfdfa5ec6a8942a938e754e17860f/dev/ptmx (read-write)
--> Mounting overlay of /home/tim/Julia/depot/artifacts/4d66e139e0bcfdfa5ec6a8942a938e754e17860f/root/.julia/packages at /home/tim/Julia/depot/artifacts/4d66e139e0bcfdfa5ec6a8942a938e754e17860f/root/.julia/packages (modifications in /tmp/jl_XwYhld/upper, workspace in /tmp/jl_XwYhld/work)
Entering rootfs at /home/tim/Julia/depot/artifacts/4d66e139e0bcfdfa5ec6a8942a938e754e17860f
--> pivot_root() succeeded and unmounted old root
About to run `/bin/bash` `-c` `touch /root/.julia/packages/foo`
Child Process 1712994 exited with code 0

Intended use case: PkgEval, where we need a way to mount the package store. That mount needs to be read-only, since concurrent access results in corruption, yet we need a way to catch changes and add them to the cache afterwards. We cannot use layered depots for this because compilecache files hard-code the path to the Julia sources (see https://github.com/JuliaCI/PkgEval.jl/issues/158).

It would be cleaner if the overlay dir passed to the container would only be used as the upper dir, so that we can directly use the files in there, and not have to worry about removing the work directory (which has bad permissions that trigger a Julia bug, and may even be root-owned in the case of the PrivilegedUserNamespaceExecutor), but sadly the workdir has to live on the same filesystem as the upper dir.

codecov[bot] commented 1 year ago

Codecov Report

Base: 79.71% // Head: 79.71% // No change to project coverage :thumbsup:

Coverage data is based on head (975c302) compared to base (da1ccc1). Patch has no changes to coverable lines.

:exclamation: Current head 975c302 differs from pull request most recent head a3d1210. Consider uploading reports for the commit a3d1210 to get more accurate results

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #106 +/- ## ======================================= Coverage 79.71% 79.71% ======================================= Files 6 6 Lines 567 567 ======================================= Hits 452 452 Misses 115 115 ``` | [Impacted Files](https://codecov.io/gh/staticfloat/Sandbox.jl/pull/106?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Elliot+Saba) | Coverage Δ | | |---|---|---| | [src/SandboxConfig.jl](https://codecov.io/gh/staticfloat/Sandbox.jl/pull/106/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Elliot+Saba#diff-c3JjL1NhbmRib3hDb25maWcuamw=) | `91.30% <ø> (ø)` | | Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Elliot+Saba). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Elliot+Saba)

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

maleadt commented 1 year ago

Fixed an issue: maps now need to be mounted before procfs or we aren't using the tmpfs meant for overlays.

The remaining CI failure is the test that read-only mounts are actually read-only (from the sandbox POV), which is of course what this PR is proposing to change.

maleadt commented 1 year ago

Scratch that, I've redesigned the feature after some thinking, introducing an --overlay HOST_PATH:SANDBOX_PATH to be used in combination with read-only maps. This keeps everything working as it currently does, while introducing (UserNSExecutor-only) overlay functionality to make read-only maps mutable in a controlled manner (i.e. without having to peek into the persistence_dir):


using Sandbox

overlay = mktempdir()
config = SandboxConfig(
    Dict("/" => Sandbox.debian_rootfs(),
         "/root/.julia/packages" => "/home/tim/.julia/packages");
    overlay_maps = Dict("/root/.julia/packages" => overlay),
    stdin, stdout, stderr, persist=true, verbose=true,
)

with_executor(UnprivilegedUserNamespacesExecutor) do exe
    run(exe, config, ignorestatus(`/bin/bash -c "touch /root/.julia/packages/foo"`))
end
run(`find $overlay`)

The only annoying part is that this requires --persist, i.e., it doesn't work with the ephemeral one mounted at /proc because mount_overlays runs after procfs has been mounted again. We also cannot reorder that code, because it needs to run after mount_maps (since overlays are put on top of other mounts), yet mount_maps requires procfs to look into mtab.

maleadt commented 1 year ago

OK, made this work with persist=false too. This should be good to go, or good to review at least.

@staticfloat Could you explain why you were mounting the overlayfs-backing tmpfs on /proc? That breaks lots of things (and is why you quickly restore it), but I don't see why we're doing this in the first place. Many other directories are guaranteed to exist as per FSH, and this is outside of the sandbox' root we'll pivot into anyway, so should be invisible wherever we mount it. In this PR, I've changed it to mount at /root, which seems to work fine.

maleadt commented 1 year ago

FWIW, I actually started out with something simple that made read-only mounts use overlays, https://github.com/staticfloat/Sandbox.jl/commit/3e534e139f25e38f5245c4c22c574ec3eeb04c38, but because it would then behave differently from the Docker executor I went with a separate --overlays option.

staticfloat commented 1 year ago

Okay, I've been thinking about this, and while I like it, I'm not happy with the API. With this, we have the following:

1) A way to create read-write mappings (--workspace) 2) A way to create read-only mappings (--map) 3) By default, an overlay that allows modifications to the rootfs (but not to read-only mounts), which can be persisted or not (--persist) 4) A way to selectively apply an overlay to paths, granting similar functionality to --persist but to the read-only mounts (--overlay).

IMO, this is too complicated. In terms of realistic needs, I think we only actually care about the following things:

1) I need to be able to modify files in the host through read-write mappings. 2) I want to be able to do bad things in the sandbox without messing up anything in my outer system. 3) I want to be able to persist changes.

Improved API

From this perspective, I think a better API would be to have the following pieces of functionality:

1) --map and --workspace remain. (perhaps with a unified API, --map host:guest and --map host:guest:ro, if we wanted to copy docker) 2) We apply overlays to all read-only mounts (including the rootfs) by default. We store them all in the persistence directory so --persist works with all of them at once.

What we lose with this is the ability to get read-only file system errors, which I think is fine. Regarding docker compatibility, I think we can get this working in docker as well by generating an entrypoint bash script that runs the moral equivalent of:

for mount in $ro_mount_list; do
    workdir=$persist_dir/$(hash $mount)
    mkdir -p $workdir
    mount -t overlay overlay -o lowerdir=$mount,upperdir=$workdir/upper,workdir=$workdir/work $mount
done
umount $persist_dir

Since we usually run with --privileged, that should work.

maleadt commented 1 year ago

OK, I implemented (the core part) of that API. One problem that surfaced is that overlayfs doesn't support mounting a single file, breaking e.g. the internet tests here (which mount /etc/resolv.conf). Do we want to support that? I guess we could detect the file case, mount the parent dir in a temporary location and bind-mount the file, but that feels hacky.

EDIT: pushed that hack. Makes mount looks funny, but it works:

Child Process PID is 1914942
--> Mapping 1000:1000 to 0:0 within container namespace
--> Creating overlay workdir at /root
...
--> Mounting overlay of /tmp at /home/tim/Julia/depot/artifacts/4d66e139e0bcfdfa5ec6a8942a938e754e17860f/proc/etc/resolv.conf (modifications in /root/upper/etc/resolv.conf, workspace in /root/work/etc/resolv.conf)
--> Bind-mounting /home/tim/Julia/depot/artifacts/4d66e139e0bcfdfa5ec6a8942a938e754e17860f/proc/etc/resolv.conf/resolv.conf over /home/tim/Julia/depot/artifacts/4d66e139e0bcfdfa5ec6a8942a938e754e17860f/etc/resolv.conf (read-write)
...
About to run `/bin/bash` `-c` `mount`
overlay on / type overlay (rw,relatime,lowerdir=/home/tim/Julia/depot/artifacts/4d66e139e0bcfdfa5ec6a8942a938e754e17860f,upperdir=/root/upper/rootfs,workdir=/root/work/rootfs,index=off,xino=off,metacopy=off)
overlay on /proc/etc/resolv.conf type overlay (rw,relatime,lowerdir=/tmp,upperdir=/root/upper/etc/resolv.conf,workdir=/root/work/etc/resolv.conf,index=off,xino=off,metacopy=off)
overlay on /etc/resolv.conf type overlay (rw,relatime,lowerdir=/tmp,upperdir=/root/upper/etc/resolv.conf,workdir=/root/work/etc/resolv.conf,index=off,xino=off,metacopy=off)
maleadt commented 1 year ago

Pushed a commit that does the entrypoint hack as suggested by @staticfloat. I'm not particularly happy with that, as it only works when we have appropriate privileges, but on the other hand with --overlay we didn't support any of this using the Docker executor so I guess it's an improvement of some sort.


EDIT:

[  387.893369] overlayfs: filesystem on '/var/overlay//read_only/upper' not supported as upperdir

Gah. What if we used a host dir, and implement persistence just like how it's done with the UserNS executor?


Well, that works, except for the rootfs. Guess Linux doesn't like double overlayfs mounts? Using two separate persistence mechanisms works around this, but is not nice.

maleadt commented 1 year ago

I'm not happy with how this turned out; the Docker entrypoint solution (in order to preserve compatibility with the UserNS sandbox) too way too messy: using two persistence mechanisms, requiring privileged containers AND overlayfs support, and likely to break when using stripped container images (say, not containing a /bin/sh).

So instead I've opted to switch PkgEval over to using OCI container engines directly, https://github.com/JuliaCI/PkgEval.jl/pull/177. That gives me the control I need (pretty easy, too; instead of generating a SandboxConfig I just need to build slightly more verbose JSON), as well as many other features I would have to add to the UserNS sandbox here (notably cgroup resource limits).

So I'll go ahead and close this. FWIW, I think it would make sense to remove the UserNS sandbox here and build Sandbox.jl on top of an OCI container runtime; the code from https://github.com/JuliaCI/PkgEval.jl/pull/177 could help with that. It supports almost all features of Sandbox.jl, with the exception of whole-root persistence (mounting an overlay at / seems unsupported by both crun and runc).