put xsnap worker in a seccomp jail

warner commented 3 years ago

What is the Problem Being Solved?

seccomp(2) is a Linux kernel facility that allows a process to voluntarily give up access to nearly everything. Once invoked (in the original "strict" mode), the process can only use read(), write(), and _exit(). Any other syscall causes the process to be killed. The program can only read/write pre-existing file descriptors; without open() it cannot create any new ones. The lack of recv() means it cannot accept new file descriptors over a unix-domain socket either.

This limits the damage a compromised process could do to the system that hosts it. Its authority is limited to whatever actions will be taken on its behalf by the code at the other end of those file descriptors.

Our xsnap vat worker is a good candiate for this sort of isolation. The parent process sends it pass-by-copy "deliveries" to begin each crank. The xsnap worker executes the delivery, updates its internal state, and emits some number of swingset-syscalls (not to be confused with the linux-syscalls that seccomp(2) restricts). Those swingset-syscalls are expressed as more pass-by-copy messages sent over a pipe to the parent process, as are their return values.

A memory safety error in the XS engine (stack overflow, buffer overflow, use-after-free) would allow supposedly-confined JavaScript vat objects to compromise the entire xsnap process. Without something like seccomp(2), each process gets unrestricted access to the entire user account, which would allow it to modify the entire swingset state file (giving it full control over the swingset kernel and all the other vats it manages), as well as any secret access tokens or privileged hardware access avaiable to that user account.

But with the isolation, such memory errors would only give the attacking object control over the same set of authorities as the vat worker as a whole. This includes forging messages from other objects within the same vat (as well as observing their internal secrets, if the vat could keep secrets to begin with, which is not true for chain-based vats). It also gets access to any nondeterminism that was meant to be reserved for the liveslots layer (e.g. access to WeakRef) or the supervisor installed into the worker.

Description of the Design

To implement this, we'd have xsnap's C-based supervisor invoke seccomp() shortly after worker startup. At this point, the file descriptors are already opened, and the worker has switched into "react to messages over the pipe" mode.

We currently have xsnap write its snapshot files directly to disk (the piped message includes a filename to which the snapshot should be written), which obviously has to change, since seccomp(2) forbids open(). We must either send the snapshot data over the same message pipe used for commands and swingset-syscalls, or prepare a second pipe used just for snapshot data. The latter might easier to manage, especially because the xsnap snapshot writing process does not know ahead of time how large the snapshot will be (you give it a write() function and gets called a zillion times with tiny bits of data until the object graph has been traversed). I'm thinking the parent uses the command pipe to say "please write a snapshot", then starts copying any data from the snapshot pipe to a temporary file, counting bytes as it goes. Eventually the child sends a response on the command pipe that says "I finished writing NN bytes to the snapshot pipe", at which point the parent waits until the snapshot pipe's counter reaches that target, then closes and renames the file.

(the alternative would be to add framing to the messages sent over the snapshot pipe, so we could distinguish an EOF frame, but that would add some length-prefix overhead to each write(), and I expect those writes to be numerous and tiny, so the overhead could be significant)

(if we could open a new snapshot pipe for each act of snapshotting, we could use the pipe's EOF event for framing, but under seccomp(2) you can't)

At xsnap startup, we currently use a command-line argument to name a snapshot file to read from. We could keep doing this: we don't invoke seccomp until after we've finished loading the snapshot. For symmetry we could implement it the other way (use the snapshot pipe to write data into the child), but it seems unnecessary.

We might have some diagnostic instrumentation (like elapsed-time measurements) which would need to change under seccomp, because gettimeofday() is also forbidden. Some JS engines call functions like this constantly, but I think XS does not (they're conveniently conservative about platform expectations: the microcontrollers it runs on do not necessarily have a real-time clock at all).

seccomp-bpf

seccomp has a newer mode, in which the caller provides a Berkeley Packet Filter -syntax program, and this program gets to evaluate and approve each syscall. In this mode, we could allow direct writes to snapshot files while denying all other forms of IO. While possible, I'd prefer the strict mode, because:

we'd have to write that logic in BPF (or some language that compiles down into it)
we'd need to include that compiler in our build process
that logic would need to guard against safePrefix/../../../etc/passwd shenanigans
the prefix in question would be a runtime option so the xsnap parent might have to recompile the BPF each time the process starts
even providing a single directory to the child is more authority than it needs, it really only has to write to a single anonymous file

Security Considerations

This ticket is entirely about improving the security of our system by reducing authority given to the vat worker. Unless we manage to introduce a buffer overflow while modifying the C code to invoke seccomp, the result should be strictly safer than before. The biggest risk is accidentally killing the process when it does some benign-but-unexpected operation.

Compatibility Considerations

seccomp() is Linux-specific. If we use it, our validators will be limited to running on Linux kernels. It would be simple to sense the platform we're running on and only enable seccomp if it's available, removing that limitation, although we might prefer to mandate the improved security. We certainly want to enable developers to work on non-Linux platforms, but we could also accomplish that by just not using XS workers during development (which has other benefits, most notably for debugging). OT3H to get accurate metering information (which matches what happens on-chain), there should at least be an option for local development to use XS.

FreeBSD has a similar notion (I think "jail" is the keyword), but I'm not very familiar with it. I have no idea what Windows might do. Linux has a separate feature, misleadingly named "capabilities", which effectively breaks up root into a bitmap of allowed authorities, but linux-capabilites are rather coarse and probably wouldn't help us much. Linux (and some other Unix-ish systems) also has a notion of "namespaces", which can be used to isolate a process to a particular corner of the filesystem, as well as denying it the ability to observe other processes/etc. This could be used for isolation while still allowing limited file access. But our vat worker's platform needs are so modest, it seems appropriate to just lock down everything except the message pipe.

Test Plan

It would be nice to have a unit test which enables the jail, then attempts to make an illegal open() and watch the process get killed. However we'd have to include special code in the worker to expose any such syscalls for the test to exercise, and that code would not serve any other purpose but to enable such a test. It's probably worth doing, but feels a bit weird.

If we elect to allow workers to run without seccomp, we should have a test for that too. This test will probably need to run on a non-Linux CI system to exercise both cases properly.

cc @dckc @michaelfig @erights

dckc commented 3 years ago

... We must either send the snapshot data over the same message pipe used for commands and swingset-syscalls, or prepare a second pipe used just for snapshot data. ...

@warner came up with a simplification in recent discussion:

The parent opens a snapshot file for writing and passes this open fd to the child; when the child gets the "write a snapshot" instruction, it writes to this fd and then exits.

How often do we want / need to write a snapshot and keep going anyway? Maybe never, to start with?

warner commented 2 years ago

XS recently started using the mmap syscall to allocate memory, which means seccomp-1 will no longer work (it allows read, write, and sbrk, but not mmap). That means we'll need something like seccomp-bpf to implement this kind of isolation.

warner commented 2 years ago

@warner came up with a simplification in recent discussion:

The parent opens a snapshot file for writing and passes this open fd to the child; when the child gets the "write a snapshot" instruction, it writes to this fd and then exits.

I'm liking this option more and more. We still need seccomp-bpf to enable the recv call (read gets you data, but recv lets you accept the extra unix-domain sockets), but the BPF code to allow recv and mmap should be easy, and then we don't need BPF code to examine filenames securely.

How often do we want / need to write a snapshot and keep going anyway? Maybe never, to start with?

I think it's the dominant use case, actually. Currently, we exclusively do snapshot-and-keep-going.

warner commented 1 year ago

Two other technologies we might apply:

pledge: restrict the set of syscalls that can be made. unlike seccomp or seccomp-bpf, this is done from within the process, voluntarily giving up authority slightly after startup (but before any JS is executed)
unveil: same but restricts the set of filesystem paths that remain available to the process

Both are not native to linux, though: https://justine.lol/pledge/ is a project which aims to port the OpenBSD feature to Linux, in terms of seccomp-bpf. I think that cane be done from userspace (i.e. the validator operator doesn't need to install a kernel module or something), but I'm not sure. https://docs.kernel.org/userspace-api/landlock.html is probably a kernel module and seems to be what backs unveil on linux, which would be less convenient to deploy.

dckc commented 1 year ago

I just discovered bubblewrap, which seems to provide a pretty thorough sandbox as a parent process. It might save some development time.

OTOH, maybe it would just add complexity/risk.

warner commented 2 weeks ago

testing ideas with @siarhei-agoric : add a compile-time switch which exposes some benign syscall like getpid() to the start compartment, so we can exercise seccomp killing the worker when we send an eval('globalThis.getpid()') to the worker when it's in seccomp mode.

We need to make sure libc doesn't cache the getpid (or find some other benign syscall to use). We do not want to share a powerful syscall like open(), so auditors can evaluate the worst-case "what if the testing mechanism causes a vulnerability?" questions.

dup2 might be an option, it can be used as an attack vector (running out of file handles) but still doesn't enable open().

Note: with vdso, gettimeofday() can still work, despite seccomp, which would enable us to still get timestamps for the meteringResults.

The xsnap process could perform destructive testing of whether seccomp is available by spawning a child process at startup, wait for it to perform some tests (possibly killing itself in the process, by violating the rules), and then enabling/not seccomp itself based on the results.

siarhei-agoric commented 2 weeks ago

We need to make sure libc doesn't cache the getpid (or find some other benign syscall to use).

getpid() cache had been removed starting from glibc 2.25 [1, 2]. However, there is a possibility of having it implemented via vDSO which would also bypass the actual syscall [1]. On the other hand, both cache and vDSO can be avoided by invoking syscall(SYS_getpid) directly [3].

dckc commented 2 weeks ago

my assignment here looks pretty stale. I'm removing it.

Agoric / agoric-sdk