Create "encasement" library for new 3i interfaces

JustinCappos commented 1 month ago

We need a library for the 3i interface in RustPOSIX, like the encasement library for RepyV2. In our case it is simpler, since memory isn't naturally shared across cages. Thus, you don't have to worry about TOCTTOU, because the caller and callee don't have the same bits in memory.

I propose the creation of a library which makes sure that calls are routed to the correct place, the correct arguments are copied, etc. My current thinking is that you define some sort of bitmask, as I do below, to indicate what behavior you want for each arg. Importantly, this also lets you skip copying arguments that should not be copied. You would define an interface you wish to provide.

For example, suppose that you wanted to track the amount of data written to files by the write system call. From a pseudocode standpoint, a simple bit of code using this library could look like this:

// From the encasement setup.  I don't need to pass the args types, etc. along because these are standard and don't change. 
// I need a good way to indicate what should happen for each.  I have things like COPY_ALL_ARGS and PASS_THROUGH_ALL
// but likely there needs to be some bitmask or similar.
...
[my_open_syscall, COPY_ALL_ARGS],   // int open(const char *pathname, int flags, ...mode_t mode), do the default for each arg type
[my_write_syscall, PASS_THROUGH_POINTERS]   // ssize_t write(int fd, const void buf[.count], size_t count), pass through the second argument (don't copy it or fill it in)
...

// Bytes written so far, keyed by pathname
filecountmap = HashMap<string, int>;
// Which pathname each open fd refers to (write() only receives the fd)
fdtopath = HashMap<int, string>;

int my_open_syscall(const char *pathname, int flags, ...mode_t mode) {

   // This is the open call that is provided to my cage
   int fd = open(pathname, flags, mode);

   // Only start tracking if the open succeeded
   if (fd >= 0) {
      fdtopath[fd] = pathname;
      filecountmap[pathname] = 0;
   }

   return fd;
}

ssize_t my_write_syscall(int fd, const void *junk, size_t count) {
  // Note that junk will be set to NULL because it isn't needed. This means that buf isn't copied.

  // This is the write call that is provided to my cage
  ssize_t amount_written = write(fd, NULL, count);  // The second arg is ignored here due to pass through.

  // If the call returned an error (-1), don't update the count
  if (amount_written > 0) filecountmap[fdtopath[fd]] += amount_written;

  return amount_written;
}
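The per-argument bitmask idea could be sketched in Rust roughly as follows. This is only an illustration under stated assumptions: the `ArgPolicy` type, the six-argument limit, and the helper names are hypothetical and not part of any existing 3i or RustPOSIX API. Skipped arguments are replaced by 0 (NULL), as with `junk` in the write example.

```rust
/// Hypothetical per-argument policy: bit i set means "copy argument i".
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ArgPolicy(pub u8);

/// Assumed maximum number of syscall arguments for this sketch.
pub const MAX_ARGS: usize = 6;

impl ArgPolicy {
    /// Copy every argument (the COPY_ALL_ARGS idea).
    pub const COPY_ALL_ARGS: ArgPolicy = ArgPolicy(0b0011_1111);
    /// Copy nothing; the handler only sees placeholders.
    pub const PASS_THROUGH_ALL: ArgPolicy = ArgPolicy(0);

    /// Build a policy that copies everything except the listed positions.
    pub fn copy_all_except(skipped: &[u8]) -> ArgPolicy {
        let mut bits = Self::COPY_ALL_ARGS.0;
        for &i in skipped {
            if (i as usize) < MAX_ARGS {
                bits &= !(1u8 << i);
            }
        }
        ArgPolicy(bits)
    }

    /// Should the argument at `index` be copied into the handler's cage?
    pub fn should_copy(self, index: usize) -> bool {
        index < MAX_ARGS && (self.0 >> index) & 1 == 1
    }
}

/// Apply a policy to raw argument words: copied arguments keep their value,
/// skipped ones are replaced by 0 (NULL).
pub fn apply_policy(policy: ArgPolicy, args: [u64; MAX_ARGS]) -> [u64; MAX_ARGS] {
    let mut out = [0u64; MAX_ARGS];
    for i in 0..MAX_ARGS {
        if policy.should_copy(i) {
            out[i] = args[i];
        }
    }
    out
}
```

For write, `ArgPolicy::copy_all_except(&[1])` would keep `fd` and `count` but hand the handler a NULL in place of `buf`.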

Things to ponder about this:

1) How do we handle varargs?

2) How do we handle errno? It likely should just be passed through

3) How do I make it hard for people to accidentally add a call in the wrong place?

4) Should this get imported everywhere and turn into preprocessor statements that get resolved out before runtime?

5) Should they be able to override a non-passed parameter by passing in a value?

6) If they call a function more than once, does the non-passed parameter only get added the first time?

JustinCappos commented 1 month ago

@rennergade @Yaxuan-w @yzhang71 We should talk about this at some point before whoever writes this code implements it (likely me).

rennergade commented 1 month ago

Yeah that makes sense. We should come up with some decisions before the WASM migration gets too far.

JustinCappos commented 1 month ago

I've had a chance to digest this and mull it over a bit more. The real question we are facing is where the argument serialization and memory copying should be performed.

For security reasons we cannot give a cage access to another cage's memory (at least without doing some trickery with using shared memory as a communication buffer, etc.). Hence when we do this, some bit of trusted code (likely in a separate cage, possibly in the microvisor) will need to serialize and copy data across. For most data types this is really simple, but complex data types require a lot more hand holding. In the case of the POSIX interface, we can define all of the messy data types and have handlers for them. If we think ahead to converting calls between libraries into IPCs, then this becomes more tricky because the user can literally define whatever custom data types they want. I'm going to punt on the latter case for now.

Okay, so now it's clear that we need to serialize information somewhere. I think that RustPOSIX will need to read the memory of the cage anyway, so it is fine for it to copy this information (as needed). The glibc side of things can straightforwardly just pass the pointers, etc. over, and trusted code outside of glibc can do all the copying.

For the eventual 3i, we likely will use a similar mechanism for intercage calling, with 3i bridging the connection and copying things as needed. Fortunately, all of the calls provide a buffer for where to put information in the receiving cage. Calls like read(), which put information into a provided buffer, need a little handling, but we likely can do this in a library that is imported by all POSIX-providing code... It does not seem that conceptually complicated (at least right now).

JustinCappos commented 1 month ago

One minor gotcha (which I would have hoped we would have thought of) is that pointers do need to be properly aligned. So our checks not only need to check for the memory region, but also for alignment. https://doc.rust-lang.org/reference/behavior-considered-undefined.html

My understanding is that if someone were to pass a reference to a 32-bit integer from C which wasn't properly aligned in memory, we'd need to copy that to a correctly aligned i32 in Rust before using it. Given how C works, I'm not sure this is even possible, but there should be some unit tests for 3i that look at corner cases here.
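A minimal sketch of the "copy to an aligned i32 first" approach, assuming a cage's memory can be modeled as a byte slice. The function names are hypothetical. Copying the bytes out and rebuilding the integer avoids ever creating a misaligned `&i32`, which Rust treats as undefined behavior.

```rust
use std::mem::{align_of, size_of};

/// Read an i32 stored at `offset` inside a cage's memory (modeled here as
/// a byte slice). Returns None if the four bytes are not fully in bounds.
pub fn read_i32_checked(cage_mem: &[u8], offset: usize) -> Option<i32> {
    // checked_add guards against offset + 4 overflowing usize.
    let end = offset.checked_add(size_of::<i32>())?;
    let bytes = cage_mem.get(offset..end)?;
    // Copy into a local array, then build the value. The local i32 is
    // always correctly aligned, whatever the source address was.
    let mut buf = [0u8; size_of::<i32>()];
    buf.copy_from_slice(bytes);
    Some(i32::from_ne_bytes(buf))
}

/// Alignment check on a raw address, for cases where the caller wants to
/// reject misaligned pointers outright instead of copying.
pub fn is_aligned_for_i32(addr: usize) -> bool {
    addr % align_of::<i32>() == 0
}
```

Unit tests for the corner cases could feed this odd offsets, offsets at the very end of the region, and `usize::MAX`.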

My very limited understanding based on partially completing the Rust tutorial is that we should not run into other issues.

JustinCappos commented 1 month ago

Note: I'm going to talk about types as though we were doing this in Python using dicts because I know that language well enough to really say the "right" way to implement this. I think there is a solid argument to be made for using a HashMap in Rust, but I'm not comfortable enough with the language yet to really make that call.

I've thought some more about this and I think the basic architecture should work as follows:

A cage will make a system call by passing (syscallnum, arglist) to 3i.

There is a 3i module which lives in RustPOSIX. It has a syscall_3i dict which is keyed by the cage ID. The value for each entry in this dict is likely an array / list / tuple and I will call this the syscallhandlertable. Regardless, the position / key is the system call number you'd like to make. The value is the function to call. So, 3i will do syscall_3i[cageid][syscallnum](arglist)
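In Rust, the dict-of-tables lookup might look roughly like the sketch below. Everything here is an assumption for illustration: the `Syscall3i` struct, the handler signature, and returning `-ENOSYS` (38 on Linux) for a missing cage or entry are not decided anywhere in this thread.

```rust
use std::collections::HashMap;

pub type CageId = u64;
/// Assumed handler signature: calling cage plus raw argument words.
pub type SyscallHandler = fn(CageId, &[u64]) -> i64;

/// Linux value of ENOSYS, used here for "no handler registered".
pub const ENOSYS: i64 = 38;

/// syscall_3i: cage ID -> syscallhandlertable (indexed by syscall number).
pub struct Syscall3i {
    tables: HashMap<CageId, Vec<Option<SyscallHandler>>>,
}

impl Syscall3i {
    pub fn new() -> Self {
        Syscall3i { tables: HashMap::new() }
    }

    /// Install (or replace) the handler table for a cage.
    pub fn register(&mut self, cage: CageId, table: Vec<Option<SyscallHandler>>) {
        self.tables.insert(cage, table);
    }

    /// Equivalent of syscall_3i[cageid][syscallnum](arglist), with the
    /// unknown-cage and unknown-syscall cases mapped to -ENOSYS.
    pub fn make_syscall(&self, cage: CageId, syscallnum: usize, arglist: &[u64]) -> i64 {
        let handler = self
            .tables
            .get(&cage)
            .and_then(|table| table.get(syscallnum))
            .copied()
            .flatten();
        match handler {
            Some(h) => h(cage, arglist),
            None => -ENOSYS,
        }
    }
}

/// Toy handler that pretends every write succeeds in full.
pub fn example_write(_cage: CageId, arglist: &[u64]) -> i64 {
    arglist.get(2).copied().unwrap_or(0) as i64
}
```

A `Vec` indexed by syscall number keeps lookup cheap; a nested `HashMap` would also work if syscall numbers turn out to be sparse.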

The individual calls in RustPOSIX will check arg values / convert arg types as is needed and use a helper function in 3i to clone, memcpy, etc. data as is needed. This should be separated out from the calls themselves so that it could later be used by other code as a library. So, there should be a call to an actual open(...) syscall handler in RustPOSIX which looks quite normal with respect to types at the end of this process. The code before this should be separable.

How this connects with 3i

Our goal is to let a cage interpose on system calls and adapt them for their own purposes. (This exists in SafePOSIX, but not our current codebase yet.)

To do this, suppose we add four new system calls: icmemcpy(dest, src, len, srccageid), icstrncpy(dest, src, len, srccageid), getsyscallhandlertable(), forkinterpose(syscallhandlertable). The icmemcpy/icstrncpy calls are basically just safe ways to copy information across cages. The getsyscallhandlertable() call is really simple and just returns a copy of the syscallhandlertable for the cage. The forkinterpose(syscallhandlertable) syscall is identical to fork(), except that it also takes in a syscallhandlertable. This table will be used as the child cage's entry in the syscall_3i dict.
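A rough sketch of what bounds-checked cross-cage copies could look like, assuming each cage's memory is modeled as a `Vec<u8>` and addresses are offsets into it. The `CageMemory` type, the explicit destination cage argument (in the proposal the destination is implicitly the caller), and the error type are all hypothetical. Unlike C's strncpy, this `icstrncpy` does not pad with NUL bytes.

```rust
use std::collections::HashMap;
use std::ops::Range;

pub type CageId = u64;

#[derive(Debug, PartialEq, Eq)]
pub enum CopyError { NoSuchCage, OutOfBounds }

/// Toy model: every cage owns one flat byte array.
pub struct CageMemory { mem: HashMap<CageId, Vec<u8>> }

fn checked_range(total: usize, start: usize, len: usize) -> Result<Range<usize>, CopyError> {
    let end = start.checked_add(len).ok_or(CopyError::OutOfBounds)?;
    if end > total { return Err(CopyError::OutOfBounds); }
    Ok(start..end)
}

impl CageMemory {
    pub fn new() -> Self { CageMemory { mem: HashMap::new() } }

    pub fn add_cage(&mut self, id: CageId, size: usize) { self.mem.insert(id, vec![0; size]); }

    /// Test helper: write bytes directly into one cage.
    pub fn poke(&mut self, cage: CageId, at: usize, bytes: &[u8]) -> Result<(), CopyError> {
        let m = self.mem.get_mut(&cage).ok_or(CopyError::NoSuchCage)?;
        let r = checked_range(m.len(), at, bytes.len())?;
        m[r].copy_from_slice(bytes);
        Ok(())
    }

    /// Test helper: read bytes directly out of one cage.
    pub fn peek(&self, cage: CageId, at: usize, len: usize) -> Result<Vec<u8>, CopyError> {
        let m = self.mem.get(&cage).ok_or(CopyError::NoSuchCage)?;
        let r = checked_range(m.len(), at, len)?;
        Ok(m[r].to_vec())
    }

    /// Copy `len` bytes from (srccage, src) to (destcage, dest), validating
    /// both ranges before anything is written.
    pub fn icmemcpy(&mut self, destcage: CageId, dest: usize,
                    srccage: CageId, src: usize, len: usize) -> Result<usize, CopyError> {
        let staged = self.peek(srccage, src, len)?;
        self.poke(destcage, dest, &staged)?;
        Ok(len)
    }

    /// Copy at most `len` bytes, stopping after the first NUL terminator.
    pub fn icstrncpy(&mut self, destcage: CageId, dest: usize,
                     srccage: CageId, src: usize, len: usize) -> Result<usize, CopyError> {
        let m = self.mem.get(&srccage).ok_or(CopyError::NoSuchCage)?;
        if src > m.len() { return Err(CopyError::OutOfBounds); }
        let avail = &m[src..];
        let window = &avail[..len.min(avail.len())];
        let n = match window.iter().position(|&b| b == 0) {
            Some(pos) => pos + 1,                 // include the terminator
            None if window.len() == len => len,   // no NUL within len bytes
            None => return Err(CopyError::OutOfBounds), // ran off the end of memory
        };
        self.icmemcpy(destcage, dest, srccage, src, n)
    }
}
```

Staging the bytes in a temporary buffer costs an extra copy, but it sidesteps borrowing the source and destination cages at the same time and keeps the validation in one place.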

To implement a cage which interposes, you would get your syscallhandlertable, modify the calls you want to change to point to functions you implement, and then fork your child with forkinterpose, passing in your modified syscallhandlertable.

For safety reasons, any entry in forkinterpose's syscall table that differs from the caller's syscallhandlertable needs to be checked to ensure it is in the caller's address space or is in the caller's syscallhandlertable already.
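The safety check on the table passed to forkinterpose could be sketched as below. This assumes a handler entry can be described by the cage that owns the code plus an address, and that "in the caller's address space" can be approximated by comparing the owning cage ID; both `HandlerRef` and `validate_table` are hypothetical names.

```rust
/// Hypothetical description of a handler: which cage owns the code and
/// where it lives in that cage.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct HandlerRef {
    pub cage: u64,
    pub addr: usize,
}

/// Check a proposed child table against the caller's own table.
/// Returns Err(syscallnum) for the first entry the caller may not install.
pub fn validate_table(
    caller: u64,
    caller_table: &[HandlerRef],
    proposed: &[HandlerRef],
) -> Result<(), usize> {
    for (num, entry) in proposed.iter().enumerate() {
        // Entries that are unchanged need no further checks.
        let unchanged = caller_table.get(num) == Some(entry);
        // A changed entry may point at the caller's own code...
        let owned_by_caller = entry.cage == caller;
        // ...or at any handler the caller could already reach.
        let already_reachable = caller_table.contains(entry);
        if !(unchanged || owned_by_caller || already_reachable) {
            return Err(num);
        }
    }
    Ok(())
}
```

A real implementation would also need to verify that the address is a valid entry point within the caller's memory, which this sketch leaves out.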

Note that the call from 3i should be made to the syscallname(args) part of the call inside the cage. Of course, the cross-memory copying, etc. must be handled by calling into 3i / RustPOSIX's safememcpy for safety reasons.

Note also that the interface should be arglist instead of automatically having the args, etc. copied over. This is because for many types of interposition, you simply will not care about the actual args themselves. So the caller can freely mutate or ignore these. For example, if I'm trying to write an interposing cage which tracks the amount of data written, there isn't any reason to copy those bytes into my cage's memory. I just want to know the amount.
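As an illustration of why an uncopied arglist is enough for this kind of interposition, here is a sketch of a write-tracking handler that never touches the buffer contents. The `WriteCounter` type and the `[fd, buf, count]` argument layout are assumptions; `lower_write` stands in for whatever write call is provided to the interposing cage.

```rust
use std::collections::HashMap;

/// Tracks bytes written per file descriptor without copying any data.
pub struct WriteCounter {
    bytes_by_fd: HashMap<u64, u64>,
}

impl WriteCounter {
    pub fn new() -> Self {
        WriteCounter { bytes_by_fd: HashMap::new() }
    }

    /// arglist is assumed to be [fd, buf address in the calling cage, count].
    /// The buffer address is forwarded untouched and never dereferenced here.
    pub fn on_write(
        &mut self,
        arglist: &[u64; 3],
        lower_write: impl Fn(&[u64; 3]) -> i64,
    ) -> i64 {
        let written = lower_write(arglist);
        // Errors (negative results) and zero-length writes do not count.
        if written > 0 {
            *self.bytes_by_fd.entry(arglist[0]).or_insert(0) += written as u64;
        }
        written
    }

    /// Total bytes successfully written through this handler for `fd`.
    pub fn total(&self, fd: u64) -> u64 {
        self.bytes_by_fd.get(&fd).copied().unwrap_or(0)
    }
}
```

The handler only needs the return value of the lower call, so no bytes ever have to be copied into the interposing cage's memory.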

However, some libraries will want to do the copying of args, etc. and it's a good idea to help out the programmer. This is why making the above code in 3i into something that can be imported and used is useful.

rennergade commented 1 month ago

Thanks Justin, I was going to ask for more clarification. Starting to understand a bit more but still some confusion. I think a diagram of where all these components live especially in relation to the trusted/untrusted boundary and the rest of RustPOSIX would be really helpful.

JustinCappos commented 1 month ago

Okay, @Yaxuan-w, can you take a cut at the diagram based upon the slack channel discussion? I can help to iterate with you...

Yaxuan-w commented 1 month ago

I'm going to do an initial draft of the diagram and we can review it together at your convenience.

JustinCappos commented 1 month ago

In the meantime, I've been prototyping different designs here and ran into a few different corner cases of concern. Note: I'm going to call a cage that is below another cage a "grate", to make naming easier.

The first set of questions are around error handling / exiting:

It's not clear how to handle errors in a grate. Should a grate which receives a bad system call (e.g., invalid args, etc.) from a cage terminate the cage? If so, how?
What if a grate itself has an error and needs to exit? Should the cage above it exit? If not, what happens to the system call that grate was handling? If the cage above it exits, do all children of that cage also exit? What if one of these children didn't even use the grate (e.g., the system call the grate is handling is blocked)? Does it need to die?

The second set of problems are related to how grates and cages are connected.

Can a cage have multiple grates? How is this handled? I'm assuming that each cage uses the normal system call API from a grate beneath it because the code is legacy and clearly will not change. (Note, we need to consider library decomposition more carefully, since this will use the same mechanism.)
How do we refer to the system calls a grate provides? If we use syscall numbers, what happens when we want to have both an in-memory file system and a regular file system be accessible to a grate which then filters them out into separate calls? How do you tell what calls a grate provided to you if it is just using a syscall number? How do we handle inter-cage calls cleanly given they will have unknown types and unknown names? Should there be some way to pass the ability to call a grate to a cage (like capability passing in Amoeba)?

I'm not saying we want to use it as a model, but for context: RepyV2 largely sidesteps the naming issue by passing a dictionary into the namespace of the module above it. The typical thing to do in a grate is to choose a new name if you don't want to override behavior, and then either have a grate higher up decide how to reconcile the request or expect the higher-up code to handle it. Since interposition was between bespoke interfaces and below the POSIX layer, this was messy, but it worked. There is usually only one stack of grates in the system, so we avoided most other issues. Erroring in a grate usually implies exiting the whole system. There isn't a built-in way to pass a reference to another grate in RepyV2, but it is easy to add into a grate, since all you're doing is passing a function pointer to something else.

With 3i's isolation, things are much more separated due to namespacing, etc. This has pros and cons, but does mean we can handle failures in a more granular way, while also making any naming / reference passing something the 3i trusted code needs to be involved in.

Lind-Project / lind_project

Create "encasement" library for new 3i interfaces #367