[RFC]: hyperkernel: bfexec

rianquinn commented 7 years ago

NOTE: These changes will be made to the extended_apis repo and not the main Bareflank repo as we want to keep Bareflank simple.

The first component of the hyperkernel to be created will be bfexec. In order to be able to execute applications, we need to be able to load executables into memory, perform whatever ELF transforms are required, and then tell the hypervisor to execute them. There are a couple of scenarios that I would like to focus on with respect to this capability:

Introspection App

One type of application that I would like to support is an introspection application. Imagine an application written in C++ (and for now relying on a modest set of libraries, if any) that is capable of introspecting the host OS (the OS currently installed on the physical hardware). Since Bareflank acts like a VMX rootkit, it can be installed and executed on an existing Windows / Linux install. Using Bareflank, an introspection application can be executed side by side with the existing Windows / Linux install and spy on what it's doing. This could be used to implement a new form of anti-virus or a new reverse engineering capability. In a production environment this application might be booted along with Bareflank via UEFI prior to booting Windows / Linux, or (for example, by a developer) it might need to be executed repeatedly after Windows / Linux has been booted.

Regular Application

Eventually, I would like to see regular applications supported by the hyperkernel. This long-term goal would essentially define a brand new OS model, capable of executing both applications and guest OSes side by side, using the same micro-kernel style vmcall interface. This means applications like [ and ls need to be able to be executed very quickly, otherwise something like a bash script will never work (i.e. caching and shared libraries would need to be supported).

Unikernel

Unikernels today have the added advantage of being able to package up a set of libraries / binaries into a single blob that only relies on a PV interface (and oftentimes a device model) to execute. A great example of this is IncludeOS. An application like this should be simple to execute: load the binary into memory, tell the hypervisor it has a contiguous view of guest physical memory, provide a start address, and push the big red start button. Initially I would like to limit our support to IncludeOS (the guys from this project have been great @RicoAntonioFelix / @alfred-bratterud), as we don't have plans to support a device model (and I have been told IncludeOS could be modified to only use a PV interface).

Eventually, we should be able to modify the Unikernel environment to boot more complicated OSes like Linux / Windows by providing existing kernels with a guest UEFI environment. The PV interfaces would exist in the form of UEFI protocols, very similar to the approach Bhyve is taking. So long as more complicated OSes / Unikernels can exist with only UEFI / PV, it should work without the need for QEMU.

Linux Kernel

The Linux kernel is basically an ELF binary, and thus it should be able to be executed like any other VM application being described here. Instead of libc encapsulating the PV interface like a vm app, the kernel would contain PV drivers that provide support for disk, networking, etc...

Design

To support all of these different types of applications, something will have to be in charge of loading an ELF binary (or blob in the case of a Unikernel) into memory, performing any needed ELF transforms (like relocations), and then telling the hypervisor to start the new application. There are two different ways this application could be started.

Type 2

In the type 2 case, bfexec is like any other application running on the host OS. It's written in C++, which means that when it runs fstream commands, the libc that comes with the host OS provides bfexec with the needed JSON files for what to load, as well as the modules to load. Libc also provides sbrk, and the host OS is capable of providing heap memory as normal. The application would use the existing Bareflank ELF loader to load an executable into heap memory and then perform the needed ELF transforms. Finally, it would provide the hypervisor with a memory descriptor list of what to load, the permissions each region of memory needs, and the start location. The hyperkernel would then prepare its internal structs (a topic for another RFC) and run the application in its own VM.
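A minimal sketch of what that hand-off could look like, assuming the descriptor list is built from the ELF's PT_LOAD program headers. The names memory_descriptor, load_request and build_load_request are hypothetical and are not part of Bareflank's ELF loader; the program header layout and the PT_LOAD / PF_* values follow the ELF-64 format.

```cpp
#include <cstdint>
#include <vector>

// ELF-64 program header layout (one entry per segment).
struct elf64_phdr {
    std::uint32_t p_type;
    std::uint32_t p_flags;
    std::uint64_t p_offset;
    std::uint64_t p_vaddr;
    std::uint64_t p_paddr;
    std::uint64_t p_filesz;
    std::uint64_t p_memsz;
    std::uint64_t p_align;
};

constexpr std::uint32_t pt_load = 1;  // PT_LOAD: segment must be mapped
constexpr std::uint32_t pf_x = 1;     // PF_X: executable
constexpr std::uint32_t pf_w = 2;     // PF_W: writable
constexpr std::uint32_t pf_r = 4;     // PF_R: readable

// Hypothetical descriptor: what to map and with which permissions.
struct memory_descriptor {
    std::uint64_t virt_addr;
    std::uint64_t size;
    bool read;
    bool write;
    bool exec;
};

// Hypothetical request handed to the hypervisor: the list plus the start location.
struct load_request {
    std::vector<memory_descriptor> mdl;
    std::uint64_t entry;
};

// Build the request from the program headers, skipping non-loadable segments.
load_request build_load_request(const std::vector<elf64_phdr>& phdrs, std::uint64_t entry)
{
    load_request req{{}, entry};
    for (const auto& ph : phdrs) {
        if (ph.p_type != pt_load) {
            continue;
        }
        req.mdl.push_back({ph.p_vaddr,
                           ph.p_memsz,
                           (ph.p_flags & pf_r) != 0,
                           (ph.p_flags & pf_w) != 0,
                           (ph.p_flags & pf_x) != 0});
    }
    return req;
}
```

The size comes from p_memsz rather than p_filesz so that zero-initialized data (.bss) is covered by the mapping.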

One question that will be addressed in a future RFC is how to deal with interrupts. The very first application will be a simple "hello world" app with interrupts disabled, which means that bfexec will execute the app, the app will finish, and execution and queued interrupts will be resumed in the host OS. Once that is working, we will need another RFC to discuss how to interrupt the app, hand control back to the host OS, and then resume execution of the application. Even more complicated topics include threading.

Type 1

In the type 1 case, the bootloader will load the various VMs that are needed to execute the system as a whole. We can view these as system critical VMs. One of these system critical VMs is bfexec, as it's really the initial shell that starts execution of the remaining applications / VMs. Its libc is no longer provided by the host OS, but instead is provided by Bareflank, and things like read / write and sbrk perform vmcalls to other system services like a filesystem or RAM VM. The same code that runs in the type 2 case will work in the type 1 case by replacing the libc underneath it. The difference is that in the type 2 case libc is performing syscalls, while in the type 1 case libc is performing vmcalls and using micro-kernel style IPC to talk to the needed system service. One important note here is that this IPC must be designed in such a way that it can be used by both applications and OSes using PV drivers. This way, both VM apps and OSes can execute side by side.
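To illustrate the "same code, different libc" idea, here is a rough sketch. It models the swap with a function-pointer table only so that both variants fit in one file; the RFC describes replacing libc at link time instead. Everything here (libc_backend, ipc_message, op_write, the recording stand-ins) is hypothetical, and neither stand-in performs a real syscall or vmcall.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical IPC message a type 1 libc might send to a filesystem service VM.
struct ipc_message {
    std::uint64_t opcode;
    int fd;
    std::string payload;
};

constexpr std::uint64_t op_write = 1;  // hypothetical opcode

// Stand-in for "the libc underneath" the application.
struct libc_backend {
    long (*write)(int fd, const char* buf, std::size_t len);
};

// Records what the type 2 backend would have passed to the host OS.
inline std::string& host_output()
{
    static std::string out;
    return out;
}

// Records what the type 1 backend would have sent over IPC.
inline std::vector<ipc_message>& ipc_queue()
{
    static std::vector<ipc_message> queue;
    return queue;
}

// Type 2 stand-in: a real libc would issue a host syscall here.
long type2_write(int, const char* buf, std::size_t len)
{
    host_output().append(buf, len);
    return static_cast<long>(len);
}

// Type 1 stand-in: a real libc would issue a vmcall carrying this message.
long type1_write(int fd, const char* buf, std::size_t len)
{
    ipc_queue().push_back({op_write, fd, std::string(buf, len)});
    return static_cast<long>(len);
}

// Application code: identical in both environments.
long say_hello(const libc_backend& libc)
{
    const std::string msg = "hello world\n";
    return libc.write(1, msg.data(), msg.size());
}
```

Calling say_hello with a table holding type2_write or type1_write exercises the same application code against either backend.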

One interesting result of this concept is the flexibility developers will have when creating VM apps. For example, suppose you're writing a new anti-virus that uses VM introspection. You will likely want Bareflank and the VM app to boot prior to Windows (i.e. the type 1 case). The problem is... that's a terrible way to develop the code. Thanks to Bareflank (once the UEFI work is done), you will be able to work on the VM app in the type 2 environment, starting and stopping it over and over until you get it just right, and then deploy it to your customers in the type 1 environment.

Proposed APIs

To start, I would like to propose the following APIs:

bfexec app args

bfexec app args: Loads the ELF binary that is provided, and executes it, passing the provided args to the int main(int argc, const char *argv[]) function in the ELF binary.
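A small sketch of the argument handling this implies, assuming the guest's argv[0] should be the app name, as it would be for a normally launched program. exec_request and parse_bfexec_args are hypothetical names, not existing bfexec code.

```cpp
#include <string>
#include <vector>

// Hypothetical result of parsing "bfexec app args".
struct exec_request {
    std::string app;                      // ELF binary to load
    std::vector<std::string> guest_argv;  // what the app's main() should see
};

// Returns false when no app was given on the command line.
bool parse_bfexec_args(int argc, const char* const argv[], exec_request& out)
{
    if (argc < 2) {
        return false;
    }
    out.app = argv[1];
    // Drop bfexec's own argv[0]; the app name becomes the guest's argv[0].
    out.guest_argv.assign(argv + 1, argv + argc);
    return true;
}
```

With the command line `bfexec my_app arg1`, the guest would see argc == 2 with argv[0] == "my_app" and argv[1] == "arg1".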

Thoughts?

rianquinn commented 7 years ago

Turns out that IncludeOS already supports being loaded as an ELF binary, so for now, we will skip the "blob" part, and just focus on the ELF binary part as this will support everything minus Windows. Once it comes time to support Windows, we can add blob support then.

alfreb commented 7 years ago

Would you consider multiboot? It turned out to be surprisingly simple, at least from our perspective. Essentially you'd provide a magic number and a pointer to some structs to our start. The most useful aspect for us is the memory map info provided in those structs, as well as command line parameters. From the ELF perspective it's just a small extra header telling you how to boot it.

rianquinn commented 7 years ago

@alfred-bratterud I'll have to look at it, but I don't think that's going to be a huge issue. I'm going to want to see an ELF binary. One part of that will be what the "start" symbol is, which we can define to be anything. Correct me if I'm wrong, but basically in your case, instead of me running start, I would set up a MultiBoot environment, and then run from there. If that's the case then it should be pretty simple.

One question I have is, how would the ctors/dtors and eh_frame sections be handled? Normally for vm apps, I will be linking our C runtime code into the library, and then I would include some code to run them before and after main(). If I am running MultiBoot how is this done? Or do you guys handle that yourself with something like a linker script?
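For reference, the "run them before and after main()" wrapper for vm apps could be sketched roughly as below. This assumes init_array / fini_array style ordering (constructors first to last, destructors last to first); run_with_crt and the function-pointer vectors are hypothetical stand-ins for what the CRT code would read out of the ELF sections.

```cpp
#include <vector>

using init_fn = void (*)();
using main_fn = int (*)(int, const char**);

// Run global constructors, then main(), then global destructors.
int run_with_crt(const std::vector<init_fn>& ctors,
                 const std::vector<init_fn>& dtors,
                 main_fn entry,
                 int argc,
                 const char** argv)
{
    // Constructors run in array order before main().
    for (init_fn fn : ctors) {
        fn();
    }

    const int ret = entry(argc, argv);

    // Destructors run in reverse array order after main() returns.
    for (auto it = dtors.rbegin(); it != dtors.rend(); ++it) {
        (*it)();
    }
    return ret;
}
```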

Also... how are args passed in? Once again, in the case of an executable, I will be placing the args on the stack prior to running so that they are there and waiting for the executable's main... how is that handled with MultiBoot?

rianquinn commented 7 years ago

One change that I would like to make to the RFC here is to get rid of the need for the config.json. Instead, I would like bfexec to read the needed information from the ELF file. I'm not sure if a non-os version of the cross compilers will support this, but if it does, we could read the dependencies from the ELF binary, and then get the args from the command line args being passed to bfexec. In other words... you would run vm apps using the following:

bfexec my_app arg1 -oarg1 ...

If the executable that we are provided is not an ELF file, we could detect that, and handle it as needed.
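Detecting that could be as simple as checking the ELF identification bytes at the start of the file. A minimal sketch (classify_image and image_kind are hypothetical names; the magic bytes and the EI_CLASS values come from the ELF format):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class image_kind { not_elf, elf32, elf64, elf_unknown_class };

// Inspect e_ident: bytes 0..3 are the magic, byte 4 (EI_CLASS) is 1 for 32-bit, 2 for 64-bit.
image_kind classify_image(const std::vector<std::uint8_t>& image)
{
    static constexpr std::uint8_t magic[4] = {0x7F, 'E', 'L', 'F'};
    if (image.size() < 5) {
        return image_kind::not_elf;
    }
    for (std::size_t i = 0; i < 4; ++i) {
        if (image[i] != magic[i]) {
            return image_kind::not_elf;
        }
    }
    switch (image[4]) {
        case 1: return image_kind::elf32;
        case 2: return image_kind::elf64;
        default: return image_kind::elf_unknown_class;
    }
}
```

Anything classified as not_elf could then be rejected or routed to whatever handles blobs.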

alfreb commented 7 years ago

Hi @rianquinn sorry for the late reply! With multiboot - all that is from my perspective is information passing between bootloader / hypervisor and kernel / binary. We provide the multiboot header, which tells you what you need to know in order to boot us - e.g. location of _start. Once booted, you provide an in-memory struct to our binary containing useful information such as a memory map.
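As a rough sketch of the loader side, assuming the Multiboot 1 layout (the thread does not say which version is meant): the header sits within the first 8192 bytes of the image, is 4-byte aligned, starts with the magic 0x1BADB002, and its magic, flags and checksum fields sum to zero modulo 2^32. find_multiboot_header is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::uint32_t multiboot_header_magic = 0x1BADB002;
constexpr std::size_t multiboot_search_limit = 8192;

// First three fields of a Multiboot 1 header.
struct multiboot_header_fields {
    std::uint32_t magic;
    std::uint32_t flags;
    std::uint32_t checksum;
};

// Returns the byte offset of the header, or -1 if none is found.
// The fields are read in host byte order, so this assumes a little-endian host (as on x86).
std::ptrdiff_t find_multiboot_header(const std::vector<std::uint8_t>& image)
{
    const std::size_t limit = std::min(image.size(), multiboot_search_limit);
    for (std::size_t off = 0; off + sizeof(multiboot_header_fields) <= limit; off += 4) {
        multiboot_header_fields h;
        std::memcpy(&h, image.data() + off, sizeof(h));
        const std::uint32_t sum = h.magic + h.flags + h.checksum;  // wraps mod 2^32
        if (h.magic == multiboot_header_magic && sum == 0) {
            return static_cast<std::ptrdiff_t>(off);
        }
    }
    return -1;
}
```

Once a header is found, a Multiboot 1 loader enters the image with the magic 0x2BADB002 in EAX and the address of the information struct (memory map, command line, and so on) in EBX.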

rianquinn commented 7 years ago

I updated the proposed APIs to reflect the comments. Loading a binary blob will likely be handled by something other than bfexec, and will likely not be supported until we tackle Windows guests. For now, bfexec will be limited to executing ELF binaries.

The only thing I don't have sorted out is ring 0 vs ring 3. The first code dump will use ring 3, but IncludeOS and other unikernels will need ring 0.

To sort this out... for now, if I don't see bfcrt in the binary, I will load the ELF binary with ring 0, and I will not load (because I cannot) the CRT code, which should be fine since IncludeOS and others handle this on their own. The goal here is to simply accept an ELF binary from IncludeOS that they build with their own build system, and run it, so we will need to have something in our vmapps that says... this is a vmapp, and not a normal VM.
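In code, that rule might look something like the sketch below. How bfcrt would actually be detected is not settled in this thread; the sketch assumes it shows up by name in a list of strings pulled from the binary (for example its dependencies), and launch_plan / plan_launch are hypothetical names.

```cpp
#include <algorithm>
#include <string>
#include <vector>

enum class privilege { ring0, ring3 };

// Hypothetical outcome of inspecting the binary.
struct launch_plan {
    privilege ring;  // privilege level to start the binary at
    bool load_crt;   // whether bfexec loads the CRT code for it
};

// "names" is an assumed list of names extracted from the ELF binary.
launch_plan plan_launch(const std::vector<std::string>& names)
{
    const bool is_vmapp = std::any_of(names.begin(), names.end(), [](const std::string& n) {
        return n.find("bfcrt") != std::string::npos;
    });

    // vm app: ring 3 with our CRT. Anything else (e.g. a unikernel): ring 0, no CRT.
    return is_vmapp ? launch_plan{privilege::ring3, true}
                    : launch_plan{privilege::ring0, false};
}
```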

@alfred-bratterud Thoughts?

rianquinn commented 7 years ago

The initial version of this is done. There is still a ton of work to do here, but if you want to see a VM app working, feel free to give the hyperkernel repo a try.
