Bareflank / hypervisor

lightweight hypervisor SDK written in C++ with support for Windows, Linux and UEFI

[RFC]: hyperkernel: high level architecture #379

Closed rianquinn closed 7 years ago

rianquinn commented 7 years ago

NOTE: These changes will be made to the hyperkernel repo and not the main Bareflank repo as we want to keep Bareflank simple. Also note that the hyperkernel will leverage / extend the APIs provided by the extended_apis repo.

The goal of the hyperkernel is to provide Bareflank with guest VM support. Unlike traditional hypervisors, however, the hyperkernel will implement a micro-kernel architecture, leveraging virtualization to execute not only traditional virtual machines (e.g. Windows, Linux, etc...), but also unikernels like IncludeOS and regular applications (called vm-apps), much like a traditional micro-kernel. In other words, the hyperkernel will bridge the gap between traditional hypervisors and operating systems to create a new kernel architecture that is capable of executing everything in one heterogeneous environment.

Proposed Design:

Like a micro-kernel, the most security critical component of the hyperkernel is the code responsible for maintaining isolation. This includes isolating vCPUs and RAM (both the CPU's view and a device's view). The hypervisor component of the hyperkernel will only contain this logic. That is, the hyperkernel's drivers, schedulers, RAM allocation, etc... will all exist in virtual machines, outside of the hypervisor. These components can live in their own virtual machines (likely the type 1 case), in a single virtual machine (likely the type 2 case), or any combination of the two. Keeping the hypervisor logic small not only helps reduce complexity, but also drastically improves security.
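To make the split concrete, here is a minimal sketch of the kind of narrow interface the hypervisor component might expose. The class and method names (isolation_api, map_pages, etc.) are illustrative assumptions for this RFC, not existing Bareflank APIs.

```cpp
// Hypothetical sketch of the narrow interface the hyperkernel's hypervisor
// component might expose. Names (isolation_api, map_pages, vcpu_id, etc.)
// are illustrative only and are not part of the Bareflank codebase.

#include <cstdint>
#include <cstddef>

using vcpu_id = std::uint64_t;
using vm_id = std::uint64_t;

// The hypervisor owns nothing but the mechanisms that enforce isolation:
// vCPU contexts and the EPT / VT-d mappings that bound a VM's view of RAM.
class isolation_api
{
public:
    virtual ~isolation_api() = default;

    // vCPU isolation: create / run / destroy a hardware virtualization context.
    virtual vcpu_id create_vcpu(vm_id vm) = 0;
    virtual void run_vcpu(vcpu_id id) = 0;
    virtual void destroy_vcpu(vcpu_id id) = 0;

    // RAM isolation: map pages into a VM's EPT (CPU view) and IOMMU (device view).
    virtual void map_pages(vm_id vm, std::uint64_t gpa, std::uint64_t hpa, std::size_t count) = 0;
    virtual void unmap_pages(vm_id vm, std::uint64_t gpa, std::size_t count) = 0;
};

// Scheduling, RAM allocation, drivers, etc. are *not* part of this interface;
// they live in VMs that call into it (e.g., via vmcalls).
```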

The problem with this design is the chicken and egg problem. For example, if the scheduler, exec VM, and RAM allocation all exist in their own VMs, who starts these VMs when nothing is yet running to start them? To overcome this issue, the hyperkernel will leverage Bareflank's existing bfdriver / ELF loader bootstrapping code to pre-load the hypervisor and the system critical VMs. Just like Bareflank, the hyperkernel will use this bootstrap logic from either UEFI (type 1 case) or an existing host OS (like Linux, Windows, etc... for the type 2 case). Once the hypervisor is up and running, the bootstrap logic will provide the hypervisor with a list of pre-loaded VMs for the hypervisor to execute. In the type 1 case, this would include VMs like the scheduler, exec VM, and RAM allocation.
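A rough sketch of the hand-off the bootstrap driver might perform is shown below, assuming one simple descriptor per pre-loaded VM. The struct layout and field names are hypothetical, not Bareflank's actual ABI.

```cpp
// Hypothetical sketch of what the bootstrap driver might hand to the hypervisor
// once it is up: a list of pre-loaded, system-critical VM images. The struct and
// field names are assumptions for illustration.

#include <cstdint>
#include <cstddef>

struct preloaded_vm
{
    const char *name;        // e.g. "ram_vm", "scheduler_vm", "exec_vm"
    std::uint64_t image_gpa; // where the ELF image was pre-loaded by bfdriver
    std::size_t image_size;  // size of the pre-loaded image in bytes
    std::uint64_t entry;     // entry point resolved by the ELF loader
};

struct bootstrap_info
{
    const preloaded_vm *vms; // type 1: scheduler, exec, and RAM VMs; type 2: possibly empty
    std::size_t num_vms;
};

// After launching the hypervisor, the bootstrap code would pass a bootstrap_info
// so the hypervisor can start each system-critical VM in turn.
```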

Once the system critical VMs have been launched, the exec VM can launch additional virtual machines, which can be scheduled for execution by the scheduler VM.

RAM VM:

Just like the scheduler creates and destroys execution threads, a dedicated RAM VM will alloc / free resources. Once allocated, the hypervisor will guard / isolate said RAM using EPT, VT-d, etc... The important point here is that the hypervisor doesn't own a giant chunk of RAM that it gives out to VMs. Instead, a RAM VM owns free memory, and gives out pages similar to sbrk. If the hypervisor, the exec VM, or a vm-app needs additional RAM, it can allocate this RAM by asking the RAM VM for more resources. Although this design is less traditional, it solves a problem unique to Bareflank. Since Bareflank supports multiple operating systems and hypervisor types, it's not possible to give all free memory to the hypervisor because the type 2 case should only use as much free memory as needed (to leave the remaining memory for the host OS to manage). Allowing a virtual machine to own free RAM satisfies both the type 1 case and the type 2 case: the bootstrap driver can provide all free memory to a dedicated RAM VM while bootstrapping in the type 1 case, while the hypervisor and VMs can ask the host OS for more memory in the type 2 case when needed.
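For illustration, here is a minimal sketch of an sbrk-style page request made to the RAM VM over a vmcall-based channel. The opcodes, request layout, and vmcall() helper are assumptions for this sketch, not the actual Bareflank vmcall ABI.

```cpp
// Minimal sketch of an sbrk-style page request to a dedicated RAM VM, assuming a
// vmcall-based IPC channel. The opcodes, the ram_vm_request layout, and the
// vmcall() helper are hypothetical.

#include <cstdint>
#include <cstddef>

enum class ram_op : std::uint64_t { alloc_pages = 1, free_pages = 2 };

struct ram_vm_request
{
    ram_op op;
    std::size_t num_pages;   // how many 4 KiB pages the caller wants
    std::uint64_t base_gpa;  // filled in by the RAM VM on success
};

// Hypothetical transport: forwards the request to the RAM VM through the
// hypervisor, which then installs EPT / VT-d mappings for the granted pages.
extern "C" bool vmcall(ram_vm_request *req);

std::uint64_t alloc_pages(std::size_t num_pages)
{
    ram_vm_request req{ram_op::alloc_pages, num_pages, 0};
    if (!vmcall(&req)) {
        return 0; // RAM VM refused (e.g., type 2 host OS is low on memory)
    }
    return req.base_gpa;
}
```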

Scheduler VM:

The scheduler VM runs the code needed to schedule virtual machines (both vm-apps and traditional VMs). In the type 1 case, this could be a dedicated VM with a custom scheduling algorithm, or it could be an existing scheduler from an OS that is started at boot (e.g. after the hypervisor and system critical VMs have been loaded, UEFI could continue to boot an existing operating system like Windows, Linux, etc... and its existing scheduler could be used). In the type 2 case, the host OS's scheduler would likely be used.
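As a toy illustration, a dedicated scheduler VM could look something like the following, where a trivial round-robin policy picks the next vCPU and asks the hypervisor to run it. schedule_next() and vmcall_run_vcpu() are hypothetical names; a real scheduler VM (or the host OS scheduler in the type 2 case) would supply its own algorithm.

```cpp
// Hypothetical sketch of a scheduler VM: it decides which vCPU should run next
// and asks the hypervisor to switch to it via a vmcall. All names are
// illustrative only.

#include <cstdint>
#include <cstddef>
#include <vector>

using vcpu_id = std::uint64_t;

// Provided by the hypervisor (assumed): hand the physical core to this vCPU.
extern "C" void vmcall_run_vcpu(vcpu_id id);

// A trivial round-robin policy; a real scheduler would replace this.
class round_robin_scheduler
{
public:
    void add_vcpu(vcpu_id id) { m_runnable.push_back(id); }

    void schedule_next()
    {
        if (m_runnable.empty()) {
            return;
        }
        vcpu_id next = m_runnable[m_index];
        m_index = (m_index + 1) % m_runnable.size();
        vmcall_run_vcpu(next);
    }

private:
    std::vector<vcpu_id> m_runnable;
    std::size_t m_index{0};
};
```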

Exec VM:

The exec VM executes new virtual machines. The hyperkernel will not only isolate each vm-app, but it will also isolate based on groupings called "domains" (not to be confused with a Xen domain). Each domain that the hyperkernel is managing will have its own exec VM, providing domain isolation from the most privileged VMs all the way down to the individual vm-apps. Each domain will have a vCPU per physical core (and thus a VMCS, exit handler, etc... per physical core) to run vm-apps, as well as a set of vCPUs for traditional OS's. vm-apps in the same domain will share these vCPUs, reducing the number of VMCS's that are needed and making better use of the CPU's caches when VPID is enabled, while the hypervisor maintains isolation using traditional ring 0 / ring 3 isolation techniques. Traditional OS's however will have their own dedicated set of vCPUs like traditional hypervisors. This design maximizes efficiency without sacrificing isolation.
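A sketch of how the domain grouping described above might be represented is shown below; the types and field names are illustrative assumptions, not existing hyperkernel data structures.

```cpp
// Hypothetical sketch of the domain grouping: each domain gets one vm-app vCPU
// per physical core (shared by all vm-apps in that domain), plus dedicated
// vCPUs for any traditional OS it hosts. Names are illustrative only.

#include <cstdint>
#include <vector>

using vcpu_id = std::uint64_t;

struct vm_app
{
    std::uint64_t id;        // identifies the vm-app within its domain
};

struct domain
{
    std::uint64_t id;

    // One vCPU (and thus one VMCS / exit handler) per physical core, shared by
    // every vm-app in the domain; ring 0 / ring 3 separation isolates the apps.
    std::vector<vcpu_id> app_vcpus;

    // Traditional OS guests get their own dedicated vCPUs, as on a
    // conventional hypervisor.
    std::vector<vcpu_id> os_vcpus;

    std::vector<vm_app> apps;
};
```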

To launch a VM, the exec VM will allocate RAM from the RAM VM, load the contents of the vm-app / OS into RAM, perform any needed ELF relocations / processing, and then inform the hypervisor (using the same APIs that the bootstrap driver used) to launch the vm-app / OS. Once a vm-app has concluded its execution, the exec VM will keep the executable cached in RAM in case the executable needs to be executed again. The RAM VM will have a means to ask the exec VM whether it has RAM it can give back in the event that resources become constrained. Caching vm-apps is specifically designed to optimize UNIX style applications that tend to be executed by scripts at an alarming rate (e.g. think about how many times the [ binary is executed in a single bash script).
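The launch and caching path could be sketched roughly as follows, assuming vmcall-style helpers for talking to the RAM VM and the hypervisor. alloc_pages(), load_and_relocate_elf(), vmcall_launch_vm(), and the cache layout are all hypothetical stand-ins for whatever the hyperkernel ends up using.

```cpp
// Rough sketch of the exec VM launch path. All helper names are hypothetical.

#include <cstdint>
#include <cstddef>
#include <map>
#include <string>

struct cached_image { std::uint64_t base_gpa; std::uint64_t entry; };

// Assumed helpers (see the RAM VM sketch above for alloc_pages).
std::uint64_t alloc_pages(std::size_t num_pages);
std::uint64_t load_and_relocate_elf(const std::string &path, std::uint64_t base_gpa);
extern "C" void vmcall_launch_vm(std::uint64_t base_gpa, std::uint64_t entry);

// Cache of already-loaded vm-apps so scripts that exec the same binary over and
// over (e.g. '[') don't pay the load / relocation cost every time.
static std::map<std::string, cached_image> g_image_cache;

void launch_vm_app(const std::string &path, std::size_t num_pages)
{
    auto it = g_image_cache.find(path);
    if (it == g_image_cache.end()) {
        std::uint64_t base = alloc_pages(num_pages);             // ask the RAM VM
        std::uint64_t entry = load_and_relocate_elf(path, base); // ELF load + relocs
        it = g_image_cache.emplace(path, cached_image{base, entry}).first;
    }

    // Same style of API the bootstrap driver used to start the system critical VMs.
    vmcall_launch_vm(it->second.base_gpa, it->second.entry);
}
```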

File System / Networking / Serial / Etc:

Like a micro-kernel, various different services will be provided in their own VMs. The hyperkernel will leverage VT-d where possible to provide individual vm-apps / OS's with better isolation. Ideally, new device drivers, or so-called AnyKernel drivers, will be used to manage devices. In the event this is not possible, the hyperkernel architecture provides a unique opportunity to leverage existing device drivers from existing monolithic kernels if needed, while still being able to provide isolation and generic application support. For example, a vm-app could use open() to open a file handle to a file that it is interested in. Under the hood, libc will use the hypervisor's vmcall interface to make a VM->VM connection between the vm-app and the VM managing the filesystem (assuming policy allows), which could be Linux, and using this IPC mechanism, can acquire the needed handle. This same architecture could apply to more complicated devices like network devices. Existing monolithic kernels like Linux could be used to manage various different wired / wireless devices, abstracting these devices to a set of common IPC interfaces that vm-apps could connect to and use via an existing user-space socket library. Over time, as custom, hyperkernel specific device drivers become available, dependencies on Linux could be removed.
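As a concrete (hypothetical) example of the open() path, libc could package the request and route it to the filesystem VM through a vmcall. fs_request, vmcall_fs(), and the opcodes below are assumptions for illustration, not an existing interface.

```cpp
// Illustrative sketch of how a vm-app's open() could be backed by VM-to-VM IPC:
// libc packages the request, the hypervisor's vmcall interface routes it to the
// filesystem VM (policy permitting), and the returned handle is used as a file
// descriptor. The request layout and helper names are assumptions.

#include <cstdint>
#include <cstring>

enum class fs_op : std::uint64_t { open = 1, read = 2, write = 3, close = 4 };

struct fs_request
{
    fs_op op;
    char path[256];          // path as seen by the filesystem VM
    std::uint64_t flags;
    std::int64_t handle;     // filled in by the filesystem VM on success
};

// Hypothetical hypervisor-mediated channel to the VM managing the filesystem.
extern "C" bool vmcall_fs(fs_request *req);

int hyperkernel_open(const char *path, std::uint64_t flags)
{
    fs_request req{};
    req.op = fs_op::open;
    std::strncpy(req.path, path, sizeof(req.path) - 1);
    req.flags = flags;

    if (!vmcall_fs(&req) || req.handle < 0) {
        return -1; // connection refused by policy, or file not found
    }

    // A real libc shim would map the remote handle into a local descriptor
    // table; here we just return it directly.
    return static_cast<int>(req.handle);
}
```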

Furthermore, this design supports both the type 1 and type 2 cases. For type 1, devices are divided into different VMs based on the availability of VT-d and how each device driver is implemented. The type 2 case, however, would simply use the host OS for device support, and implement the needed IPC mechanisms that the vm-apps / OS's need.

Thoughts?

@brendank310 @ktemkin @tklengyel @connojd @mor619dx @RicoAntonioFelix @alfred-bratterud @ilovepi

alfreb commented 7 years ago

You're the guru here - this looks really cool and we'd be happy to collaborate on the details!

rianquinn commented 7 years ago

The initial version of this is done. There is still a ton of work to do here, but if you want to see a VM app working, feel free to give the hyperkernel repo a try.