Solo5 / solo5

A sandboxed execution environment for unikernels
ISC License

Hardware access implementation discussion #425

Open bonkf opened 5 years ago

bonkf commented 5 years ago

tl;dr: I'll be implementing PCIe and DMA support in hvt and MirageOS and am looking for other people's requirements/suggestions.

I'm about to start my master's thesis at the Chair of Network Architectures and Services at the Technical University of Munich. The goal of my thesis is to implement direct hardware access (PCIe and DMA) into MirageOS, specifically to allow drivers (including my driver ixy.ml) to take direct control of hardware devices.

Motivation

The main reason hardware support is of interest is obviously performance: Moving packet buffers (or any other data) between unikernel and hypervisor/runtime takes time. Having the unikernel directly control its hardware should (hopefully) show some significant performance improvements. I'll mostly focus on networking, but hardware support could also be the foundation for other people to build drivers on top of, maybe something like storage drivers (NVMe?).

Additionally, as our paper showed, there's a case to be made for writing drivers in high-level languages rather than using the standard C drivers baked into your garden-variety Linux kernel.

Finally, I'm also interested in the flexibility this brings for unikernels: @hannesm and I discussed implementing batch rx/tx and zero-copy in MirageOS' network stack, something we could more easily test with ixy.ml, which already has support for batching. Parts of #290, for example, could be implemented inside the unikernel itself, keeping Solo5's codebase small and maintainable.

I'm hoping to get some discussion on implementation details going here. The reason I'm opening this issue here instead of on the main Mirage repo is that most of the changes I'll have to make will be on Solo5. Right now I'm only looking at mirage-solo5 and hvt on Linux as I'm hoping to get a proof-of-concept up and running early into the 6 months I have to finish the thesis. Depending on how much time I have left I'll also take a look at the BSDs.

So let's talk details! There are two main features my driver specifically needs.

PCIe

First off, PCIe register access: ixy.ml needs to configure the network card it wants to control by writing to its configuration registers. On Linux those registers are mapped to the /sys/bus/pci/devices/$PCI_ADDRESS/resource* files. Additionally, ixy.ml needs to enable DMA by setting a bit in the PCI configuration space, which is mapped to /sys/bus/pci/devices/$PCI_ADDRESS/config. There needs to be some way for unikernels to access these files in an implementation-hiding fashion. There are also some trivial details to take care of, like what the command-line flag for mapping a device into a unikernel should look like (--pci=0000:ab:cd.e, for example).
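
For illustration, this is roughly what that access looks like for an ixy-style user-space driver on Linux today (a minimal sketch; error handling is omitted, and the bus-master bit is bit 2 of the PCI command register at config-space offset 4). Presumably the tender would do something similar on the unikernel's behalf:

```c
/* Sketch: map BAR0 of a PCI device via sysfs and enable DMA by setting
 * the bus-master bit in the PCI command register. Error handling omitted;
 * needs sufficient privileges. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static volatile uint8_t *map_bar0(const char *pci_addr, size_t *len)
{
    char path[128];
    snprintf(path, sizeof path, "/sys/bus/pci/devices/%s/resource0", pci_addr);
    int fd = open(path, O_RDWR);
    struct stat st;
    fstat(fd, &st);
    *len = st.st_size;
    void *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return p;
}

static void enable_dma(const char *pci_addr)
{
    char path[128];
    snprintf(path, sizeof path, "/sys/bus/pci/devices/%s/config", pci_addr);
    int fd = open(path, O_RDWR);
    uint16_t cmd;
    pread(fd, &cmd, 2, 4);   /* PCI command register at offset 4 */
    cmd |= (1 << 2);         /* bus master enable -> device may do DMA */
    pwrite(fd, &cmd, 2, 4);
    close(fd);
}

int main(void)
{
    const char *addr = "0000:ab:cd.e";   /* placeholder PCI address */
    size_t len;
    volatile uint8_t *regs = map_bar0(addr, &len);
    enable_dma(addr);
    (void)regs;   /* a driver like ixy.ml would now poke these registers */
    return 0;
}
```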

DMA

Secondly, there's the (more complicated) issue of DMA: modern high-performance PCIe devices such as network/storage/graphics cards all use DMA (direct memory access) to communicate with their host system. Generally a driver instructs a device to read/write data from/to specific physical memory addresses. Those addresses must never change (without the driver knowing, I guess), otherwise the device will access arbitrary memory, leading to stability and (more importantly) security problems. Imagine the OS places some secret data at a physical location that was previously mapped to the device, and the device has not been reconfigured: a NIC, for example, would happily send out your private data as network packets! This is not acceptable, and so @mato and I already decided that IOMMU support is 100% required.

The IOMMU does for PCIe devices what your processor's MMU does for your programs: It translates virtual addresses into physical addresses. When we configure our DMA mappings into the IOMMU (for example by using Linux's vfio framework like ixy.c and ixy.rs already do) the device will also use virtual addresses. This means the driver won't have to take care of translating between virtual and physical addresses. Additionally the IOMMU will block any access to memory areas outside of the mappings the device is allowed to access.
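
To make that concrete, here is a minimal sketch of the vfio flow on Linux (modeled on what ixy.c does; the IOMMU group number, IOVA and buffer size are placeholders, and all error/status checks are omitted):

```c
/* Sketch: attach a device's IOMMU group to a vfio container, enable the
 * Type-1 IOMMU and map a buffer so the device can reach it at a chosen
 * I/O virtual address (IOVA). Group number, IOVA and size are placeholders;
 * error handling omitted. */
#include <fcntl.h>
#include <linux/vfio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

#define DMA_SIZE (16 * 1024 * 1024)   /* what a --dma=16M flag could mean */

int main(void)
{
    int container = open("/dev/vfio/vfio", O_RDWR);
    int group = open("/dev/vfio/42", O_RDWR);   /* the NIC's IOMMU group */

    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

    /* Any anonymous memory will do; huge pages would reduce IOTLB pressure. */
    void *buf = mmap(NULL, DMA_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (unsigned long)buf,
        .iova  = 0x10000000,          /* the address the device will use */
        .size  = DMA_SIZE,
    };
    ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
    return 0;
}
```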

When configuring the IOMMU we must also look at what hardware we are actually running on: the TLB inside the IOMMU usually has far fewer entries than the main MMU's TLB, and different page sizes (4 KiB and 2 MiB on x86_64, for example) are available on different architectures.

So how should I go about implementing this? As I see it there needs to be some command line flag like --dma=16M, instructing hvt to configure a 16 Mebibyte mapping into the IOMMU. Then there needs to be a way for the unikernel to retrieve its mappings.

I think some mechanism for drivers to indicate their specific requirements (possibly before actually running) would be helpful for users. For example ixy needs some contiguous DMA areas for its packet buffer rings. Otherwise driver authors would be forced to do something like this when these requirements aren't met:

[ERROR] I am $driver and I need more memory. Please run again with --dma=32M

Suggestions?

Those were just the features ixy.ml and, I imagine, other network drivers need. I'm interested in other people's requirements; are there applications that need other features? Do you have suggestions/wishes/tips?

fwsGonzo commented 5 years ago

One of the first things we got asked was how to enable VLAN support. So, that should probably be on your list of things to support. I don't know how much you have to do on the link layer end, but it's probably not going to be that much. :) Good luck

avsm commented 5 years ago

Thanks for the excellent upfront planning of your forthcoming thesis work; I look forward to seeing this progress! My initial reaction is to encourage you to look at previous work in this space as regards virtualising PCI; there are a number of details and device-specific quirks that previous hypervisors have run into (for instance, when expanding out to display hardware or other non-networking PCI devices).

In particular, Xen has a well-trodden pci front/back split device that also has been deployed on quite a variety of real-world PCI hardware. Its pcifront PV interface should provide good inspiration for the Solo5 support, but without the Xen-specific pieces such as its ring.h and Xenstore bootstrap.

KVM has yet another approach which requires Intel VT-d for the majority of use cases (and, on arm64, fairly intricate use of MSIs).

My instinct is to push for the "most software" solution that doesn't leave hardware underutilised, and the Xen PV approach seems to fit that bill. VT-d is a very large hammer to use for virtualising PCI devices.

Once you do decide on a good mechanism to get pages through to the host device, I'd be curious to know if you're interested in putting an FPGA into the mix. The Mirage stack makes it fairly easy to decide on the structure of the inputs to the software, which do not have to be what we traditionally have (TCP/IP offload and checksumming). If you had a hardware device that could do some custom protocol-level parsing, we could implement protocols that are not traditionally hardware-accelerated by implementing some scanning logic in the "NIC/FPGA" and passing those structures through to a software Mirage stack to parse. I'm thinking specifically of QUIC/HTTP3 here, which is difficult to accelerate using TCP offload since... it's not TCP :-) This is a very exciting research direction to take the Mirage software stack, and particularly so in the context of RISC-V which @kayceesrk @samoht @nojb and I have been working on recently.

bonkf commented 5 years ago

@fwsGonzo What do you mean by VLAN support and how does it relate to hardware access? Do you mean VLAN offloading? That would have to be done on the driver/network stack end.

@avsm Thanks for all the resources! I'll look through everything!

One question up front, though: Is Xen really the right approach? This sounds very naive, but do we want unikernels to deal with the intricacies of PCIe devices? My initial idea was to bind our device to the vfio-pci driver, map the configuration space (maybe even read-only if we enable DMA by default) and the resources into the unikernel's address space and hope for the best. DMA would be handled the same way ixy.c does: Allocate some memory and tell vfio about it to create an IOMMU mapping. Then we just need to pass the unikernel the virtual and "physical" addresses. That should theoretically be enough to get ixy.ml running on hvt/KVM (after wrapping everything in OCaml, of course).
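
For completeness, "bind our device to the vfio-pci driver" would boil down to something like this on the Linux host, using the sysfs driver_override mechanism (a sketch; the PCI address is a placeholder and error handling is omitted):

```c
/* Sketch: rebind a PCI device from its current kernel driver to vfio-pci
 * via sysfs. Needs root; the PCI address is a placeholder. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void write_str(const char *path, const char *s)
{
    int fd = open(path, O_WRONLY);
    write(fd, s, strlen(s));
    close(fd);
}

int main(void)
{
    const char *dev = "0000:ab:cd.e";   /* placeholder PCI address */
    char path[128];

    /* Detach from the current driver (e.g. the kernel's ixgbe), if any. */
    snprintf(path, sizeof path, "/sys/bus/pci/devices/%s/driver/unbind", dev);
    write_str(path, dev);

    /* Make vfio-pci claim the device on the next probe. */
    snprintf(path, sizeof path, "/sys/bus/pci/devices/%s/driver_override", dev);
    write_str(path, "vfio-pci");
    write_str("/sys/bus/pci/drivers_probe", dev);
    return 0;
}
```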

Other host OS would be a completely different story, of course, and I admittedly have not looked into that yet.

I'm definitely interested in other hardware besides network cards; I just happen to know a bit about network cards already. I don't have any experience regarding FPGAs; I know what they are and that's about it. I'd have to ask around at university if there are specific FPGAs around I could experiment with. Custom protocol-level parsing goes a bit beyond the scope of my thesis though :-)

Do you have resources and/or tips for approaching RISC-V? Especially IOMMU stuff? I'll start with x86_64 for now but I'm always curious about other architectures.

I feel like I'm approaching the problem from too much of an ixy.ml-focussed perspective: ixy.ml doesn't care about anything Linux's PCI code does and also just runs as root without a care in the world.

fwsGonzo commented 5 years ago

@Reperator Sorry, I misunderstood your task then. :)

anmolsahoo25 commented 5 years ago

This is quite an impressive plan for your thesis! Wishing you all the best for your progress!

@avsm, I think even the Xen virtualization link talks about using the Intel VT-d extensions for PCI passthrough, if I am right?

HVM guests require no special configuration for the guest kernel, as all accesses are emulated and virtualized by the IOMMU hardware.

In that case, it would always be advisable to use the VT-d extension, because I assume Xen is taking care of the address mappings through the pcifront/pciback interfaces. Also, this comparison uses vfio-pci on Linux, which provides an interface to Intel VT-d virtualization, and they do not report any significant performance impact.

In that case, @Reperator, whether you are going through Xen or KVM, both of them implement PCIe passthrough using the Intel IOMMU.

One question up front, though: Is Xen really the right approach? This sounds very naive, but do we want unikernels to deal with the intricacies of PCIe devices?

I think I misunderstand some part here. When you write a userspace PCIe driver against vfio-pci, you still have to manage the PCI device, i.e. map the configuration registers and BARs into the address space of the process and do the reads/writes according to the device specification? vfio-pci does the minimal work of exposing a userspace device for PCI transactions, setting up the groups and containers according to the IOMMU, and routing the transactions to the PCI subsystem. So why would the case of Xen be special? As far as I can understand, even in Xen, PCI passthrough would expose xen-pcifront, which takes care of the same PCI-specific behavior that vfio-pci handles.

I think my major concern is with regard to multiple guest processes trying to access the PCI cards. For example, in the case of multiple processes running on Linux which need to access the NIC, the driver is in the kernel, so it can take the requests from each application, do the necessary work and dispatch them. In the case of vfio-pci, multiple user processes could open the /dev/vfio/vfio and /dev/vfio/$GROUP devices, map them into their own virtual address spaces and start sending data to the NIC. How do you ensure that they synchronize properly? I am talking here about launching multiple guest Mirage processes on the same processor where each one wants to access the NIC. In that case, I am not sure whether you can have global page mappings for each process corresponding to the NIC or something similar, or how to ensure synchronization.

Anyway, the above concern is more on the practical side of things, which I am not sure you intend to address in your research. Finally, RISC-V does not have a complete virtualization story yet. There are draft hypervisor specifications and KVM ports underway, but nothing on the I/O virtualization side yet. I am open to questions about the RISC-V state of the union, if required.

bonkf commented 5 years ago

When you write a userspace PCIE driver against vfio-pci, you still have to manage the PCI device, i.e. mapping the configuration registers and BARS into the address space of the process and doing the reads/writes according to the device specification?

I was going to map everything the device offers into the unikernel's address space and be done with it. That's what I meant by having unikernels deal with the intricacies. I actually don't know how hypervisors like Xen that target "real" OSs deal with PCIe; I just assumed that VMs would have to do all the same setup steps (bus enumeration, register mappings, etc.) as bare-metal OSs.

I figured it'd be easier for unikernel developers to just have PCIe devices "magically" show up in a fixed place (like an array at address 0xwhatever containing pointers to each device's registers). Alternatively I could add a hypercall hvt_activate_pci_mappings(), so unikernels unaware of PCIe can't mess anything up by accidentally accessing the mapped registers.
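
To make the two alternatives a bit more concrete, I'm imagining something along these lines (purely hypothetical; none of these names or structures exist in Solo5 today):

```c
/* Hypothetical sketch of the two alternatives; nothing here exists in
 * Solo5 today and all names/fields are guesses. */
#include <stddef.h>
#include <stdint.h>

/* Per-device information the tender could expose to the unikernel. */
struct hvt_pci_device {
    uint16_t vendor_id;
    uint16_t device_id;
    volatile void *bar0;   /* device registers, already mapped by the tender */
    size_t bar0_len;
};

/* Alternative 1: a fixed-address table ("an array at address 0xwhatever")
 * the unikernel can simply read. */
#define HVT_PCI_TABLE_ADDR 0x100000000ULL   /* placeholder address */

/* Alternative 2: an explicit hypercall, so unikernels that are unaware of
 * PCIe never see the mappings unless they ask for them. */
int hvt_activate_pci_mappings(struct hvt_pci_device *devs, size_t max_devs);
```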

I have thought a bit about multiple processes accessing the same device, though admittedly not too much. For my initial prototype I was going to take the "user must make sure to only use each device once" stance.

I am talking here with respect to launching multiple guest Mirage processes on the same processor and each one wants to access the NIC.

That ain't happening: one unikernel will take full control of the device. The others will have to use different NICs. I don't see an easy solution for this in the general case; with modern NICs specifically you could theoretically use different rx/tx queues for each unikernel and have the NIC multiplex in hardware, but then the driver would have to run either in Solo5 (@mato is screaming already ;-)) or one unikernel would have to configure the NIC for all the others. Don't think that's viable. I guess in the distant future we could have something of a "plug-in" unikernel that exposes functionality to other unikernels in a structured fashion.

TImada commented 5 years ago

That ain't happening: one unikernel will take full control of the device. The others will have to use different NICs. I don't see an easy solution for this in the general case;

I think a general approach for sharing a PCIe device is SR-IOV and VFs (Virtual Functions). Each MirageOS unikernel with a device driver can manage an assigned VF (i.e. a virtualized device) independently, without interfering with the others.

anmolsahoo25 commented 5 years ago

That sounds about right. As far as I understand, the Xen pci-front/pci-back does something similar to vfio-pci: it exposes just the configuration space and BARs to your application, mapped as memory addresses in the process, so you will not have to do the bus enumeration etc.

The following is equivalent to binding the vfio-pci driver in KVM/Linux.

Then you need to assign the device to pciback instead of its normal driver in dom0

This is how a virtual guest would access the device -

HVM guests require no special configuration for the guest kernel, as all accesses are emulated and virtualized by the IOMMU hardware. PV guests need the xen-pcifront module (just 'pcifront' for classic Xen kernels).

So it would seem that you'd have to bind the xen-pcifront code in your unikernel (if your guest is PV) and you should be good to go!

Though, I still lack the necessary expertise in this area, and it would be preferable if you consulted with someone who understands these intricacies better!

anmolsahoo25 commented 5 years ago

Also, my comment about multiple user processes accessing the NIC was raised only in the context of Jitsu, where you spawn processes for each request. I couldn't think of a clear way of setting such a system up, given each unikernel would handle the networking stack as well, so I asked the question to get some ideas!

Anyway, I think virtualization is an orthogonal issue. Once we have unikernels which take care of the network stack, virtualization can be looked at. As @TImada suggests, SR-IOV could be one way to do it among others!

I am definitely intrigued by dedicated processes existing to handle the NICs, maybe multiplexing over cards using SR-IOV, with worker processes communicating their IP requests to each of these, sort of like a load-balancing setup for multi-NIC systems! Though I think this is veering off topic and is a discussion for some other issue!

hannesm commented 5 years ago

Thanks for this interesting discussion.

multiple guest processes trying to access the PCI cards

not sure why this should be Solo5's responsibility -- for tap devices it is not atm (it is instead the responsibility of the surrounding orchestration system, e.g. your shell scripts), which I think is a very fine solution.

Kensan commented 5 years ago

I figured it'd be easier for unikernel developers to just have PCIe devices "magically" show up in a fixed place (like an array at address 0xwhatever containing pointers to each device's registers).

This is what we do for native Ada/SPARK components on Muen. Since the whole system is static, we can have these addresses as constants at compilation time. For subjects that do resource discovery during runtime/startup we have a mechanism called Subject Info which enables querying e.g. memory mappings and their attributes (rwx) for a given name. This facility is already used in the Solo5/Muen bindings to get the memory channels for memory stream network devices, see https://github.com/Solo5/solo5/blob/6a98a1938a9dcb0cb7a3e92d7e56d5019339d083/bindings/muen/muen-net.c#L159. If Solo5 provided a way for unikernels to query similar information, the driver could set up its memory mappings initially and then start operation.

Another idea worth investigating could be how the new manifest could be leveraged to fit this use case.

mato commented 5 years ago

To summarize some discussion I've already had with @Reperator and @Kensan elsewhere: From the Solo5 point of view, ease of portability, isolation and a clear separation of concerns are important goals if this is to be eventually up-streamed.

With that in mind, I would aim for a minimal implementation that starts with the hvt target on Linux/KVM only, uses the most modern Linux interfaces available (hence vfio) and requires an IOMMU for isolation. At the same time, we should try and design the Solo5 interfaces (public and "internal") in such a way that they could be implemented for non-Linux hosts such as Muen.

Roughly, the implementation should:

  1. Tender-side:
    1. Implement a new module, say pci,
    2. Define a PCI_BASIC device of some form in the manifest.
    3. Implement the needed hypercalls in the tender/bindings interface, if any. We may not actually need any, since in the process of attaching the device the tender should set up everything required and pass the relevant information needed to the unikernel by filling in a struct mft_pci_basic and/or adding data to struct hvt_boot_info if required.
    4. Implement --pci-dev:NAME=DEVICE to attach a PCI_BASIC device and --pci-dma:NAME=N to provide DMA memory (but see below).
  2. Bindings-side:
    1. Implement a solo5_pci_acquire() API which would return the relevant information to the unikernel (a rough sketch of what this could look like follows below).
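
A rough sketch of what that API could look like, modelled on the existing solo5_net_acquire()/struct solo5_net_info pattern (everything below is tentative, field names included):

```c
/* Hypothetical sketch only: a PCI_BASIC acquire API in the style of
 * solo5_net_acquire(). Names and fields are not final. */
#include "solo5.h"   /* assumed: provides solo5_result_t */
#include <stddef.h>
#include <stdint.h>

struct solo5_pci_info {
    uint16_t vendor_id;
    uint16_t device_id;
    void *bar0;          /* BAR0 registers, mapped by the tender */
    size_t bar0_size;
    void *dma_base;      /* DMA-able memory, already mapped in the IOMMU */
    uint64_t dma_iova;   /* address the device uses for dma_base */
    size_t dma_size;
};

/* Fills in *info for a PCI_BASIC device declared in the manifest under
 * the given name and attached with --pci-dev:NAME=DEVICE. */
solo5_result_t solo5_pci_acquire(const char *name, struct solo5_pci_info *info);
```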

Regarding the requirement for DMA memory: You're right that requiring the operator to specify the "right" amount of DMA memory when starting the tender is somewhat unintuitive and fragile. When I was designing the manifest I considered defining entries for other resources, such as memory, but did not go down that route since I wanted to push out the user-visible features enabled by the manifest sooner rather than later.

In order to understand how this could be designed, I have a couple of questions:

  1. If a unikernel were to control multiple PCI devices in this fashion, does each device get its own DMA memory mapping? I.e. is the DMA memory mapping essentially a property of the device?
  2. If we make the unikernel (driver) developer specify the DMA memory mapping size up-front (i.e. it goes into the manifest in some form), are there cases where the operator would want to override the DMA mapping size?
bonkf commented 5 years ago

Thanks for the summary!

Regarding your questions:

  1. Generally no. There are cases where it'd be useful to have devices talk to each other directly via a common mapping. ixy.ml currently maps some specific memory for each NIC it controls and then some common memory where it stores packet buffers that may be used by any of its NICs. If we went down the route that each device may only use its own DMA memory, we'd have to copy lots of data between these areas. For example if we were to implement a router to forward packets from NIC A to NIC B, NIC A's receive buffers would have to be copied into NIC B's transmit buffers (instead of just passing NIC B a pointer to NIC A's receive buffer).
  2. Yes. Yet again I'm arguing from the perspective of ixy.ml: ixy.ml can theoretically be extended to work within as big or as small a DMA memory footprint as we want (I've just hardcoded some numbers for now). There is some minimum requirement for the ring buffers (currently I just use 2 MiB per NIC as this is the size of a huge page on x86_64), but the number of packet buffers floating around the system could theoretically be anything; using at least 2 MiB (huge page size) / 2 KiB (packet buffer size) = 1024 packet buffers makes sense, so as not to waste any space in the huge page.
mato commented 5 years ago

@Reperator:

  1. Ok, so if I read you correctly, "DMA memory" is a global system resource. Presumably with system-wide limits, and so on. Sorry if this is obvious, but I'm not an expert on the details of hardware pass-through. Can you point me to some example code of how such memory is allocated, on a modern Linux system with IOMMU and vfio?

  2. The values should definitely be configurable. I guess what I'm asking is, for, say, ixy.ml's use-case, are there reasonable defaults and what are they? If there are reasonable defaults, why would we run into the situation you mention in https://github.com/Solo5/solo5/issues/425#issue-509943176 "Otherwise driver authors would be forced to do something like this when these requirements aren't met:"?

bonkf commented 5 years ago

@mato:

  1. I have to add that ixy.ml currently uses Linux's hugetlbfs and no IOMMU, so DMA memory has the system-wide limit of "how many hugetlbfs pages has the user created?". Using the IOMMU allows mapping arbitrary memory (allocated using mmap with MAP_HUGETLB), so there's no real system-wide limit besides the IOMMU's TLB, which we don't want to overload. If there are more mappings active than the TLB can cache, performance gets significantly worse. Some example code that uses the IOMMU and vfio would be ixy's memory.c. This is just allocation; the actual mapping happens here.
  2. The defaults that are hardcoded into ixy.ml are reasonable imo. ixy.ml will attempt to satisfy those defaults (i.e. allocate 2 MiB per NIC and 8 MiB per mempool of which each NIC gets one by default, so technically 10 MiB per NIC + 8 MiB per additional mempool) and fail if it can't allocate enough memory. If the unikernel within which ixy.ml runs didn't have enough memory mapped to satisfy those hardcoded defaults, ixy.ml would also have to fail. The situation can only occur if the user gets to choose the amount of DMA memory and is allowed to reduce it below the minimum amount. The problem is that the minimum amount of memory depends on the number of NICs ixy.ml controls.
mato commented 4 years ago

@Reperator:

OK, looking at how the VFIO mapping works, it's just an ioctl on top of an mmap. So, I'd suggest starting with a minimal "strawman" PoC for hvt where, if a PCI_BASIC device is requested, the tender creates a single DMA memory region with a hardcoded size and passes its details to bindings, say via struct hvt_boot_info for now.

Once you verify that this approach actually works, we can then proceed figuring out what the "real" APIs and implementation should be.
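
As a strawman, the DMA region details passed to bindings could be as small as something like this (hypothetical; field names are guesses, not existing Solo5 code):

```c
/* Hypothetical: DMA region details the tender could hand to bindings,
 * e.g. as an extra member of struct hvt_boot_info. */
#include <stdint.h>

struct hvt_dma_info {
    uint64_t guest_addr;   /* where the region appears inside the unikernel */
    uint64_t iova;         /* address programmed into the IOMMU for the device */
    uint64_t size;         /* hardcoded for the PoC, e.g. 16 MiB */
};
```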

anmolsahoo25 commented 4 years ago

On a somewhat tangential note, do you think some sort of AVX/AVX2 codegen capability in OCaml would benefit your work? I am assuming that if you wanted to do some SIMD computation you'd have to bind to a C library, and my initial impression was that you would like to keep that to a minimum.

I had read in a Reddit thread that DPDK drivers are faster due to AVX/AVX2 but also completely unreadable. So I imagine that being able to use AVX/AVX2 from high level constructs might be useful.

bonkf commented 4 years ago

@anmolsahoo25: Possibly. But there is definitely lower-hanging fruit in the MirageOS network stack (zero-copy, batching).

anmolsahoo25 commented 4 years ago

Makes sense. Thanks!