firecracker-microvm / firecracker

Secure and fast microVMs for serverless computing.
http://firecracker-microvm.io
Apache License 2.0

Huge page support for guest memory #2139

Open pclesr opened 3 years ago

pclesr commented 3 years ago

Why is this feature request important? What are the use cases? Please describe.

Page faulting can increase startup time. In the development of Nitro Hypervisor support for arm, backing guest memory with huge pages made the difference between hitting the target SPEC numbers and not. Since there was no hugepage filesystem in Nitro Hypervisor, the only alternative was to make the kernel support huge pud/pmd mappings for arm. I would not advocate that route; I would rather have the ability to use a mounted hugetlbfs.

For embedded environments where everything is restricted, being able to allocate a specific number of huge pages at boot that can be used for guests would decrease the startup time of the app by reducing faults.

Describe the desired solution

Since not every environment will have huge pages, provide either a build-time or run-time option to have the guest memory that is registered via the KVM_SET_USER_MEMORY_REGION ioctl be backed by huge pages. I would propose using a mounted hugetlbfs and mmap() to allocate the memory. Since I barely know Rust, I can't tell how guest memory is currently allocated, but I would assume it is some mmap() call.
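
A minimal sketch of what that could look like, assuming a hugetlbfs is already mounted at /dev/hugepages and using the libc crate directly (the path, the size handling, and the map_hugetlbfs_region name are illustrative, not how Firecracker or vm-memory actually allocate guest memory):

use std::ffi::CString;
use std::io;
use std::ptr;

/// Map `size` bytes of guest memory from a file created on a mounted hugetlbfs.
/// `size` must be a multiple of the huge page size backing the mount.
fn map_hugetlbfs_region(path: &str, size: usize) -> io::Result<*mut libc::c_void> {
    let c_path = CString::new(path).expect("path must not contain NUL bytes");
    // SAFETY: plain libc calls on a path we own; every return value is checked.
    unsafe {
        let fd = libc::open(c_path.as_ptr(), libc::O_CREAT | libc::O_RDWR, 0o600);
        if fd < 0 {
            return Err(io::Error::last_os_error());
        }
        // Sizing the backing file on hugetlbfs reserves huge pages for it.
        if libc::ftruncate(fd, size as libc::off_t) < 0 {
            let err = io::Error::last_os_error();
            libc::close(fd);
            return Err(err);
        }
        let addr = libc::mmap(
            ptr::null_mut(),
            size,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_SHARED,
            fd,
            0,
        );
        libc::close(fd); // The mapping keeps the huge pages alive.
        if addr == libc::MAP_FAILED {
            return Err(io::Error::last_os_error());
        }
        Ok(addr)
    }
}

Example usage would be map_hugetlbfs_region("/dev/hugepages/guest_mem", 256 << 20) for a 256 MiB region on a 2M hugetlbfs mount. The other common approach, an anonymous mapping with MAP_HUGETLB, avoids the mount entirely (see the proof-of-concept patch later in this thread).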

Describe possible alternatives

perf tracing shows that a lot of time is spent page faulting, especially on arm. Obviously everything works without huge pages, but they could reduce not only startup time but also the time spent handling faults.

Additional context

Running on a poor, resource-starved 2-core A57 in an embedded environment. Since it's embedded, there is control over everything. The goal is to reduce startup time and overhead of the KVM calls.


pclesr commented 3 years ago

As an additional data point, when running on a two-core A57, putting the kernel image and initrd into a hugetlb filesystem resulted in a 20% improvement. When running 'perf -ag' and looking at where the time is being spent, the VMM was no longer spending all of its time faulting in devices::virtio::block::device::Block::process_queue. It still spends a lot of time handling faults from process_queue(), but at least that is no longer at the very top of the perf output.

Also, the console on arm is extremely expensive; 'quiet' on the kernel command line is your friend.

iulianbarbu commented 3 years ago

Hi @pclesr! Sorry for the delay, and thanks for logging this feature request. I think we are interested in exposing such a capability. We're currently using the guest memory primitives from rust-vmm/vm-memory, so we first need to contribute huge page support there and release a new vm-memory version that Firecracker can consume.

I've opened an issue on rust-vmm/vm-memory. We'll keep this issue here to track the progress on rust-vmm/vm-memory and the discussions around how Firecracker will consume and expose the feature.

EmeraldShift commented 3 years ago

Hi @iulianbarbu, I responded to the issue you opened on rust-vmm/vm-memory, expressing interest in it. We'd also like to work on the Firecracker end of the feature. Is there more information or context you can provide to help us get started?

serban300 commented 3 years ago

Hi @EmeraldShift! Personally, I think the question on the Firecracker end is how exactly to expose the feature to the customer. Should it be an API call? Or something else? We haven't discussed anything yet. In any case, we should wait for the rust-vmm/vm-memory implementation first; some aspects will depend on the design adopted there.

pclesr commented 3 years ago

I did a simple proof of concept by patching MmapRegion() in src/mmap_unix.rs and saw that page faults went through the hugetlb handler (verified by running perf and looking at the kernel stacks).

diff --git a/src/mmap_unix.rs b/src/mmap_unix.rs
index 5d23de0..f0983d1 100644
--- a/src/mmap_unix.rs
+++ b/src/mmap_unix.rs
@@ -109,7 +109,7 @@ impl MmapRegion {
             None,
             size,
             libc::PROT_READ | libc::PROT_WRITE,
-            libc::MAP_ANONYMOUS | libc::MAP_NORESERVE | libc::MAP_PRIVATE,
+            libc::MAP_ANONYMOUS | libc::MAP_NORESERVE | libc::MAP_PRIVATE | libc::MAP_HUGETLB,
         )
     }

Obviously, this is not generic, but I wanted to just see if it would work.

One possibility for Firecracker would be an option in the machine config that controls whether the pages for guest memory are backed by huge pages or normal pages.
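
A rough sketch of that idea, assuming the config knob ultimately just adds an mmap flag to the guest memory mapping (the HugePageConfig name, its variants, and guest_memory_mmap_flags are illustrative, not Firecracker's actual API):

// Hypothetical machine-config knob; the real field name and variants may differ.
#[derive(Clone, Copy)]
pub enum HugePageConfig {
    None,
    Hugetlb, // back guest memory with the system's default huge page size
}

fn guest_memory_mmap_flags(cfg: HugePageConfig) -> libc::c_int {
    let base = libc::MAP_ANONYMOUS | libc::MAP_NORESERVE | libc::MAP_PRIVATE;
    match cfg {
        HugePageConfig::None => base,
        // MAP_HUGETLB asks the kernel to back the anonymous mapping with huge
        // pages; a MAP_HUGE_* flag could additionally pin a specific page size.
        HugePageConfig::Hugetlb => base | libc::MAP_HUGETLB,
    }
}

Defaulting to None keeps today's behaviour, while a single machine-config field flips the backing of the mapping that is later registered via KVM_SET_USER_MEMORY_REGION.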

roypat commented 5 months ago

Hey all, we have added support for backing guest memory with 2M hugetlb pages in Firecracker 1.7. Please also see https://github.com/firecracker-microvm/firecracker/pull/4360 and https://github.com/firecracker-microvm/firecracker/blob/main/docs/hugepages.md. I'm keeping this issue open to track that hugepage support is in developer preview for now, so please let us know if you have any feedback on the feature!