Solo5 / solo5

A sandboxed execution environment for unikernels
ISC License

WIP: hvt: x86_64: Allow for up to 4GB guest size #577

Open reynir opened 1 month ago

reynir commented 1 month ago

The following is a commit by @mato. I am creating this draft PR to open it up for review and hopefully eventually have it merged. I may add comments or changes that I find helpful in understanding the code.


Allow for up to 4GB guest size on x86_64 by using up to four PDPTE entries (one page directory per 1GB).

TODO: Lightly tested only, not sure if this arrangement will conflict with any platform memory holes that "plain" KVM may map into guest memory.
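
For orientation, a rough, self-contained sketch of the layout that commit message describes: one PML4 entry pointing at a PDPT, up to four PDPT entries each pointing at its own page directory of 2 MiB large pages, identity-mapping up to 4 GiB of guest memory. All names, constants and base addresses below are illustrative assumptions, not the tender's actual code.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE        0x1000ULL       /* 4 KiB paging-structure pages   */
#define LARGE_PAGE_SIZE  0x200000ULL     /* 2 MiB large pages              */
#define PTE_P            0x001ULL        /* present                        */
#define PTE_RW           0x002ULL        /* writable                       */
#define PTE_PS           0x080ULL        /* PS bit: 2 MiB page in a PDE    */

/* Hypothetical guest-physical placement of the paging structures. */
#define PML4_BASE        0x1000ULL
#define PDPT_BASE        0x2000ULL
#define PD_BASE          0x3000ULL       /* up to four consecutive pages   */

/* mem points at the start of guest memory; mem_size is assumed to be a
 * multiple of 2 MiB and at most 4 GiB. */
static void setup_pagetables(uint8_t *mem, uint64_t mem_size)
{
    uint64_t *pml4 = (uint64_t *)(mem + PML4_BASE);
    uint64_t *pdpt = (uint64_t *)(mem + PDPT_BASE);
    uint64_t *pde  = (uint64_t *)(mem + PD_BASE);
    uint64_t paddr;

    memset(pml4, 0, PAGE_SIZE);
    memset(pdpt, 0, PAGE_SIZE);
    memset(pde,  0, 4 * PAGE_SIZE);      /* room for four page directories */

    /* A single PML4 entry covers the whole range we care about. */
    pml4[0] = PDPT_BASE | PTE_P | PTE_RW;

    /* One PDPT entry per 1 GiB of guest memory, each pointing at its own
     * page directory; the PDs are contiguous, so the PDEs below can be
     * filled as one flat array of up to 2048 entries. */
    for (paddr = 0; paddr < mem_size; paddr += (1ULL << 30))
        pdpt[paddr >> 30] =
            (PD_BASE + (paddr >> 30) * PAGE_SIZE) | PTE_P | PTE_RW;

    /* Identity-map the guest with 2 MiB large pages. */
    for (paddr = 0; paddr < mem_size; paddr += LARGE_PAGE_SIZE)
        pde[paddr >> 21] = paddr | PTE_P | PTE_RW | PTE_PS;
}
```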

Kensan commented 1 month ago

I reviewed the construction of the new page table mappings and they look good to me.

While looking at the construction of hvt->mem I noticed that one assumption is that the underlying memory is zero initialised.

hannesm commented 1 month ago

Interesting observation. I have used this patch for some unikernels on FreeBSD, and it worked nicely. But I don't quite understand the semantics: the FreeBSD mmap(2) man page doesn't guarantee zeroed-out memory (not even for MAP_ANONYMOUS).

tenders/hvt/hvt_freebsd.c: hvt->mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_SHARED, hvb->vmfd, 0);

tenders/hvt/hvt_kvm.c: hvt->mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);

tenders/hvt/hvt_openbsd.c: p = mmap(NULL, vmr->vmr_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);

So: (a) why does FreeBSD use the vmfd (/dev/vmm/solo5-)? (b) Why does OpenBSD use MAP_PRIVATE (instead of MAP_SHARED)? (c) Is it worth unifying the flags across the platforms?
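
As an aside, if one did not want to rely on the platform's zero-fill behaviour at all, a defensive variant could zero the mapping explicitly after mmap(). A minimal sketch, not code from any of the tenders (error handling reduced to a NULL return):

```c
#include <string.h>
#include <sys/mman.h>

/* Map guest memory and zero it explicitly instead of relying on the
 * platform handing back zero-filled pages. */
static void *map_guest_memory(size_t mem_size)
{
    void *mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return NULL;
    memset(mem, 0, mem_size);   /* portable, but touches every page */
    return mem;
}
```

The obvious downside is that the memset touches (and thus allocates) every page up front, defeating lazy allocation, so relying on the documented zero-fill guarantee discussed below is likely the better trade-off.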

reynir commented 1 month ago

I think this is ready to review. I added a minor comment explaining why X86_PDE_SIZE is now four times as large. I think there could be some explanation of how we build these tables, and maybe a reference to their documentation. Then again, maybe it is self-evident if you're already familiar with them. Maybe @Kensan has an opinion?

Another question is "why 4 GB?" -- and I think the answer is partly for simplicity, and maybe partly because this is the limit for aarch64 (for reasons I don't think apply to x86).

Kensan commented 3 weeks ago

> Interesting observation. I have used this patch for some unikernels on FreeBSD, and it worked nicely. But I don't quite understand the semantics: the FreeBSD mmap(2) man page doesn't guarantee zeroed-out memory (not even for MAP_ANONYMOUS).

According to David Chisnall, the current usage guarantees zeroized memory; see this thread.

Kensan commented 3 weeks ago

> I think this is ready to review. I added a minor comment explaining why X86_PDE_SIZE is now four times as large. I think there could be some explanation of how we build these tables, and maybe a reference to their documentation. Then again, maybe it is self-evident if you're already familiar with them. Maybe @Kensan has an opinion?

The most helpful reference would be the Intel SDM, Volume 3A, Section 4.5 "4-Level Paging and 5-Level Paging", in particular Figure 4-9, "Linear-Address Translation to a 2-MByte Page using 4-Level Paging". I think it is helpful to have this reference somewhere; if not in the code, then at least in one of the commit messages, though a comment would probably be preferable.
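
For readers without the SDM at hand, the decomposition that figure shows is roughly the following (purely illustrative, not code from the tender):

```c
#include <stdint.h>

/* How a linear address is split under 4-level paging with 2 MiB pages. */
struct lin_addr_2mb {
    unsigned pml4_idx;   /* bits 47..39: index into the PML4           */
    unsigned pdpt_idx;   /* bits 38..30: index into the PDPT           */
    unsigned pd_idx;     /* bits 29..21: index into a page directory   */
    uint64_t offset;     /* bits 20..0 : offset within the 2 MiB page  */
};

static struct lin_addr_2mb split_2mb(uint64_t va)
{
    struct lin_addr_2mb a = {
        .pml4_idx = (unsigned)((va >> 39) & 0x1ff),
        .pdpt_idx = (unsigned)((va >> 30) & 0x1ff),
        .pd_idx   = (unsigned)((va >> 21) & 0x1ff),
        .offset   = va & 0x1fffff,
    };
    return a;
}
```

With the guest identity-mapped below 4 GiB, pml4_idx is always 0 and pdpt_idx simply selects one of the (up to) four page directories.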

> Another question is "why 4 GB?" -- and I think the answer is partly for simplicity, and maybe partly because this is the limit for aarch64 (for reasons I don't think apply to x86).

I do not know what the reasoning was, but probably that it is a nice round number and "should be enough for almost anyone" (tm). On x86_64 it is easy to add another 1 GB by simply adding another fully populated Page Directory (PD).
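
As a back-of-the-envelope check of that (constant names below are mine, not the tender's):

```c
/* One page directory is a single 4 KiB page holding 512 8-byte PDEs, and
 * each PDE maps a 2 MiB large page, so a fully populated PD covers 1 GiB.
 * Four such PDs (16 KiB of PDEs in total) cover 4 GiB, which is presumably
 * also why X86_PDE_SIZE is four times its old value. */
#define PDES_PER_PD   (4096ULL / 8ULL)                /* 512   */
#define PD_COVERAGE   (PDES_PER_PD * 0x200000ULL)     /* 1 GiB */
#define GUEST_MAX     (4ULL * PD_COVERAGE)            /* 4 GiB */
```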

Since there were no real code changes, my conclusion from the first review still stands, and from my side I think it is looking good.