reynir opened 1 month ago
I reviewed the construction of the new page table mappings and they look good to me.
While looking at the construction of hvt->mem, I noticed that one assumption is that the underlying memory is zero-initialised. In hvt_kvm.c and hvt_openbsd.c this is the case, since mmap with MAP_ANONYMOUS is guaranteed to return zeroed memory. In hvt_freebsd.c, the mmap call only uses MAP_SHARED. Maybe somebody familiar with FreeBSD can comment on the guarantees that are given in this case.

Interesting observation. I have used this patch for some unikernels on FreeBSD, and it worked nicely. But I don't quite understand the semantics (the FreeBSD mmap(2) man page doesn't guarantee zeroed-out memory, not even for MAP_ANONYMOUS):
tenders/hvt/hvt_freebsd.c: hvt->mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_SHARED, hvb->vmfd, 0);
tenders/hvt/hvt_kvm.c: hvt->mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
tenders/hvt/hvt_openbsd.c: p = mmap(NULL, vmr->vmr_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
So, why does (a) FreeBSD use the vmfd (/dev/vmm/solo5-
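If one did not want to rely on the kernel's zero-fill behaviour at all, a defensive variant is sketched below. This is not the Solo5 code; map_guest_memory, fd, and mem_size are illustrative stand-ins for the hvt_freebsd.c context.

```c
#include <err.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Minimal sketch: map guest memory backed by the vmm device fd and
 * clear it explicitly, so the tender does not depend on the kernel's
 * zero-fill guarantees for MAP_SHARED mappings.
 */
static void *map_guest_memory(int fd, size_t mem_size)
{
    void *mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED)
        err(1, "mmap");
    memset(mem, 0, mem_size);   /* force zeroed contents */
    return mem;
}
```

(The memset has a real cost for large guests, since it touches every page up front; that is presumably why relying on the kernel's zero-fill behaviour is attractive.)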
I think this is ready to review. I added a minor comment on why X86_PDE_SIZE is now four times as large. I think there could be some explanation of how we build these tables, and maybe a reference to their documentation. Then again, maybe it is self-evident if you're familiar with them already. Maybe @Kensan has an opinion?
Another question is "why 4 GB?" -- and I think the answer is partly for simplicity, and maybe partly because this is the limit for aarch64 (for reasons I don't think apply to x86).
> Interesting observation. I have used this patch for some unikernels on FreeBSD, and it worked nicely. But I don't quite understand the semantics (the FreeBSD mmap(2) man page doesn't guarantee zeroed-out memory, not even for MAP_ANONYMOUS).
According to David Chisnall, the current usage guarantees zeroized memory, see this thread.
> I think this is ready to review. I added a minor comment on why X86_PDE_SIZE is now four times as large. I think there could be some explanation of how we build these tables, and maybe a reference to their documentation. Then again, maybe it is self-evident if you're familiar with them already. Maybe @Kensan has an opinion?
The most helpful reference would be Intel SDM Volume 3A, section 4.5 "4-Level Paging and 5-Level Paging", and Figure 4-9, "Linear-Address Translation to a 2-MByte Page using 4-Level Paging". I think it is helpful to have this reference somewhere. If not in the code, then at least in one of the commit messages, but a comment would probably be preferable.
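To make the structure concrete, here is a minimal sketch of the 4-level construction being discussed: one PML4 entry pointing to a PDPT whose first four entries point to four fully populated Page Directories of 2 MiB mappings. The constants and table offsets are assumptions for illustration, not the actual Solo5 definitions.

```c
#include <stdint.h>

#define PTE_PRESENT  (1ULL << 0)
#define PTE_RW       (1ULL << 1)
#define PTE_PS       (1ULL << 7)   /* in a PDE: maps a 2 MiB page */

/*
 * Identity-map the first 4 GiB of guest-physical memory with 2 MiB
 * pages: PML4[0] -> PDPT, PDPT[0..3] -> four PDs, each PD holding
 * 512 x 2 MiB = 1 GiB of mappings. Offsets are illustrative and
 * double as the guest-physical addresses of the tables.
 */
static void setup_pagetables(uint8_t *mem, uint64_t pml4_off,
                             uint64_t pdpt_off, uint64_t pde_off)
{
    uint64_t *pml4 = (uint64_t *)(mem + pml4_off);
    uint64_t *pdpt = (uint64_t *)(mem + pdpt_off);
    uint64_t *pde  = (uint64_t *)(mem + pde_off); /* 4 contiguous pages */
    uint64_t paddr = 0;

    pml4[0] = pdpt_off | PTE_PRESENT | PTE_RW;
    for (int i = 0; i < 4; i++)
        pdpt[i] = (pde_off + i * 4096) | PTE_PRESENT | PTE_RW;
    for (int i = 0; i < 4 * 512; i++, paddr += 0x200000)
        pde[i] = paddr | PTE_PRESENT | PTE_RW | PTE_PS;
}
```

This matches Figure 4-9: bits 47:39 of a linear address index the PML4, bits 38:30 the PDPT, bits 29:21 the PD, and with PTE_PS set the remaining 21 bits are the offset into a 2 MiB page.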
> Another question is "why 4 GB?" -- and I think the answer is partly for simplicity, and maybe partly because this is the limit for aarch64 (for reasons I don't think apply to x86).
I do not know what the reasoning was, but probably that it is a nice round number and "should be enough for almost anyone" (tm). On x86_64 it is easy to add another 1 GiB by simply adding another, fully populated Page Directory (PD).
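On the fourfold X86_PDE_SIZE, the arithmetic could be written down as follows. This is a hypothetical reconstruction; the actual macro in the tree may be defined differently.

```c
/* One PDE maps 2 MiB; one 4 KiB PD page holds 512 PDEs, i.e. 1 GiB.
 * Covering a 4 GiB guest therefore needs four fully populated PDs,
 * so the space reserved for PDEs grows fourfold. */
#define PDE_ENTRIES   512
#define PD_MAP_SIZE   (PDE_ENTRIES * 0x200000ULL)   /* 1 GiB per PD   */
#define GUEST_MAX     (4ULL << 30)                  /* 4 GiB          */
#define NUM_PDS       (GUEST_MAX / PD_MAP_SIZE)     /* = 4            */
#define X86_PDE_SIZE  (NUM_PDS * 4096)              /* 16 KiB of PDEs */
```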
Since there were no real code changes, my conclusion from the first review still stands and from my side I think it is looking good.
The following is a commit done by @mato. I am creating this draft PR to open up for review and hopefully eventually have it merged. I may add comments or changes that I find helpful in understanding the code.
Allow for up to 4GB guest size on x86_64 by using up to four PDE entries.
TODO: Lightly tested only, not sure if this arrangement will conflict with any platform memory holes that "plain" KVM may map into guest memory.