TUD-OS / NRE

NOVA runtime environment (official branch)
GNU General Public License v2.0

Unable to use HPA >= 4GB when vTLB is active #33

Open udosteinberg opened 11 years ago

udosteinberg commented 11 years ago

This is not really an NRE bug, but something that NRE needs to be aware of, since we have seen 64-bit NRE use high addresses for VMs, which has exposed the problem...

For 32-bit guests, the vTLB uses a 2-level shadow page table with 4-byte PTEs. Even though a 64-bit VMM can install a GPA-to-HPA mapping where HPA >= 4GB, the vTLB cannot store HPA wider than 32-bit in its shadow PTEs. The most recent version of the microhypervisor catches such cases and terminates the vCPU.

To be able to make use of HPA beyond 4GB when the vTLB is active, the microhypervisor would have to use a 3-level PAE shadow page table with 8-byte PTEs, which has performance implications.
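
To illustrate the width limitation, here is a minimal sketch (not NOVA code; the names are made up) of what the two shadow PTE formats can and cannot encode:

```cpp
#include <cstdint>

// Illustration only: a legacy 4-byte PTE stores the frame address in bits
// 31..12, so a host-physical address at or above 4GB cannot be encoded.
// An 8-byte PAE PTE is limited only by the CPU's physical-address width.
static bool fits_in_legacy_pte(uint64_t hpa)
{
    return (hpa >> 32) == 0;            // frame must lie below 4GB
}

static bool fits_in_pae_pte(uint64_t hpa, unsigned phys_bits)
{
    return (hpa >> phys_bits) == 0;     // phys_bits = MAXPHYADDR from CPUID
}
```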

For the time being, NRE should avoid using HPA >= 4GB for VMs that use the vTLB.

The vTLB is used when "vtlb" is specified on the hypervisor command line. Otherwise it is used only when the CPU does not support hardware-assisted nested paging (EPT/NPT).

hrniels commented 11 years ago

Ok, thanks for the notice and the explanation. I'll take a look at that if I have time.

udosteinberg commented 11 years ago

A patch for this is in the vtlb branch of my tree. @Nils-TUD @parthy feel free to give this a spin. You should boot the hypervisor with the "vtlb" parameter to force vTLB operation and map some host-physical region beyond 4GB into the guest (obviously you'd need a 64-bit host system for that). Note that this causes 3-level PAE paging to be used instead of 2-level paging, so I would expect some performance impact. It would be good to run a benchmark with vTLB enabled, once with and once without the patch, to see how much overhead we add for this.

parthy commented 11 years ago

I ran a single-core kernel compile today with vanilla NRE (without the hypervisor changes) and the two vTLB versions. I got 522.4s for the old version and 562.6s for the new one. That is around 7.7% overhead introduced by the patch. It would be interesting to see some other machines' results, too.

blitz commented 11 years ago

What machine are you using? If you configure a boot entry on erwin, I can test it on Ivy Bridge.

parthy commented 11 years ago

I'm on a Sandy Bridge. I'm not sure I can provide such a boot entry easily. I just build NRE with HAVE_KERNEL_EXTENSIONS set to 0 and change the pulsar config it generates when building boot/vancouver-kernelbuild.wv to use the official NOVA hypervisor binary. In NOVA, I change the aux field to be an mword in include/hip.hpp and build it as a 64-bit binary, once from master and once from the vtlb branch. I'm sure you will get this running a lot faster than I would by figuring out how to set up the boot entry and copying all the stuff over (which I can only do from home, in the evening).

udosteinberg commented 11 years ago

@blitz - it would be more interesting to run this on an old machine, because for the new ones with EPT, you wouldn't want to use vTLB anyway. I'd be more interested in performance numbers for P4, Yonah, Merom, Penryn machines. AFAIR Carsten had a P4-based Presler machine (CPUID F:6:2), where vTLB would be mandatory. Nonetheless, performance numbers for newer machines are welcome as well, to get a wider picture.

blitz commented 11 years ago

@udosteinberg I'll see what I can find. @parthy Can you mail me the pulsar config and all the binaries? (Yes, I am lazy.)

Nils-TUD commented 11 years ago

I've just tested it with my i5:

 [ 0] CORE:0:0:0 6:25:5:1 [2] Intel(R) Core(TM) i5 CPU         650  @ 3.20GHz

The old version needs 615s and the new version 656s. That's an overhead of 6.67%.

blitz commented 11 years ago

So far I only got it to work on my laptop (i7 L640: 770s -> 819s, ~6.4%). For vtlb-only boxes, I could only find a Pentium D, but I couldn't get the benchmark to run right away. Working on it.

udosteinberg commented 11 years ago

Based on the reported numbers so far, overhead seems to be in the 6-8 percent range. Because that is significant, I would rather not enable PAE for the vTLB by default. We could make it a compile-time option. What do others think?

blitz commented 11 years ago

Is it possible to make it a runtime option?

udosteinberg commented 11 years ago

Sure, but it would inflate the binary because the code would have to include both vTLB versions. We could then even go as far as using the 2-level vTLB for VMs with HPA below 4GB and the 3-level vTLB for those with HPA above 4GB. I'm reluctant to do that because it results in non-deterministic VM performance for all sorts of benchmarks.

blitz commented 11 years ago

On 08/03/2013 01:51 AM, Udo Steinberg wrote:

Sure, but it would inflate the binary because the code would have to include both vTLB versions.

It would be nice if NOVA Just Works™ even on old boxes. Maybe have the slow (but working) variant be the default and print a big warning that a faster, but restricted, variant is available via a compile-time switch?

Julian

parthy commented 11 years ago

It's not that you can't use NOVA on those machines ;) You just cannot assign memory beyond 4GiB (physical) to the guest. It is a limitation that you can work around, not a killer. So to me it would make sense to do it the other way round: only have a special case if you want to compile NOVA for use on such an older machine with >4GiB RAM. Maybe you could have a compile switch saying "build and use both vTLB versions"? With that, the default version advocated by the README would still be deterministic in this sense, but if you choose to overcome the limitation, you get to use higher addresses and only pay the performance overhead where necessary, at the cost of compiling the kernel differently.

blitz commented 11 years ago

It only works if a) the box has no memory beyond 4G or b) the userspace is aware of this limitation and will not use memory beyond 4G for VMs.

udosteinberg commented 11 years ago

Another option would be to configure the vTLB as follows:

- 32-bit hypervisor: 2-level vTLB (4-byte PTEs)
- 64-bit hypervisor: 3-level PAE vTLB (8-byte PTEs)

This would work because the 32bit API does not support addresses beyond 4GB anyway. So 32bit vTLB will be ~7% faster than 64bit vTLB, and performance will be deterministic for all VMs. Another benefit is that this configuration would exercise both vTLB variants regularly.
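
A rough sketch of how such a selection could look at compile time (illustrative only; the identifiers are assumptions, not the actual NOVA sources):

```cpp
#include <cstdint>

// Illustrative compile-time selection: a 32-bit hypervisor keeps the 2-level
// vTLB with 4-byte PTEs, a 64-bit hypervisor always uses the 3-level PAE vTLB
// with 8-byte PTEs.
#ifdef __x86_64__
typedef uint64_t vtlb_pte_t;      // PAE format, can hold HPA >= 4GB
enum { VTLB_LEVELS = 3 };
#else
typedef uint32_t vtlb_pte_t;      // legacy format, HPA limited to < 4GB
enum { VTLB_LEVELS = 2 };
#endif
```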

blitz commented 11 years ago

Sounds reasonable.

udosteinberg commented 11 years ago

@parthy @blitz @Nils-TUD - Could you rerun the previous kernel compile benchmark using the updated vtlb branch? Same configuration as last time: 64bit hypervisor, vTLB force-enabled. I would expect overhead to come down a bit.

blitz commented 11 years ago

I had to increase kernel memory to avoid running out of memory during NRE bootstrap, for both versions. Weird. Overhead stays at roughly 6%. I can't see a difference compared to the older patch.

udosteinberg commented 11 years ago

The older patch did not flag the guest's global pages correctly, so address-space switches would flush them from the vTLB. Now that we can keep them, I would expect overhead to decrease. But maybe Linux's use of global pages is not very significant.
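
For context, a sketch of the check in question (illustrative, not the actual vTLB code), based on the architectural global bit (bit 8) of a guest PTE:

```cpp
#include <cstdint>

// Illustration only: on a guest address-space switch (CR3 reload), vTLB entries
// derived from guest PTEs with the global bit set may be kept, just like a
// hardware TLB keeps global translations across CR3 reloads.
static const uint32_t PTE_GLOBAL = 1u << 8;     // architectural G bit

static bool survives_address_space_switch(uint32_t guest_pte)
{
    return (guest_pte & PTE_GLOBAL) != 0;
}
```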

parthy commented 11 years ago

I can see a slight but unstable improvement of 0.3-1%, so I would not expect a noticeable difference. At least not in a kernel compile; maybe we should try a different workload as well?

udosteinberg commented 11 years ago

It should be quite a bit more noticeable if you run L4/Fiasco + Pingpong in a VM on top of NOVA. Especially the Inter-AS benchmark should see some improvement.

Nils-TUD commented 11 years ago

For me it doesn't make any difference. I've run the kernel-compile-test twice and both times it took 656s, i.e. the same time as with the previous vtlb-version.

udosteinberg commented 11 years ago

I've just pushed another version of the vtlb branch. This time it should really make a difference.

parthy commented 11 years ago

With this version, I get almost exactly the same results as without PAE.

Nils-TUD commented 11 years ago

The same here. Without PAE it takes 615s, with PAE 618s :)

Nils-TUD commented 11 years ago

The patch seems to have introduced a bug, though. If I try to boot Escape, it hangs in an endless vTLB-miss loop @ 0xc0131623 (see nre/dist/imgs/escape.bin). This is the instruction directly after the one that enables paging. This does not occur with the master branch of NOVA.

udosteinberg commented 11 years ago

How do I get the escape binaries? They are not in my tree.

Nils-TUD commented 11 years ago

Just execute ./dist/download.sh in the directory nre.

Nils-TUD commented 11 years ago

If it matters: I've tested it with QEMU, for example via ./b qemu boot/vmmng.

udosteinberg commented 11 years ago

For some reason NRE does not seem to find its ISO image in the file system...

[9] RESET device state
[9] reset CPU from 9 mtr_in e0010
[9] >   bool VirtualBiosMultiboot::receive(MessageBios&) rip ffff ilen 0 cr0 10 efl 2
[9]     module 0 start 0000:0004:01c0:0000+5f492 cmdline          escape/escape.bin videomode=vga
[9]     module 1 start 0000:0004:01c0:2000+161ea cmdline escape/escape_romdisk.bin /dev/romdisk escape/escape.iso
[9]     module 2 start 0000:0004:01c1:9000+16319 cmdline           escape/escape_rtc.bin /dev/rtc
[9]     module 3 start 0000:0004:01c3:0000+22b35 cmdline escape/escape_fs.bin /dev/fs /dev/romdisk iso9660
[9]     module 4 start 0000:0004:01c5:3000+74d000 cmdline                        escape/escape.iso
[9] #   Initializing dynarray...                                                    done
[9] # |   Initializing SMP...1 CPUs found                                             done
[9] # |   Initializing GDT...                                                         done
[9] # |   Initializing CPU...Detected 3296 Mhz CPU                                    done
[9] # |   Initializing FPU...                                                         done
[9] # |   Initializing VFS...                                                         done
[9] # |   Initializing event system...                                                done
[9] # |   Initializing processes...                                                   done
[9] # |   Initializing scheduler...                                                   done
[9] # |   Initializing terminator...                                                  done
[9] # |   Start logging to VFS...                                                     done
[9] # |   Initializing virtual memory-management...                                   done
[9] # |   Initializing copy-on-write...                                               done
[9] # |   Initializing interrupts...                                                  done
[9] # |   Initializing PIC...                                                         done
[9] # |   Initializing IDT...                                                         done
[9] # |   Initializing timer...                                                       done
[9] # |   Initializing signal handling...                                             done
[9] # |   13435 free frames (53740 KiB)
[9] #   Unable to stat 'escape/escape.iso': No such file or directory

The vmconfig file looks like this...

nre/vancouver m:64 ncpu:1 PC_PS2
escape/escape.bin videomode=vga
escape/escape_romdisk.bin /dev/romdisk escape/escape.iso
escape/escape_rtc.bin /dev/rtc
escape/escape_fs.bin /dev/fs /dev/romdisk iso9660
escape/escape.iso

Nils-TUD commented 11 years ago

That's the output of Escape, which isn't able to find the ROM disk, i.e. the ISO image here. But I'm wondering where you got the config file from, because the one in boot/vmmng in the NRE repo looks different. It should be:

rom://bin/apps/vancouver m:64 ncpu:1 vga_fbsize:4096 PC_PS2
rom://dist/imgs/escape.bin
rom://dist/imgs/escape_romdisk.bin /dev/romdisk /system/mbmods/3
rom://dist/imgs/escape_rtc.bin /dev/rtc
rom://dist/imgs/escape_fs.bin /dev/fs /dev/romdisk iso9660
rom://dist/imgs/escape.iso

udosteinberg commented 11 years ago

The patch seems to have introduced a bug, though. If I try to boot Escape, it hangs in an endless vTLB-miss loop @ 0xc0131623 (see nre/dist/imgs/escape.bin). This is the instruction directly after the one that enables paging. This does not occur with the master branch of NOVA.

NOVA gets the TLB miss address from VMCB->exitinfo2 in src/ec_svm.cpp. Since we're running a 32bit guest, I would expect exitinfo2[63..32] = 0. However, I'm seeing upper bits set in QEMU, which causes the problem. If you change the line to truncate cr2 to 32 bit, it seems to work here. Can you confirm that?
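
A minimal sketch of that truncation, with assumed names (not the actual ec_svm.cpp code):

```cpp
#include <cstdint>

// Sketch only: for a 32-bit guest, only the low 32 bits of exitinfo2 form a
// valid fault address, so mask off the rest before handing the address to the
// vTLB miss handler.
static inline uint64_t vtlb_fault_addr(uint64_t exitinfo2, bool guest_64bit)
{
    return guest_64bit ? exitinfo2 : (exitinfo2 & 0xffffffffu);
}
```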

It would be interesting to see what real AMD HW does. Obviously I don't have any here. Or if anyone can dig up the relevant part of the SVM spec that clarifies what happens to the high bits of exitinfo2 in 32bit mode, that would also help.

Nils-TUD commented 11 years ago

Interesting. It's the same here on QEMU. Unfortunately, I don't have an AMD box either.

udosteinberg commented 11 years ago

FWIW, NRE + Escape works on Intel CPUs with vTLB. So it's clearly related to the SVM interface code. It shouldn't have anything to do with the recent vTLB changes and older versions should show the same symptoms.

udosteinberg commented 11 years ago

According to the recollection of an SVM architect, the intercepted page-fault VM exit should write to all 64 bits of the exitinfo2 field in the VMCB. Could someone with AMD hardware please verify that this is the case?

Just add assert (cr2 >> 32 == 0); to ec_svm.cpp:63, build a hypervisor image with assertions enabled via make -C build ARCH=x86_64 DEFINES=DEBUG and run a 32bit Escape guest on a 64bit hypervisor with VTLB force-enabled.

It will blow up in QEMU, and the question is whether it survives on AMD hardware. If it does, then we may need to file a QEMU bug.

blitz commented 11 years ago

I'll look for one and report back.

Nils-TUD commented 11 years ago

I've just seen that Björn has an AMD box here:

[ 0] CORE:0:0:0 10:2:3:0 [0] AMD Phenom(tm) 8450 Triple-Core Processor

So I've tested it on that machine, and it works, i.e. the upper bits of cr2 are always zero.

blitz commented 11 years ago

So this is a qemu bug?

udosteinberg commented 11 years ago

If the very same setup works on real HW and does not work in QEMU, then it is indeed a QEMU bug. Interestingly enough, the problem is only exposed by Escape running as the guest OS. I have not spent the time figuring out why, but I suspect that Escape causes some VM exit that sets high bits in exitinfo2, and those bits then remain set during vTLB-related page-fault exits. Linux does not expose the issue.

So you really need to test identical setups with Escape as guest OS.

udosteinberg commented 11 years ago

It's not some other VM exit, it's the page-fault exit itself. With Escape I'm seeing several instances of...

[ 1] VM exit 0x4e set exitinfo2 to 0x1001316b0
[ 1] VM exit 0x4e set exitinfo2 to 0x100126dc0

The relevant code in QEMU seems to be

    if (env->intercept_exceptions & (1 << EXCP0E_PAGE)) {
        /* cr2 is not modified in case of exceptions */
        stq_phys(env->vm_vmcb + offsetof(struct vmcb, control.exit_info_2),
                 addr);
    }

in qemu/target-i386/helper.c.

Nils-TUD commented 11 years ago

I guess the reason why it does not occur with e.g. Linux is that Linux does not use the "GDT trick" that Escape does: Escape configures the GDT to have a base address of 0x40000000, so that the virtual address 0xC0000000 ends up at physical address 0x0. If you look at the addresses in exitinfo2 you see that bit 32 is set: 0x00000001001316b0. So QEMU doesn't wrap the overflow at 32 bits as it should for a 32-bit guest.
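
A small worked example of that overflow (the virtual address is inferred from the exitinfo2 value in the log above, assuming the 0x40000000 segment base; the 32-bit wrap is what the guest should see):

```cpp
#include <cassert>
#include <cstdint>

// With a segment base of 0x40000000, the faulting virtual address 0xc01316b0
// yields base + offset = 0x1001316b0. A 32-bit guest should see this wrapped
// to 32 bits (0x001316b0), but QEMU stores the unwrapped value in exitinfo2.
int main()
{
    uint64_t seg_base  = 0x40000000;
    uint64_t virt_addr = 0xc01316b0;
    uint64_t unwrapped = seg_base + virt_addr;              // what QEMU reports
    uint32_t wrapped   = static_cast<uint32_t>(unwrapped);  // correct 32-bit wrap
    assert(unwrapped == 0x1001316b0ULL);
    assert(wrapped   == 0x001316b0u);
    return 0;
}
```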

udosteinberg commented 11 years ago

Yep, that sounds like a reasonable explanation for what we're seeing. So feel free to open a QEMU bug for this and maybe put a link here for future reference.

Nils-TUD commented 11 years ago

I've filed a bug: https://bugs.launchpad.net/qemu/+bug/1211910