Gbps / gbhv

Simple x86-64 VT-x Hypervisor with EPT Hooking
Creative Commons Attribution 4.0 International

VMLaunch hangs when enabling rdtsc exiting #25

Open 3216alec opened 3 years ago

3216alec commented 3 years ago

When I enable rdtsc exiting within the primary processor controls of the VMCS, my entire system hangs (tested on both VM & baremetal).

I am guessing that there is some conflict within the VMCS with rdtsc exiting and some other control, but I have read through the Intel SDM and looked for all references of the rdtsc exiting field, and there seems to be no information on control conflicts or requirements for rdtsc exiting other than that rdtsc offsetting must be disabled (which it is).
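For reference, a minimal sketch of how the control in question is usually computed. It assumes the Intel SDM layout: "RDTSC exiting" is bit 12 of the primary processor-based VM-execution controls, and the matching capability MSR (IA32_VMX_PROCBASED_CTLS or its TRUE variant) reports must-be-one bits in its low half and may-be-one bits in its high half. The helper names are hypothetical and are not taken from gbhv; the MSR value is passed in as a plain integer so the logic can be checked outside a driver.

```c
#include <stdint.h>

/* Bit 12 of the primary processor-based VM-execution controls (Intel SDM). */
#define CPU_BASED_RDTSC_EXITING (1u << 12)

/*
 * Hypothetical helper: reconcile a desired control value with a VMX
 * capability MSR value.
 *   bits 31:0  -> a set bit means the control must be 1
 *   bits 63:32 -> a set bit means the control may be 1
 */
static uint32_t adjust_controls(uint32_t desired, uint64_t capability_msr)
{
    uint32_t must_be_one = (uint32_t)capability_msr;
    uint32_t may_be_one  = (uint32_t)(capability_msr >> 32);

    return (desired | must_be_one) & may_be_one;
}

/* Returns non-zero if the processor allows RDTSC exiting to be set to 1. */
static int rdtsc_exiting_supported(uint64_t capability_msr)
{
    return ((capability_msr >> 32) & CPU_BASED_RDTSC_EXITING) != 0;
}
```

If the adjusted value still has bit 12 set and VM entry succeeds, the control itself is not the problem, which is consistent with how this thread turns out.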

I have hit my experience cap when it comes to hypervisor development, and I am searching for further assistance. Thanks. Screenshot_2

Gbps commented 3 years ago

Hey there. I appreciate the detailed issue report. I will give it some thinking and try to reproduce it when I get a chance this week and get back to you.

My thought is that there are some interrupt/IRQL issues afoot, since rdtsc is pretty integral to scheduling. It isn't clear to me why it wouldn't give you an exit though, so I'll take a look and see if I can get a debug breakpoint at a good time.

I'll get back to you with more info!

On Mon, May 3, 2021, 11:58 PM 3216alec @.***> wrote:

When I enable rdtsc exiting within the primary processor controls of the VMCS, my entire system hangs (tested on both VM & baremetal).

  • I have tried using a fresh download of gbhv to rule out any of my changes to the project.
  • I have created a handler for rdtsc exiting, which never gets triggered. In fact, no exits occur overall. (So vmlaunch is definitely hanging/erroring??)
  • I have tried installing a debug break directly before the vmlaunch and tracing the vmlaunch, however my entire system hangs and WinDbg cannot trace through the vmlaunch.
  • I have downloaded a separate hypervisor (HyperPlatform) and enabled rdtsc exiting, which worked as intended

I am guessing that there is some conflict within the VMCS with rdtsc exiting and some other control, but I have read through the Intel SDM and looked for all references of the rdtsc exiting field, and there seems to be no information on control conflicts or requirements for rdtsc exiting other than that rdtsc offsetting must be disabled (which it is).

I have hit my experience cap when it comes to hypervisor development, and I am searching for further assistance. Thanks. [image: Screenshot_2] https://user-images.githubusercontent.com/47839087/116959219-f57c3180-ac6a-11eb-86b5-8a70fc0d608f.png


Sherman0236 commented 3 years ago

I was able to replicate the issue within VMware and create + dump a snapshot while the system was stuck. I got this interesting result: https://pastebin.com/X9vgqfPn It appears to have something to do with NMI & trapping. I will add an NMI handler and get back with the results soon™.

Screenshot_3

Sherman0236 commented 3 years ago

I was able to fix this in a super crude manner and this definitely needs more testing, but I will leave that up to you guys. It is likely something to do with IRQL. I found that every hang I encountered occurs upon a debug break (__debugbreak). The nt logging functions (DbgPrint, vDbgPrintExWithPrefix, etc.) all trigger a debug break. I also encountered the same issue when triggering a debug break within my own code. My jury-rigged solution is simple: just don't trigger any debug breaks. I did this by commenting out all of the offending HvUtilLogDebug function calls. There were 3 throughout the entire file. (See attached image.)

I then implemented rdtsc(p) handlers and voilà.
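A rough sketch of what such handlers typically do, for readers following along. The register structure and function names are hypothetical (gbhv's real context layout may differ), and the TSC value, IA32_TSC_AUX value and instruction length are passed in as parameters. In a real exit handler these would presumably come from reading the TSC, reading the MSR, and the VM-exit instruction length field of the VMCS.

```c
#include <stdint.h>

/* Hypothetical guest register snapshot; not gbhv's actual structure. */
typedef struct {
    uint64_t rax;
    uint64_t rcx;
    uint64_t rdx;
    uint64_t rip;
} guest_regs_t;

/*
 * Emulate RDTSC: EDX:EAX receive the 64-bit counter, the upper halves of
 * RAX and RDX are cleared, and RIP is advanced past the instruction.
 */
static void handle_rdtsc(guest_regs_t *regs, uint64_t tsc, uint64_t instr_len)
{
    regs->rax = tsc & 0xFFFFFFFFu;
    regs->rdx = tsc >> 32;
    regs->rip += instr_len;
}

/* RDTSCP behaves like RDTSC and additionally loads ECX with IA32_TSC_AUX. */
static void handle_rdtscp(guest_regs_t *regs, uint64_t tsc,
                          uint32_t tsc_aux, uint64_t instr_len)
{
    handle_rdtsc(regs, tsc, instr_len);
    regs->rcx = tsc_aux;
}
```

Note that nothing in these handlers logs or calls into the OS, which matters for the hang discussed in this thread.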

So in conclusion, the hypervisor causes a hang while rdtsc exiting is enabled if a debug break is triggered. I am not entirely sure why this occurs (probably going right over my head). I can tell that the system is stuck in a while loop containing a pause (spin lock?) as seen in the image provided in my previous post. However, why it gets stuck there is something I have not looked into. I would agree with @Gbps and think that it has something to do with scheduling.

I would love to see the proper solution and an explanation so that I could learn what precisely is going wrong. However, this is very quickly becoming too complicated for my existing skill level in hypervisor development. I am still utterly confused as to why other hypervisors do not have this issue when I enable rdtsc exiting within them.

Thanks and I hope this helps!

Screenshot_4

tandasat commented 3 years ago

Calling NT functions within the host (hypervisor) context is not a good idea because the processor state is a bit unusual, in particular because interrupts are disabled (security/isolation concerns aside). The attached screenshot indicates the processor is sending an IPI to the other processor(s). This can be blocked while the other processor is in the host context, and will remain so indefinitely if two processors enter the same path in the host context and wait for each other.

HyperPlatform works because it does not call the DbgPrint family within the host; instead, log messages are printed asynchronously by a separate thread. Hvpp does essentially the same with ETW. Not using the DbgPrint family (and other NT APIs) in the host would be the right solution.
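To illustrate the general idea of deferred logging, here is a minimal sketch in portable C11. It is not HyperPlatform's or hvpp's actual implementation, and every name in it is hypothetical. It assumes one ring per logical processor, so the exit handler is the only producer and an ordinary worker thread is the only consumer. A Windows driver would likely use its own interlocked primitives rather than <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define LOG_SLOTS   64u  /* must be a power of two */
#define LOG_MSG_LEN 96u

/* Single-producer, single-consumer ring of fixed-size messages. */
typedef struct {
    char msgs[LOG_SLOTS][LOG_MSG_LEN];
    atomic_uint head;  /* next slot to write, advanced only by the producer */
    atomic_uint tail;  /* next slot to read, advanced only by the consumer */
} log_ring_t;

/*
 * Producer side, intended for host context: no locks, no waiting and no OS
 * calls. Returns 0 and drops the message when the ring is full.
 */
static int log_push(log_ring_t *ring, const char *msg)
{
    unsigned head = atomic_load_explicit(&ring->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&ring->tail, memory_order_acquire);
    char *slot;

    if (head - tail == LOG_SLOTS)
        return 0;

    slot = ring->msgs[head & (LOG_SLOTS - 1)];
    strncpy(slot, msg, LOG_MSG_LEN - 1);
    slot[LOG_MSG_LEN - 1] = '\0';

    /* Publish the slot only after its contents are written. */
    atomic_store_explicit(&ring->head, head + 1, memory_order_release);
    return 1;
}

/*
 * Consumer side, intended for a normal thread that may then hand `out` to
 * whatever print facility is safe there. Returns 0 when the ring is empty.
 */
static int log_pop(log_ring_t *ring, char out[LOG_MSG_LEN])
{
    unsigned tail = atomic_load_explicit(&ring->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&ring->head, memory_order_acquire);

    if (tail == head)
        return 0;

    memcpy(out, ring->msgs[tail & (LOG_SLOTS - 1)], LOG_MSG_LEN);
    atomic_store_explicit(&ring->tail, tail + 1, memory_order_release);
    return 1;
}
```

The design choice here is that the producer never blocks: when the ring is full the message is dropped, so the host context never ends up waiting on another processor.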

Sherman0236 commented 3 years ago

Wow! That makes so much more sense. I thought my theory was off, but now I see just how wrong I was. Thanks so much for the help! I really appreciate it!

Gbps commented 3 years ago

Thanks for helping out @tandasat and @Sherman0236! I didn't get a chance to take a look but I'm going to consider this case closed. The only NT functions used in an actual exit handler last time I checked were just the debug prints. Those do not have issues in the default configuration of this project because the default enabled exits all happen at low IRQLs and outside of critical IPI paths. But when rdtsc exiting is enabled, things get dicey, since rdtsc is called from all sorts of contexts with very low interruptibility. Makes perfect sense.

The correct thing to do would be to do as Satoshi said and replace the current debug print facilities with one that can be called from any context. ETW sounds like a good candidate for the design, and I do like that idea.

Thanks again everyone.