JHRobotics / patcher9x

Patch for Windows 9x to fix CPU issues
MIT License

TLB invalidation bug in Windows 2000 with PAE #4

Open DotNetTester opened 2 years ago

DotNetTester commented 2 years ago

Windows 2000 Advanced Server with PAE and nested paging both enabled fails to load anything (the screen goes blank) after the login screen in VirtualBox on my AMD systems. With nested paging disabled, it works fine with PAE. I also have absolutely no problems on my older systems with nested paging and PAE enabled.

I think this is possibly a TLB invalidation bug, because everything works with nested paging disabled. I have been trying to isolate where the bug is in Windows 2000 with PAE, without much luck so far. I suspect it could be in the PAE kernel or in one of the important processes or system files. After seeing this project I was curious about a possible patch. It is a bit of a niche case, though, and only mildly related to Windows 9x.

PAE (Physical Address Extension) is the CPU feature that lets 32-bit OSes use more than 4 GB of RAM, which Windows 2000 Advanced Server supports. PAE is enabled by adding "/pae" to the end of the operating system's line in the "boot.ini" file.
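As a rough illustration of the boot.ini change described above, here is a minimal Python sketch that appends "/pae" to each entry under "[operating systems]". The sample file contents and the function name `enable_pae` are assumptions made for illustration, not taken from a real install; a real boot.ini may differ in paths and switches, and it is normally marked read-only.

```python
def enable_pae(boot_ini_text: str) -> str:
    """Append /pae to every [operating systems] entry that lacks it."""
    out = []
    in_os_section = False
    for line in boot_ini_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Track which section we are in; only OS entries get the switch.
            in_os_section = stripped.lower() == "[operating systems]"
        elif in_os_section and "=" in stripped and "/pae" not in stripped.lower():
            line = line.rstrip() + " /pae"
        out.append(line)
    return "\n".join(out)


# Hypothetical boot.ini contents, for illustration only.
SAMPLE = r'''[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINNT
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINNT="Microsoft Windows 2000 Advanced Server" /fastdetect'''

print(enable_pae(SAMPLE))
```

The function leaves the "[boot loader]" section alone and is safe to run twice, since entries that already carry the switch are skipped.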

DotNetTester commented 2 years ago

I'm curious about whether the same bug exists anywhere in Windows NT 4.0.

anx95 commented 2 years ago

I tested Win2k with PAE (8 GB RAM, 8 CPUs) on a Ryzen 7 5800X3D under VMware. The OS boots and works perfectly with no errors. It's possible the issue is related to VirtualBox. I have also tested the OS on a Ryzen 3700X and a 1600 using VMware. Again, no issues.

DotNetTester commented 2 years ago

> I tested Win2k with PAE (8 GB RAM, 8 CPUs) on a Ryzen 7 5800X3D under VMware. The OS boots and works perfectly with no errors. It's possible the issue is related to VirtualBox. I have also tested the OS on a Ryzen 3700X and a 1600 using VMware. Again, no issues.

Is nested paging enabled on both of your virtual machines? Does VMware support nested paging? I don't have much experience with VMware. :)

JHRobotics commented 2 years ago

Hello @DotNetTester,

I ran some tests and yes, this is a TLB-flushing bug. I'm surprised: I thought the NT family was free of them, but I now see that in some setups it is not.

I'm sorry, but I can't pin the bug down more closely. It is probably somewhere in ntoskrnl.exe. Most of the BSODs come from KfReleaseSpinLock (hal.dll), but that is probably only called from ntoskrnl.exe: some function created a spinlock, changed the page mapping, and when it tries to free the lock it accesses old memory. I tried injecting a TLB flush into this function and the system is a little more stable, but only a little (BSOD about 1 minute after logon instead of a few seconds). If I have some time I'll look at it again, but I'm out of luck today :-(
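To make the suspected failure mode concrete, here is a toy Python model of a TLB that caches translations. It is only a sketch of the general idea (a mapping is changed without invalidating the cached entry, so a later access still lands on the old frame); the class `ToyMMU` and its methods are invented for illustration and have nothing to do with the actual ntoskrnl.exe or hal.dll code.

```python
class ToyMMU:
    """Toy model: a page table plus a TLB that caches its translations."""

    def __init__(self):
        self.page_table = {}  # virtual page -> physical frame
        self.tlb = {}         # cached translations

    def map(self, vpage, frame):
        # Update the page table only; the TLB is deliberately left untouched,
        # which is the situation a missing invalidation would create.
        self.page_table[vpage] = frame

    def translate(self, vpage):
        # A TLB hit wins over the page table, even if the entry is stale.
        if vpage not in self.tlb:
            self.tlb[vpage] = self.page_table[vpage]
        return self.tlb[vpage]

    def flush(self, vpage=None):
        # Flush one entry, or the whole TLB when no page is given.
        if vpage is None:
            self.tlb.clear()
        else:
            self.tlb.pop(vpage, None)


mmu = ToyMMU()
mmu.map(0x10, frame=1)
first = mmu.translate(0x10)   # caches 0x10 -> frame 1
mmu.map(0x10, frame=2)        # remap without invalidating the TLB
stale = mmu.translate(0x10)   # still resolves to the old frame
mmu.flush(0x10)
fresh = mmu.translate(0x10)   # resolves to the new frame after the flush
print(first, stale, fresh)
```

In this model the access between the remap and the flush is the one that touches "old memory", which is why adding a flush in only one place may reduce crashes without removing them.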

DotNetTester commented 2 years ago

I've done a bunch of testing, and so far this bug doesn't appear to affect any configuration of Windows Server 2003 32-bit. Windows Server 2003 32-bit RTM, Service Pack 2, and newer update levels run without issue in VirtualBox with nested paging and PAE on my AMD systems. Windows Server 2003 Service Pack 2 enables PAE by default and it runs great with 8GB of RAM.

Windows 2000 runs great with PAE as long as nested paging is disabled on my AMD systems. Windows 2000 also runs great with nested paging if PAE is disabled on my AMD systems. It's the combination of PAE and nested paging that results in Windows 2000 failing to load in VirtualBox.
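For anyone who wants to reproduce the four PAE and nested paging combinations, the following Python sketch builds the matching VBoxManage commands. It assumes the `--pae` and `--nestedpaging` spellings of `VBoxManage modifyvm` as used by VirtualBox 6.1 (check `VBoxManage modifyvm` help on other versions), and the VM name "Win2000AS" is hypothetical. The script only prints the commands; it does not run them.

```python
from itertools import product

VM_NAME = "Win2000AS"  # hypothetical VM name, replace with your own


def modifyvm_command(vm_name, pae, nested_paging):
    """Build the VBoxManage argument list for one PAE/nested paging combination."""
    def on_off(flag):
        return "on" if flag else "off"

    return [
        "VBoxManage", "modifyvm", vm_name,
        "--pae", on_off(pae),
        "--nestedpaging", on_off(nested_paging),
    ]


# All four combinations, in the order (off, off), (off, on), (on, off), (on, on).
commands = [
    modifyvm_command(VM_NAME, pae, nested)
    for pae, nested in product([False, True], repeat=2)
]

for cmd in commands:
    print(" ".join(cmd))
```

The VM has to be powered off when `modifyvm` is applied, and the guest still needs "/pae" in boot.ini for the PAE kernel to be used.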

DotNetTester commented 2 years ago

The old documentation says that Windows 2000 and Windows Server 2003 use a different kernel for PAE and non-PAE. I suspect that the PAE kernel has a TLB invalidation bug in it, especially after reading your post. If this is the case and the bug in fact only occurs with PAE enabled, a possible fix could be to create a tool that loads into memory at boot time and uses DLL injection to fix the bug? That would be quite practical since PAE is disabled by default in Windows 2000. You could install the tool, add "/pae" to the boot.ini file and reboot.

I am only posting this suggestion because of all the various update levels, service pack levels and the system file protection, system file signing, etc. which could create a mess with directly patching the kernel for this bug in Windows 2000. Of course, you may find a way to work around it all by patching the kernel "on the fly". :)

It might be wise to double check that it doesn't affect Windows 2000 with PAE disabled in any way once the source of the bug is identified.

anx95 commented 2 years ago

I decided to launch my VMware VM image inside VirtualBox. As far as this machine is concerned, the OS boots normally. Nested paging is active. [image]

It could be a conflict with, or otherwise related to, the VirtualBox Guest Additions.

DotNetTester commented 2 years ago

My virtual machines freeze even without the VirtualBox Guest Additions.

It's possible that this is a TLB invalidation bug that is only exposed by VirtualBox and somehow doesn't affect VMware because of its design. I'd imagine that each virtualization product is designed differently "under the hood", possibly vastly so. It's also possible that VMware includes a built-in workaround for TLB invalidation bugs when running old operating systems. Does Windows 98 run without issue on VMware with modern AMD CPUs?

It is interesting that it works on VMware. That could offer clues to the source of the problem, and it's possible that a fix could be made in VirtualBox. I'm unsure whether the VirtualBox developers would be willing to fix a TLB bug affecting Windows 2000, if it really is that. They haven't made fixes for Windows 9x either. It would be wiser to fix Windows 2000 directly.

DotNetTester commented 2 years ago

I updated to VirtualBox 6.1.36 and Windows 2000 Advanced Server with PAE and nested paging enabled behaves even more like Windows 98 with the TLB invalidation bug. It loads successfully now but with frequent crashes while using Windows 2000 and many installers fail to start and display errors.

DotNetTester commented 2 years ago

I found potentially useful code here for ideas regarding patching the Windows 2000 kernel: https://github.com/evgen-b/PatchPAE3

DotNetTester commented 2 years ago

I tried enabling PAE on Windows 2000 SP4 without any additional patches and it is affected in exactly the same way. The TLB invalidation bug affects a wide range of Windows 2000 Advanced Server installs with PAE enabled on VirtualBox.