KhronosGroup / Vulkan-Ecosystem

Public repository for Vulkan Ecosystem issues
Apache License 2.0

Computer\HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\Vulkan\ExplicitLayers registry key is purged after driver updates on Windows #38

Closed kondrak closed 5 years ago

kondrak commented 6 years ago

This is an ongoing issue that I've been running into with each consecutive GPU driver update on Windows. Reproduced on 3 different machines with NVidia cards, but it might as well be vendor independent.

Each time I perform a driver update (either through Windows Update or by downloading the drivers directly from NVidia's website), the Computer\HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\Vulkan\ExplicitLayers registry key is purged. This results in Validation Layers not working at all - other Vulkan functionality works perfectly fine. The only known solution to this problem is to reinstall the Vulkan SDK which repopulates all necessary registry keys.
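To confirm the symptom after a driver update, one option is to list what is left under the key. This is a minimal sketch, assuming Python 3 on Windows and the standard-library `winreg` module. The helper names are hypothetical. The convention that each value name is the path to a layer JSON manifest, with DWORD data 0 meaning enabled, comes from the Vulkan loader documentation rather than from this thread.

```python
import sys

EXPLICIT_LAYERS_KEY = r"SOFTWARE\Khronos\Vulkan\ExplicitLayers"


def enabled_manifests(values):
    """Return manifest paths from (name, data) pairs whose DWORD data is 0."""
    return [name for name, data in values if data == 0]


def read_explicit_layers():
    """Read (name, data) pairs from ExplicitLayers; [] if the key is missing."""
    import winreg  # Windows-only module, so it is imported lazily

    try:
        key = winreg.OpenKey(
            winreg.HKEY_LOCAL_MACHINE,
            EXPLICIT_LAYERS_KEY,
            0,
            winreg.KEY_READ | winreg.KEY_WOW64_64KEY,  # read the 64-bit view
        )
    except FileNotFoundError:
        return []
    with key:
        value_count = winreg.QueryInfoKey(key)[1]  # number of values in the key
        return [winreg.EnumValue(key, i)[:2] for i in range(value_count)]


if __name__ == "__main__" and sys.platform == "win32":
    manifests = enabled_manifests(read_explicit_layers())
    if manifests:
        print("\n".join(manifests))
    else:
        print("ExplicitLayers is empty or missing: no explicit layers registered")
```

Running it before and after a driver update shows whether the entries were removed.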

This problem has been encountered by several other people but it seems there's no obvious pattern to reproduce this, so it's not even clear what causes this - either it's an OS issue or something not quite right with the driver installers.

I'll be happy to provide further information that might help identify the root of the problem.

Tobski commented 6 years ago

I've seen a bunch of devs talk about this issue too - it's a pain in the butt and constantly causes issues for developers, which is not a great dev experience. Would love to see this fixed!

kayru commented 6 years ago

I've encountered this too. Validation layers are reported to be present at runtime, but no validation messages are ever logged. This is quite confusing, because when stepping through the code things appear to work correctly. It can lead to errors sneaking in, as the developer is not aware of anything wrong until it's too late.

Jasper-Bekkers commented 6 years ago

This is a really annoying workflow to have - ran into it so often it became second nature to reinstall the registry keys or SDK after a driver update. If this turns out to be an IHV issue we should make sure it ends up there as well.

krOoze commented 6 years ago

Same on AMD.

It is time this became mature. And not delete, double, or otherwise corrupt the entries.

lenny-lunarg commented 6 years ago

What SDK version do you have installed?

There is a known bug that can cause those registry entries to get deleted when installing or removing a pre 1.1.73.0 runtime, while there is a 1.1.73.0 or later SDK already installed. Furthermore, since driver installers should be removing old runtime installers when they are replaced, the first time you upgrade to a newer runtime, this issue will come up. This is the same issue that was at the root of KhronosGroup/Vulkan-ValidationLayers#143.

The long-term solution is to upgrade to 1.1.73 or later in both the runtime that the driver installs, and the SDK. I just installed Nvidia driver 398.36 and my layer registry entries (from SDK 1.1.77.0) were left alone (the driver installed 1.1.73.0). But it's also possible that this driver would break if I had an older SDK installed. I didn't try AMD, and I don't know what version they're installing.

The short-term workaround is to remove all 1.1.70 and earlier SDKs and runtimes, and replace them with the latest SDK. That should solve the problem as long as drivers don't install old runtimes again.
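The version rule above (anything older than 1.1.73.0 can trigger the deletion) can be sketched as a small check. This is only an illustration, assuming Python 3; the function names and the sample version list are hypothetical, and short versions such as "1.1.73" are padded with zeros before comparing.

```python
FIRST_FIXED = (1, 1, 73, 0)  # per the comment above: 1.1.73.0 or later is unaffected


def parse_version(text):
    """Turn '1.1.70' or '1.1.70.0' into a 4-part tuple, padding with zeros."""
    parts = [int(piece) for piece in text.split(".")]
    parts += [0] * (4 - len(parts))
    return tuple(parts)


def risky_versions(installed):
    """Return the version strings that are older than 1.1.73.0."""
    return [v for v in installed if parse_version(v) < FIRST_FIXED]


if __name__ == "__main__":
    # Hypothetical inventory of SDKs/runtimes found on a machine.
    print(risky_versions(["1.0.54.1", "1.1.70", "1.1.73.0", "1.1.77.0"]))
```

Anything the check returns is a candidate for removal under the short-term workaround.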

Also, in the validation layer issue, we came to the conclusion that documenting the issue and communicating should be enough. Obviously, based on this issue, we haven't done that well enough. Does anyone have thoughts on the best way to do that? We can't put the documentation in old SDKs that have shipped, and it feels like it would be more useful to document it in old SDKs than new ones, since those are the ones that cause the problem. So where would the best place to document this be?

kondrak commented 6 years ago

This happened even when updating to 398.36 driver and only 1.1.77.0 installed at one time, I had no older SDKs installed.

lenny-lunarg commented 6 years ago

@kondrak, do you know if you had any other runtime installed? The runtimes wouldn't be nearly as obvious as the SDKs. The way to be absolutely sure what's installed is to check your System32 directory (by default it's C:\Windows\System32) and look for files in the format vulkan-1-x-x-x-x.dll. The numbers in place of x's identify the version of the runtime. If you did have any other runtimes installed at the time, it's possible that the driver removed them during the installation, which would cause this problem. Unfortunately, it's likely too late to know for sure if that was the case before the driver install.
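The System32 check can also be scripted. This is a minimal sketch, assuming Python 3 and assuming the four numbers in the file name map directly to a dotted version (so vulkan-1-1-0-54-1.dll would be runtime 1.0.54.1); the function name is hypothetical.

```python
import os
import re

# "vulkan-1" is the library name; the four numbers after it are the version.
# The unversioned vulkan-1.dll deliberately does not match.
_VERSIONED_LOADER = re.compile(
    r"^vulkan-1-(\d+)-(\d+)-(\d+)-(\d+)\.dll$", re.IGNORECASE
)


def runtime_versions(filenames):
    """Return sorted version tuples for versioned loader DLLs in filenames."""
    versions = []
    for name in filenames:
        match = _VERSIONED_LOADER.match(name)
        if match:
            versions.append(tuple(int(part) for part in match.groups()))
    return sorted(versions)


if __name__ == "__main__":
    system32 = os.path.join(os.environ.get("SystemRoot", r"C:\Windows"), "System32")
    if os.path.isdir(system32):
        for version in runtime_versions(os.listdir(system32)):
            print(".".join(str(part) for part in version))
```

Each printed line corresponds to one versioned copy of the loader that is still present on disk.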

Also, do you have any other graphics drivers installed? Windows Update has been known to install other drivers and it's possible that these other drivers are causing trouble.

But if you have only one Nvidia driver and no other runtimes installed, it's possible that there's another bug here separate from the known one. I'll have to look into that a little more.

krOoze commented 6 years ago

@lenny-lunarg IMO the documentation is not a problem. I think it is obvious reinstalling SDK will fix this (and OP figured so). It is more how long it drags on (similar issues drag on from 1.0.0). It corrupted, doubled, erased, forgot to uninstall, failed to install, or whatever with the RT for as long as I can remember. It has simply been a resurfacing issue for too long.

TBF I just do this workaround automatically after each driver update now. AMD is supposed to already be on 1.1.73 in beta, so hopefully it should work correctly from next update on... Still if Windows Update driver version interferes, that is a problem.

Furthermore, since driver installers should be removing old runtime installers when they are replaced, the first time you upgrade to a newer runtime, this issue will come up.

Wait, I thought they are supposed to coexist. Was that changed? Driver should uninstall the RT and only the RT it installed, no?

lenny-lunarg commented 6 years ago

When the runtime installer was created over two years ago, the behavior that was settled on was over-complicated. This has been causing us trouble for some time. The runtime installer used to keep a copy of the loader and vulkaninfo for every single runtime that got installed. On uninstallation, the runtime would remove the file that it installed, and change the file that doesn't have the version embedded into it to be the latest version that is still installed to the machine. This caused all sorts of trouble because even if you fixed a bug, uninstallers that were triggered by driver installs would remove old runtimes, causing the bug to happen again. On top of that, the logic wasn't particularly useful, as there was no real need to keep around the old versioned copies of the runtime files.

On top of that, the logic to configure layers was put into the runtime installer/uninstaller and not in the SDK installer/uninstaller. This was done to ensure that the layers would only be configured if their version matched the loader version. But that's not useful behavior as those two components are supposed to work even when they're separate versions. And SDK logic should never have made it into the runtime in the first place.

As a result, when Windows changed the requirements for drivers so that they could not use the old runtime installer in future drivers, we tried to redesign this to a much simpler and better system, but the convoluted older behavior proved problematic because we didn't want to break backwards compatibility.

Wait, I thought they are supposed to coexist. Was that changed?

Old runtimes are supposed to coexist, and we went out of our way to design a solution where old runtime uninstallers would not downgrade the loader because of the change. But I forgot to account for the fact that old runtime uninstallers would be configuring layers. As a result, our solution didn't take into account validation layer configuration and we broke it. We didn't catch this until after release and we haven't come up with a way to change that, without changing behavior (again).

It is more how long it drags on

The hope is that this overhaul will prevent these issues from coming up again. I am not aware of any problems that have been reported with the new scheme that weren't compatibility issues. That's part of why I want to establish if this really is a compatibility issue or not.

krOoze commented 6 years ago

The way to be absolutely sure what's installed is to check your System32 directory (by default it's C:\Windows\System32) and look for files in the format vulkan-1-x-x-x-x.dll.

OK, I have a vulkan-1-999-0-0-0.dll :p PS: Am just gonna nuke it; what's the worst that can happen...

Jasper-Bekkers commented 6 years ago

@lenny-lunarg IMO the documentation is not a problem. I think it is obvious reinstalling SDK will fix this (and OP figured so).

This is still not a great workflow.

lenny-lunarg commented 6 years ago

OK, I have a vulkan-1-999-0-0-0.dll :p

999 is used to ensure that the runtime installed by the new mechanism will always be considered newer than the old ones. I meant that this will check which old runtimes you have installed.

krOoze commented 6 years ago

@lenny-lunarg Nice hax :p! Anyway, it does not seem to be cleaned up after uninstalling everything... I just deleted it; hope it does not linger somewhere in registry too.

krOoze commented 6 years ago

OK, I tried from what I assume is a clean state. AMD reports 1.1.73, but apparently installs 1.1.70 RT, sigh... And yeah, the layers get deleted from registry.

kondrak commented 6 years ago

@lenny-lunarg I just checked the contents of my System32 folder and here's what I have:

$ ls Windows/System32 | grep vulkan
vulkan-1.dll
vulkan-1-1-0-54-1.dll
vulkan-1-1-0-65-1.dll
vulkan-1-999-0-0-0.dll
vulkaninfo.exe
vulkaninfo-1-1-0-54-1.exe
vulkaninfo-1-1-0-65-1.exe
vulkaninfo-1-999-0-0-0.exe

So it seems there was still some leftover garbage. I'm confused: before upgrading the SDK I always uninstalled the existing one, so I'd expect the dlls to be cleared too - or is that something provided by the driver updates? And yes, I only have one set of NVidia drivers, nothing else.

krOoze commented 6 years ago

@kondrak At some point they decided to hide the uninstallers from the users. Go to C:\Program Files (x86)\VulkanRT\version\ where you should find the uninstaller executable for the older RT versions.

kondrak commented 6 years ago

I navigated to VulkanRT and have indeed found older runtimes for 1.0.54.1 and 1.0.65.1. What's still not quite clear to me is whether this broken layers behavior has been fixed in the latest SDKs, according to what @lenny-lunarg said, because it seems it's still broken for @krOoze?

krOoze commented 6 years ago

@kondrak My driver installs 1.1.70 (despite reporting 73). Since @lenny-lunarg says everything has to be >= 1.1.73, I would not experience the fixed behavior.

lenny-lunarg commented 6 years ago

I'm confused, before upgrading the SDK I always uninstalled the existing one so I'd expect the dlls to be cleared too - or is that something provided by the driver updates?

I'm not sure where the old runtime installers come from. We've usually seen that happen when drivers install a runtime and then don't remove it. But to my knowledge, Nvidia drivers haven't had any trouble with that, so I don't know how that would be happening on your system. Unfortunately, we don't have any way to track where those runtimes came from, so I can't really say anything about them with confidence.

kondrak commented 6 years ago

On that particular computer I had older Vulkan SDKs installed so chances are these are just leftovers from the older uninstallers not working correctly.

However, I just checked another machine which has the same problem but only had 1.1.70 SDK installed prior to updating to 1.1.77 and here's what I have:

$ ls /cygdrive/c/Windows/System32/ | grep vulkan
vulkan-1.dll
vulkan-1-999-0-0-0.dll
vulkaninfo.exe
vulkaninfo-1-999-0-0-0.exe

I manually uninstalled 1.1.70 before updating to 1.1.77. Then I updated NVidia drivers (using their official installer) and the problem still persisted. My VulkanRT folder now only contains this:

$ ls -l /cygdrive/c/Program\ Files\ \(x86\)/VulkanRT/
install.log
LICENSE.txt
VULKANRT_LICENSE.rtf
VulkanRT-License.txt

pdaniell-nv commented 6 years ago

https://github.com/KhronosGroup/Vulkan-Ecosystem/issues/38#issuecomment-409526961 @kondrak Which NVIDIA driver version did you install?

kondrak commented 6 years ago

I checked that with latest 398.82 drivers for GTX 970, Windows 10 64bit

pdaniell-nv commented 6 years ago

That driver has VulkanRT-1.1.73, which shouldn't have the issue. Hmm.

krOoze commented 6 years ago

AMD beta is now on 77, and the layers seem to survive driver uninstall now.

pdaniell-nv commented 6 years ago

I've tried to reproduce what @kondrak is seeing locally, but I'm having no luck. For me with SDK-1.1.82.0 installed, when I install 398.82, which has RT-1.1.73.0, the SDK remains usable and the registry entries for the layers remain.

I'm curious, do you have any "VulkanRT" entries in: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall

If so, you can execute the "UninstallString" for each one until they all disappear. You can hit F5 in regedit after each uninstall and see the list shrink. With these all gone there is no chance a stale uninstaller gets called by accident.
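Listing those entries can be scripted as well. This is a minimal sketch, assuming Python 3 on Windows with the standard-library `winreg` module, and assuming the relevant subkey names contain "VulkanRT"; the helper names are hypothetical. It only prints each "UninstallString" and does not execute anything, and it reads only the registry view visible to the running Python process.

```python
import sys

UNINSTALL_KEY = r"SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall"


def vulkan_rt_entries(entries):
    """Keep (subkey, uninstall_string) pairs whose subkey mentions VulkanRT."""
    return [(name, cmd) for name, cmd in entries if "vulkanrt" in name.lower()]


def read_uninstall_entries():
    """Return (subkey, UninstallString) pairs for every subkey that has one."""
    import winreg  # Windows-only module, so it is imported lazily

    entries = []
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, UNINSTALL_KEY) as root:
        subkey_count = winreg.QueryInfoKey(root)[0]  # number of subkeys
        for index in range(subkey_count):
            name = winreg.EnumKey(root, index)
            with winreg.OpenKey(root, name) as sub:
                try:
                    command, _type = winreg.QueryValueEx(sub, "UninstallString")
                except FileNotFoundError:
                    continue  # entry has no UninstallString value
                entries.append((name, command))
    return entries


if __name__ == "__main__" and sys.platform == "win32":
    for name, command in vulkan_rt_entries(read_uninstall_entries()):
        print(f"{name}: {command}")
```

An empty result means no VulkanRT uninstaller is registered in that view of the registry.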

Another thing I'm curious about. When you install for example 398.82 does it ask you to reboot at the end? Do you ever do a "clean install"? The reason I ask this is because I wonder if on your system the install of 398.82 is going through a currentDriver->someOldDriver->398.82 sequence and the install of someOldDriver is what's messing up the registry. If this is happening one thing you could try is to purge all drivers from your system with a tool like https://www.guru3d.com/files-details/display-driver-uninstaller-download.html so you know the only possible version on your system is 398.82.

kondrak commented 6 years ago

My HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall entry is empty. Frankly, I can't remember if the driver installer asked me to reboot, but I know for certain that I never did a clean install of the drivers with recent updates. As of today, I'm running the latest SDK and latest NVidia drivers. As soon as new drivers show up I'll perform an update and report back if the problem persists. Alternatively I can try to reinstall the current drivers if it helps you - just let me know what steps I should follow (i.e. a clean install/upgrade/other?).

pdaniell-nv commented 6 years ago

I think waiting for the next driver update makes sense. It should have RT-1.1.77.0 and should be available very soon. Thanks again for your help isolating this issue.

kondrak commented 6 years ago

I have now updated my drivers to 399.07 (performing an update, not a clean install) and for the first time I can see that ExplicitLayers had not been removed from the registry. It seems the problem no longer occurs. Can anyone else confirm this? @kayru @Jasper-Bekkers I know you ran into this too.

pdaniell-nv commented 6 years ago

Awesome. Thanks for trying it out and reporting your findings.

KarenGhavam-lunarG commented 6 years ago

@kayru @Jasper-Bekkers Have you had a chance to verify that updating to 399.07 does not have a problem? I am thinking that this issue can be closed but would like a verification from a few more people.

Thanks!

kayru commented 5 years ago

Haven't experienced the issue so far.

KarenGhavam-lunarG commented 5 years ago

Closing this issue.