KhronosGroup / Vulkan-Loader

Vulkan Loader
https://vulkan.lunarg.com/doc/sdk/latest/linux/LoaderInterfaceArchitecture.html

vkEnumeratePhysicalDevices/vkGetPhysicalDeviceProperties returns incorrect results with multiple nVidia GPUs #283

Closed: Keith-Albright-Bose closed this issue 5 years ago

Keith-Albright-Bose commented 5 years ago

I recently got an Akitio Node dock to host a GTX-1660 GPU for use with a Dell Precision 5520. This shows up in device manager and is listed when enumerating devices.

The laptop has 2 GPUs, an nVidia Quadro M1200 and an Intel P530. Without the dock installed, both of these devices are listed (by way of vkEnumeratePhysicalDevices and vkGetPhysicalDeviceProperties):

(0) Quadro M1200
(1) Intel(R) HD Graphics P530

After attaching the GPU dock I get this list:

(0) GeForce GTX 1660
(1) GeForce GTX 1660
(2) Intel(R) HD Graphics P530

The Quadro M1200 is missing and the GTX 1660 is listed 2x.

Checking the data returned by a call to vkEnumeratePhysicalDevices does indeed show unique device handles:

[0] = {m_physicalDevice=0x0000026df68949a0 {...} }
[1] = {m_physicalDevice=0x0000026df6894dc0 {...} }
[2] = {m_physicalDevice=0x0000026df6894a80 {...} }

yet the name and id are identical for the first two results.

See below for the vulkaninfo dump. vkGetPhysicalDeviceProperties returns the following for each handle:

[0] = {m_physicalDevice=0x0000026df68949a0 {...} }
    deviceID: 8580
    deviceName: 0x0000009bb44fc474 "GeForce GTX 1660"

[1] = {m_physicalDevice=0x0000026df6894dc0 {...} }
    deviceID: 8580
    deviceName: 0x0000009bb44fce04 "GeForce GTX 1660"

[2] = {m_physicalDevice=0x0000026df6894a80 {...} }
    deviceID: 6429
    deviceName: 0x0000009bb44fc474 "Intel(R) HD Graphics P530"
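The symptom described here (distinct handles, identical properties) can be checked mechanically. The following is only a minimal sketch in Python, not code from the loader or from the reporter's application: the tuples are copied from the values quoted above, and the helper name `find_duplicate_properties` is hypothetical.

```python
from collections import defaultdict

# (handle, deviceID, deviceName) copied from the values quoted in this report.
devices = [
    (0x0000026DF68949A0, 8580, "GeForce GTX 1660"),
    (0x0000026DF6894DC0, 8580, "GeForce GTX 1660"),
    (0x0000026DF6894A80, 6429, "Intel(R) HD Graphics P530"),
]


def find_duplicate_properties(devs):
    """Group distinct handles that report the same (deviceID, deviceName).

    Hypothetical helper for illustration; returns only the groups that
    contain more than one handle.
    """
    groups = defaultdict(list)
    for handle, device_id, name in devs:
        groups[(device_id, name)].append(handle)
    return {key: handles for key, handles in groups.items() if len(handles) > 1}


duplicates = find_duplicate_properties(devices)
for (device_id, name), handles in duplicates.items():
    formatted = ", ".join(f"{h:#018x}" for h in handles)
    print(f"{name} (deviceID {device_id}) is reported by {len(handles)} handles: {formatted}")
```

This is a heuristic only: two physically identical cards would legitimately report the same name and deviceID, so a match here is a hint to investigate rather than proof of a bug.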

Some other context if it matters, I've seen nSight Graphics report that my system is known as Microsoft Windows Hybrid Graphics.

I am doing compute with Vulkan and using multiple GPUs in parallel but this issue prevents using all the available resources since effectively I'm limited to using 2 of the 3 GPUs.

pdaniell-nv commented 5 years ago

@Keith-Albright-Bose would you mind attaching the complete vulkaninfo capture. What OS is this on? Thanks.

lenny-lunarg commented 5 years ago

To summarize quickly, it sounds like you're having two problems:

  1. When you plug in your GTX 1660, the Quadro no longer shows up
  2. The GTX 1660 is reported twice

I need to do some investigation on the first issue, especially if this is Windows. On the second issue, the loader basically just passes on what the driver gives it, but there is a chance the loader is messing something up and loading a driver twice. I'll look into this a little further. Also, I second the questions @pdaniell-nv just posted.

Keith-Albright-Bose commented 5 years ago

> @Keith-Albright-Bose would you mind attaching the complete vulkaninfo capture. What OS is this on? Thanks.

This is on Windows 10. Attached are two vulkaninfo files, one with the dock and one without: MultiGPUInfoDockAttached.txt MultiGPUInfoNoDock.txt

Keith-Albright-Bose commented 5 years ago

>   1. When you plug in your GTX 1660, the Quadro no longer shows up
>   2. The GTX 1660 is reported twice

To clarify, the Vulkan loader doesn't list the Quadro, but it does appear in Device Manager, so the host OS definitely sees all three GPUs.

Here's the output of Get-WmiObject win32_videocontroller from PowerShell:

videocontroller.txt

pdaniell-nv commented 5 years ago

@Keith-Albright-Bose Thanks for all that information. Could you do another capture of the vulkaninfo when the dock is attached with the following options:

set VK_LOADER_DEBUG=all
vulkaninfo 1> C:\MultiGPUInfoDockAttachedDebug.txt 2>&1

That will capture both the stdout and stderr from vulkaninfo with all the loader debug information enabled.
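Once such a capture exists, the driver manifests the loader found can be pulled out of it with a small filter. The sketch below is an illustration in Python, not part of the loader or of vulkaninfo. It assumes the log contains lines of the form `Located json file "<path>"`, which matches the loader messages quoted later in this thread. The helper name `located_manifests` is hypothetical.

```python
import re

# Matches the quoted path in loader debug lines such as:
#   INFO: loaderAddJsonEntry: Located json file "<path>" from PnP registry: E
# The line format is an assumption based on the messages quoted in this thread.
MANIFEST_RE = re.compile(r'Located json file "([^"]+)"')


def located_manifests(log_text):
    """Return the manifest paths from 'Located json file' lines, in log order."""
    return MANIFEST_RE.findall(log_text)


# Sample lines copied from the loader output quoted in this thread.
sample = r'''
INFO: loaderAddJsonEntry: Located json file "C:\WINDOWS\System32\DriverStore\FileRepository\nv_dispi.inf_amd64_1ffb45b74346b667\nv-vk64.json" from PnP registry: E
INFO: loaderAddJsonEntry: Located json file "C:\WINDOWS\System32\DriverStore\FileRepository\nvdmi.inf_amd64_1c1eeed184017b59\nv-vk64.json" from PnP registry: E
'''

manifests = located_manifests(sample)
for path in manifests:
    print(path)
```

To run it against a real capture, read the redirected file and pass its contents to `located_manifests` instead of `sample`.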

I tried to reproduce the issue you're seeing locally with 441.20 on a similar setup, but wasn't able to repeat what you're seeing. The vulkaninfo capture with debug information from the loader will help.

Thanks again for your help.

Keith-Albright-Bose commented 5 years ago

@pdaniell-nv Thanks for the debug switch. Ahh, the ol' redirect stderr to stdout. BTW, related to tools: could you get me connected to someone who can provide access to a tool (prerelease or otherwise) that can help optimize compute-only shaders? I'm interested in the HW counter info. nSight Graphics currently can't connect. It does work for Vulkan apps that render and compute. I had heard there were various tools available on request.

MultiGPUInfoDockAttachedDebug.txt

lenny-lunarg commented 5 years ago

The thing that immediately jumps out to me from this latest log is that it looks like you're loading two NVIDIA drivers:

INFO: loaderAddJsonEntry: Located json file "C:\WINDOWS\System32\DriverStore\FileRepository\nv_dispi.inf_amd64_1ffb45b74346b667\nv-vk64.json" from PnP registry: E
INFO: loaderAddJsonEntry: Located json file "C:\WINDOWS\System32\DriverStore\FileRepository\nvdmi.inf_amd64_1c1eeed184017b59\nv-vk64.json" from PnP registry: E

The loader has some logic in it to strip out duplicate driver entries, but these aren't duplicates, so the loader isn't going to remove either of them. I suspect that both drivers are returning the GTX 1660 and that's why it appears twice (but I don't have enough info to be sure). The question then is why there are two driver entries, but I think @pdaniell-nv would be in a better position than I am to answer that. I would imagine it's related to having both the Quadro and the GTX, and possibly a driver for each.

I will mention that both of those drivers are coming out of the PnP registry. In a very recent change, we added loader functionality to get drivers from D3DKMTEnumAdapters2 and D3DKMTQueryAdapterInfo. Among other things, that change was supposed to be better at dropping drivers for GPUs that aren't plugged in. I don't know if that would solve this problem, but it might. It's also possible that we might be able to smarten up the PnP logic. I'm curious if @pdaniell-nv has any thoughts.
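To illustrate why duplicate stripping doesn't help here, the following is a rough sketch in Python. It is not the loader's actual implementation: it assumes duplicates are recognized by comparing normalized manifest paths, which the thread does not confirm. The function name `strip_duplicate_manifests` and the third, differently cased entry are made up for the example.

```python
import ntpath


def strip_duplicate_manifests(paths):
    """Drop entries whose normalized Windows path was already seen.

    Hypothetical rule for illustration only; the real loader's comparison
    may differ.
    """
    seen = set()
    unique = []
    for path in paths:
        key = ntpath.normcase(ntpath.normpath(path))
        if key not in seen:
            seen.add(key)
            unique.append(path)
    return unique


found = [
    # The two manifests reported in the loader log above.
    r"C:\WINDOWS\System32\DriverStore\FileRepository\nv_dispi.inf_amd64_1ffb45b74346b667\nv-vk64.json",
    r"C:\WINDOWS\System32\DriverStore\FileRepository\nvdmi.inf_amd64_1c1eeed184017b59\nv-vk64.json",
    # Invented entry: the first manifest again, spelled with different case.
    r"c:\windows\system32\DriverStore\FileRepository\nv_dispi.inf_amd64_1ffb45b74346b667\nv-vk64.json",
]

kept = strip_duplicate_manifests(found)
for path in kept:
    print(path)
```

Under this assumed rule only the invented third entry is dropped. The two real manifests live in different DriverStore directories, so both survive and both drivers get loaded.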

pdaniell-nv commented 5 years ago

The appearance of two driver entries should not happen and is likely a driver installer issue, albeit a very complicated one since this is an OEM machine with a docked consumer card. There should be one driver covering all GPUs. However, what's puzzling to me is why each driver doesn't enumerate two physical devices.

@Keith-Albright-Bose if you're willing and able I have another experiment for you to try:

Under the HKLM\SYSTEM\CurrentControlSet\Control\Class\{4d36e968-e325-11ce-bfc1-08002be10318} registry key you'll find a bunch of entries for the various display adapters: 0000, 0001, 0002, etc. Find the one where the "VulkanDriverName" value points to "C:\WINDOWS\System32\DriverStore\FileRepository\nvdmi.inf_amd64_1c1eeed184017b59\nv-vk64.json", then rename the "VulkanDriverName" value to "VulkanDriverName.ignore" and "VulkanDriverNameWow" to "VulkanDriverNameWow.ignore".

Grab the vulkaninfo again with loader debug enabled. That should cause the loader to only pick up one of the drivers and stop the duplicate "GeForce GTX 1660" entries. And if you're lucky, it may even cause the "Quadro M1200" to show up too.

NOTE: You'll need to put those names back after you remove the dock (eGPU) so Vulkan works again on the Quadro M1200.

Keith-Albright-Bose commented 5 years ago

Piers, I set .ignore on the requested keys for that UUID, checked, and then did the same for another entry with the nv-vk64.json value under a different UUID, just to see if the M1200 showed up. It did not. As you said, I only got one GTX 1660 in both cases.

Attached are two files: the one you requested and a reg export of the key you had me edit (with original values). For safety I did a text export rather than a .reg. I wonder if the Microsoft Windows Hybrid Graphics System is having an effect. This is happening on a Dell Precision 5520 laptop. I may be able to get a run on a coworker's 5540 tomorrow to check if there's a repro there.

MultiGPUInfoDockAttachedDebugRegIgnore.txt

multiGPURegDisplayAdapter.txt

pdaniell-nv commented 5 years ago

Thanks for trying that experiment. I'm not sure why the "Quadro M1200" isn't showing up. We'll need to replicate the issue locally to debug it. I'll ask around to see if anyone has a similar Dell laptop.

Keith-Albright-Bose commented 5 years ago

Will let ya know the results of running on the 5540 as well as the 7520. I reached out on LinkedIn if you want to email me any specific info. I can try and capture something (save off a minidump along with the .pdb matching what is built locally). I'm running VS 2017 or 2019. Or if you need something from WinDbg, let me know.

Keith-Albright-Bose commented 5 years ago

Update: Ran on a Dell Precision 5540. Same issue. We made sure to install DCH drivers for the Quadro T1000 as well as the GTX 1660. Here's the debug log from that machine: MultiGPUInfoQuadroT1000DockAttachedDebug.txt

I have built the vulkan-loader and run the validation tests both with and without the dock attached. Both cases passed.

Not sure what debug info you'd like:

Here's the state of inst just after loading in setupLoaderTrampPhysDevs, followed by 5 calls to terminator_GetPhysicalDeviceProperties, and then the state of inst at the end of setupLoaderTrampPhysDevs.

setupLoaderTrampPhysDevsInst_AtStart.txt
terminator_GetPhysicalDeviceProperties_1stCall.txt
terminator_GetPhysicalDeviceProperties_2ndCall.txt
terminator_GetPhysicalDeviceProperties_3rdCall.txt
terminator_GetPhysicalDeviceProperties_4thCall.txt
terminator_GetPhysicalDeviceProperties_5thCall.txt
setupLoaderTrampPhysDevs_InstAtEnd.txt

pdaniell-nv commented 5 years ago

We've determined this is a driver bug, and a fix is being developed. Unfortunately there is no workaround, so an updated driver is required. We'll include the fix in an upcoming Vulkan beta driver. Since this isn't a loader bug, this issue can be closed.

lenny-lunarg commented 5 years ago

I had gathered from the last few comments that it was probably a driver bug, but it's still good to hear the confirmation. Thanks.