felixdoerre / primus_vk

Vulkan GPU-offloading layer
BSD 2-Clause "Simplified" License

Dual Nvidia GPU #81

Closed jonpas closed 3 years ago

jonpas commented 3 years ago

I currently have 2 Nvidia GPUs installed: a GeForce GT 710 (host/display GPU) and a GeForce GTX 1060 6GB.

After removing modprobe.d/bumblebee.conf (which disables Nvidia driver use), optirun/primusrun and pvkrun work well with OpenGL applications: they render on the 1060 and display on the 710.

My xorg.conf has Driver "nvidia" set for BusID "PCI:10:0:0" which is the GT 710. PrimusVK is installed from AUR (also tested with primus_vk).

However, when trying a Vulkan application (pvkrun vkcube), I get the following:

PrimusVK: Searching for display GPU:
PrimusVK: 0x55d532a02bd0: 
PrimusVK: 0x55d532bf9fd0: 
PrimusVK: Searching for render GPU:
PrimusVK: 0x55d532a02bd0.
PrimusVK: Got discrete gpu!
PrimusVK: Device: GeForce GT 710
PrimusVK:   Type: 2
PrimusVK: No device for the display GPU found. Are the intel-mesa drivers installed?
PrimusVK: VK_ICD_FILENAMES not set
vkCreateInstance failed.

Do you have a compatible Vulkan installable client driver (ICD) installed?
Please look at the Getting Started guide for additional information.

Then I modified /etc/bumblebee/xorg.conf.nvidia with BusID "PCI:09:00:0", which is the GTX 1060, and I get:

PrimusVK: Searching for display GPU:
PrimusVK: 0x558a9cb9d130: 
PrimusVK: 0x558a9cbdf530: 
PrimusVK: Searching for render GPU:
PrimusVK: 0x558a9cb9d130.
PrimusVK: Got discrete gpu!
PrimusVK: Device: GeForce GTX 1060 6GB
PrimusVK:   Type: 2
PrimusVK: No device for the display GPU found. Are the intel-mesa drivers installed?
PrimusVK: VK_ICD_FILENAMES not set
vkCreateInstance failed.

Do you have a compatible Vulkan installable client driver (ICD) installed?
Please look at the Getting Started guide for additional information.

Progress, but the host GPU still isn't picked up as the display GPU. I've also tried setting VK_ICD_FILENAMES, with no success so far. I understand this is likely a problem with having 2 Nvidia GPUs.

felixdoerre commented 3 years ago

As the display GPU is not marked as "internal GPU", the auto-detection of GPU devices fails. I'd suggest that you choose the graphics devices by hand, by setting their ids manually through these environment variables:

PRIMUS_VK_DISPLAYID
PRIMUS_VK_RENDERID

You can find out the necessary ids with vulkaninfo. Look for this section:

GPU0:
VkPhysicalDeviceProperties:
---------------------------
        apiVersion     = 4202629 (1.2.133)
        driverVersion  = 1888518144 (0x70908000)
        vendorID       = 0x10de
        deviceID       = 0x1436
        deviceType     = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
        deviceName     = Quadro M2200

This would result in running applications with:

PRIMUS_VK_RENDERID=10de:1436 pvkrun vkcube

Try optirun -b none vulkaninfo to get the ids from both devices.
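
A minimal sketch of how the ids could be pulled out of vulkaninfo text and turned into the vendor:device form used with PRIMUS_VK_RENDERID above. The function name ids_from_vulkaninfo is hypothetical, and the parser assumes the "key = value" layout of the excerpt above; other vulkaninfo versions may print the properties differently.

```python
import re


def ids_from_vulkaninfo(text):
    """Return (deviceName, 'vendor:device') pairs found in vulkaninfo text.

    Assumes vendorID and deviceID appear before deviceName for each GPU,
    as in the excerpt quoted in the thread.
    """
    vendor = device = None
    found = []
    for line in text.splitlines():
        m = re.match(r"\s*(vendorID|deviceID|deviceName)\s*=\s*(.+?)\s*$", line)
        if not m:
            continue
        key, value = m.groups()
        if key == "vendorID":
            vendor = int(value, 16)  # accepts the "0x" prefix
        elif key == "deviceID":
            device = int(value, 16)
        elif vendor is not None and device is not None:
            # key == "deviceName": close the record for this GPU
            found.append((value, f"{vendor:04x}:{device:04x}"))
            vendor = device = None
    return found


sample = """\
        vendorID       = 0x10de
        deviceID       = 0x1436
        deviceType     = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
        deviceName     = Quadro M2200
"""
for name, env_id in ids_from_vulkaninfo(sample):
    print(f"{name}: {env_id}")
```

In practice the text would come from the output of optirun -b none vulkaninfo rather than from an inline sample.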

The output you shared looks really promising, as it indicates that both devices are reported and primus just fails to identify which device to use for which role.

jonpas commented 3 years ago

vulkaninfo directly throws:

ERROR at /build/vulkan-tools/src/Vulkan-Tools-1.2.153/vulkaninfo/vulkaninfo.h:247:vkGetPhysicalDeviceSurfaceFormats2KHR failed with ERROR_INITIALIZATION_FAILED

But I got the IDs for PRIMUS_VK_DISPLAYID and PRIMUS_VK_RENDERID using lspci -nn. However, the 1c03 (1060) doesn't get picked up; it instead matches the 710 by the vendorID (presumably before it finds the 1060). Using BusID to set the primary to the 1060 (as in the original post above) just reverses that and both become the 1060.

$ PRIMUS_VK_DISPLAYID=10de:128b PRIMUS_VK_RENDERID=10de:1c03 pvkrun vkcube
PrimusVK: Searching for display GPU:
PrimusVK: 0x564d5ed99bd0: 
PrimusVK: Got device from env!
PrimusVK: Device: GeForce GT 710
PrimusVK:   Type: 2
PrimusVK: Searching for render GPU:
PrimusVK: 0x564d5ed99bd0.
PrimusVK: Got device from env! (via vendorID)
PrimusVK: Device: GeForce GT 710
PrimusVK:   Type: 2
PrimusVK: fetching dispatch for 0x564d5f0609a0
PrimusVK: Creating display device finished!: 0
PrimusVK: fetching dispatch for 0x564d5ef93340
PrimusVK: CreateDevice done
PrimusVK: Application requested 3 images.
PrimusVK: Creating Swapchain for size: 417x519
PrimusVK: MinImageCount: 3
PrimusVK: fetching device for: 0x564d5ef93340
PrimusVK: FamilyIndexCount: 0
PrimusVK: Dev: 0x564d5f0609a0
PrimusVK: Swapchainfunc: 0x7f9051ea1720
PrimusVK: >> Swapchain create done -3;0x7c00000
Segmentation fault (core dumped)
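
For reference, a hedged sketch of extracting vendor:device ids from lspci -nn style lines. The helper name gpu_ids_from_lspci is hypothetical, the sample lines are modeled on the lspci output that appears later in this thread, and the regular expression assumes addresses without a PCI domain prefix.

```python
import re

# Matches e.g. "01:00.0 VGA compatible controller [0300]: ... [10de:1189] (rev a1)".
# The class code "[0300]" has no colon, so it is not mistaken for an id pair.
_LSPCI_LINE = re.compile(
    r"^(?P<addr>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s.*"
    r"\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
)


def gpu_ids_from_lspci(text):
    """Map PCI address -> 'vendor:device' for VGA controllers in lspci -nn text."""
    result = {}
    for line in text.splitlines():
        m = _LSPCI_LINE.match(line)
        if m and "VGA compatible controller" in line:
            result[m["addr"]] = f'{m["vendor"]}:{m["device"]}'
    return result


lspci_sample = """\
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK104 [GeForce GTX 670] [10de:1189] (rev a1)
02:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK107 [GeForce GTX 650] [10de:0fc6] (rev a1)
"""
for addr, env_id in gpu_ids_from_lspci(lspci_sample).items():
    print(addr, env_id)
```

GPUs that lspci lists under a different class name (for example "3D controller") would need the filter relaxed.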
jonpas commented 3 years ago

With optirun -b none vulkaninfo I do get both GPU0 and GPU1, but they are both the same, depending on what is set in /etc/bumblebee/xorg.conf.nvidia.

I should note, my main xorg.conf only has the 710 setup, and the 1060 is pre-bound to vfio-pci via kernel parameter so Nvidia doesn't hijack it and forbid rebinding later. I was using an Intel iGPU before upgrading to AMD 3700X and I didn't need that previously as Nvidia driver was only loaded as required (obviously iGPU was using the mesa driver). This could play a part.

felixdoerre commented 3 years ago

optirun -b none vulkaninfo showing the same GPU twice is concerning. This indicates that the nvidia driver did not correctly recognize your hardware. But bumblebee starting with the second GPU assigned is a good sign. Maybe we really need to load the nvidia driver twice, once per GPU... but that would probably require some hacks specific to your setup. For primus_vk to "start working" we need to see both devices working with Vulkan, which I assume we do not yet.

There actually seems to be a bug in the device-selection code, if both devices are from the same vendor. Could you add a && device == 0 here: https://github.com/felixdoerre/primus_vk/blob/0c63679ea07d950375ab9bb1362f0249ea7af7db/primus_vk.cpp#L129. That should allow you to correctly try to select the other GPU. But as you already said that vulkaninfo shows the same GPU twice I guess we need to fix that first....
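
To illustrate why a vendor-only fallback can pick the wrong card when both GPUs share a vendor, here is a small Python model of the behavior described above. It is not the code from primus_vk.cpp: the function pick_device, its parameters and the device list are hypothetical, and only the idea of restricting the vendor fallback to the case "device == 0" is taken from the suggestion.

```python
def pick_device(devices, wanted_vendor, wanted_device, fallback_needs_unset_device):
    """Return (name, how) for the first device accepted by the selection rule.

    devices: list of (vendor_id, device_id, name) in enumeration order.
    fallback_needs_unset_device: False models the reported behavior (any device
    of the right vendor is accepted); True models the suggested
    "&& device == 0" guard (vendor-only match only if no device id was given).
    """
    for vendor, device, name in devices:
        if vendor != wanted_vendor:
            continue
        if device == wanted_device:
            return name, "exact match"
        if not fallback_needs_unset_device or wanted_device == 0:
            return name, "via vendorID"
    return None, "not found"


gpus = [
    (0x10DE, 0x128B, "GeForce GT 710"),
    (0x10DE, 0x1C03, "GeForce GTX 1060 6GB"),
]

# Asking for the 1060 (10de:1c03) as render GPU:
print(pick_device(gpus, 0x10DE, 0x1C03, fallback_needs_unset_device=False))
print(pick_device(gpus, 0x10DE, 0x1C03, fallback_needs_unset_device=True))
```

Without the guard, the first device of the requested vendor wins before the exact match is ever reached, which mirrors the "Got device from env! (via vendorID)" line in the log above.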

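So let me just ask a few questions/state assumptions about your system setup:

You are running a X-Server or Wayland server that correctly detects and uses the nvidia graphics card as your primary X-Server. You configured your main xorg.conf to only use the 710 graphics card.

When you configure bumblebee to use the 1060, bumblebee manages to power up the 1060 GPU and successfully launches the secondary X-Server on :8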

The nvidia driver has the strange quirk of connecting to the (current) X-Server (for whatever reason), even before selecting graphics devices. I built nv_vulkan_wrapper to force the nvidia driver to see :8 as the DISPLAY environment variable, so that it sees the secondary GPU. However, on your hardware this probably causes the driver to not detect the GPU on :0. Interestingly, the nvidia driver correctly detects that there are two graphics cards, but wrongly detects which cards they are. One probably has to experiment more with the nvidia graphics driver to understand how to convince it to detect both cards correctly.

jonpas commented 3 years ago

primus_vk/primus_vk.cpp

Change prevents selection of same ID, but now the second one doesn't find anything, so it only finds either display or render (depending on which I put where) - and always the one set in xorg.conf.nvidia.

You are running a X-Server or Wayland server that correctly detects and uses the nvidia graphics card as your primary X-Server. You configured your main xorg.conf to only use the 710 graphics card.

Correct. The X-Server detects and uses the 710, as visible in nvidia-settings (where the 1060 does not appear, I assume because it's not on the same X-Server) and confirmed with:

$ glxinfo | grep renderer
OpenGL renderer string: GeForce GT 710/PCIe/SSE2

My xorg.conf:

Section "Device"
    Identifier "Nvidia"
    Driver "nvidia"
    VendorName "NVIDIA Corporation"
    BoardName "GeForce GT 710"
    BusID "PCI:10:0:0"
EndSection

When you configure bumblebee to use the 1060, bumblebee manages to power up the 1060 GPU and successfully launches the secondary X-Server on :8

Correct. I can run non-Vulkan applications completely fine using optirun or primusrun, and they even show up in nvidia-smi.

$ optirun glxinfo | grep renderer
OpenGL renderer string: GeForce GTX 1060 6GB/PCIe/SSE2

$ primusrun glxinfo | grep renderer
OpenGL renderer string: GeForce GTX 1060 6GB/PCIe/SSE2
felixdoerre commented 3 years ago

Wow, primus works! I wasn't expecting that. And to be honest, I don't really understand why. Probably because in the OpenGL context we always pass along the expected XDisplay explicitly.

I've just discovered (by accident) that the nvidia driver misbehaves less if it cannot open an X-Display. Could you just change the value of NV_BUMBLEBEE_DISPLAY here to some garbage value like hello? Another try might be to use /usr/share/vulkan/icd.d/nvidia_icd.json (I assume that was removed according to the installation instructions). I am beginning to suspect that for your setup an "untouched" nvidia driver could work. So could you try out what the results are if you have no nv_vulkan_wrapper.json but the original nvidia_icd.json? A third try: register libEGL_nvidia.so.0 as vulkan icd. As suggested in #24, this might get your setup working.

jonpas commented 3 years ago

I've just discovered (by accident) that the nvidia driver misbehaves less, if it cannot open an X-Display. Could you just change the value of NV_BUMBLEBEE_DISPLAY here to some garbage value like hello?

That change together with && device == 0 results in correct detection but errors later:

$ PRIMUS_VK_DISPLAYID=10de:128b PRIMUS_VK_RENDERID=10de:1c03 pvkrun vkcube
PrimusVK: Searching for display GPU:
PrimusVK: 0x55695356d380: 
PrimusVK: 0x55695356d3b0: 
PrimusVK: Got device from env!
PrimusVK: Device: GeForce GT 710
PrimusVK:   Type: 2
PrimusVK: Searching for render GPU:
PrimusVK: 0x55695356d380.
PrimusVK: Got device from env!
PrimusVK: Device: GeForce GTX 1060 6GB
PrimusVK:   Type: 2
PrimusVK: fetching dispatch for 0x55695370ffa0
PrimusVK: Creating display device finished!: 0
PrimusVK: fetching dispatch for 0x556953641bd0
PrimusVK: CreateDevice done
Can't find our preferred formats... Falling back to first exposed format. Rendering may be incorrect.
Segmentation fault (core dumped)

Another try might be to use /usr/share/vulkan/icd.d/nvidia_icd.json (I assume that was removed according to the installation instructions). I am beginning to suspect that for your setup an "untouched" nvidia driver could work. So could you try out what the results are if you have no nv_vulkan_wrapper.json but the original nvidia_icd.json?

I actually never had to remove that file in the past (using AUR primus-vk-git), so I had it there now as well. Moving it away doesn't change anything. Using only original nvidia_icd.json will result in no 1060 being detected by PrimusVK.

A third try: register libEGL_nvidia.so.0 as vulkan icd. As suggested in #24 this might get your setup working.

No change compared to the result with the code changes above.

felixdoerre commented 3 years ago

Interesting. That seems to be closer to our goal. Could you show the output of PRIMUS_VK_DISPLAYID=10de:128b PRIMUS_VK_RENDERID=10de:1c03 pvkrun vulkaninfo, so we can see the missing image formats vkcube is complaining about?

jonpas commented 3 years ago
$ PRIMUS_VK_DISPLAYID=10de:128b PRIMUS_VK_RENDERID=10de:1c03 pvkrun vulkaninfo
PrimusVK: Searching for display GPU:
PrimusVK: 0x55f5ea5af8d0: 
PrimusVK: 0x55f5ea5af900: 
PrimusVK: Got device from env!
PrimusVK: Device: GeForce GT 710
PrimusVK:   Type: 2
PrimusVK: Searching for render GPU:
PrimusVK: 0x55f5ea5af8d0.
PrimusVK: Got device from env!
PrimusVK: Device: GeForce GTX 1060 6GB
PrimusVK:   Type: 2
ERROR at /build/vulkan-tools/src/Vulkan-Tools-1.2.153/vulkaninfo/vulkaninfo.h:247:vkGetPhysicalDeviceSurfaceFormats2KHR failed with ERROR_INITIALIZATION_FAILED
felixdoerre commented 3 years ago

Hmm.... I think we need more debug output. Can you try vkcube and vulkaninfo with this branch? https://github.com/felixdoerre/primus_vk/tree/surface_format_output

jonpas commented 3 years ago

Had to comment out regeneration of that file from the XSLT.

$ PRIMUS_VK_DISPLAYID=10de:128b PRIMUS_VK_RENDERID=10de:1c03 pvkrun vkcube
PrimusVK: Searching for display GPU:
PrimusVK: 0x55e8c075b6f0: 
PrimusVK: 0x55e8c075b720: 
PrimusVK: Got device from env!
PrimusVK: Device: GeForce GT 710
PrimusVK:   Type: 2
PrimusVK: Searching for render GPU:
PrimusVK: 0x55e8c075b6f0.
PrimusVK: Got device from env!
PrimusVK: Device: GeForce GTX 1060 6GB
PrimusVK:   Type: 2
PrimusVK: fetching dispatch for 0x55e8c08fe330
PrimusVK: Creating display device finished!: 0
PrimusVK: fetching dispatch for 0x55e8c082ff60
PrimusVK: CreateDevice done
PrimusVK: Querying surface formats (base) returned: -3
PrimusVK: Querying surface formats (base) returned: -3
Can't find our preferred formats... Falling back to first exposed format. Rendering may be incorrect.
Segmentation fault (core dumped)
$ PRIMUS_VK_DISPLAYID=10de:128b PRIMUS_VK_RENDERID=10de:1c03 pvkrun vulkaninfo
PrimusVK: Searching for display GPU:
PrimusVK: 0x557c406ba8d0: 
PrimusVK: 0x557c406ba900: 
PrimusVK: Got device from env!
PrimusVK: Device: GeForce GT 710
PrimusVK:   Type: 2
PrimusVK: Searching for render GPU:
PrimusVK: 0x557c406ba8d0.
PrimusVK: Got device from env!
PrimusVK: Device: GeForce GTX 1060 6GB
PrimusVK:   Type: 2
PrimusVK: Querying surface formats returned: -3
ERROR at /build/vulkan-tools/src/Vulkan-Tools-1.2.153/vulkaninfo/vulkaninfo.h:247:vkGetPhysicalDeviceSurfaceFormats2KHR failed with ERROR_INITIALIZATION_FAILED
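
As a reading aid for the logs above, a small sketch that annotates the numeric codes. It assumes that the number after "returned:" is a VkResult and the number after "Type:" is a VkPhysicalDeviceType, which is consistent with the ERROR_INITIALIZATION_FAILED message and the "Got discrete gpu!" line but is not stated explicitly in the thread. The enum values are the ones from the Vulkan headers; the helper name annotate is hypothetical.

```python
import re

# Subset of VkResult values from the Vulkan headers.
VK_RESULT = {
    0: "VK_SUCCESS",
    5: "VK_INCOMPLETE",
    -1: "VK_ERROR_OUT_OF_HOST_MEMORY",
    -2: "VK_ERROR_OUT_OF_DEVICE_MEMORY",
    -3: "VK_ERROR_INITIALIZATION_FAILED",
    -4: "VK_ERROR_DEVICE_LOST",
}

# VkPhysicalDeviceType values, without the VK_PHYSICAL_DEVICE_TYPE_ prefix.
VK_DEVICE_TYPE = {
    0: "OTHER",
    1: "INTEGRATED_GPU",
    2: "DISCRETE_GPU",
    3: "VIRTUAL_GPU",
    4: "CPU",
}


def annotate(line):
    """Append the symbolic name to a PrimusVK log line that ends in a code."""
    m = re.search(r"returned: (-?\d+)\s*$", line)
    if m:
        return f"{line}  ({VK_RESULT.get(int(m.group(1)), 'unknown VkResult')})"
    m = re.search(r"Type: (\d+)\s*$", line)
    if m:
        return f"{line}  ({VK_DEVICE_TYPE.get(int(m.group(1)), 'unknown type')})"
    return line


for log_line in [
    "PrimusVK:   Type: 2",
    "PrimusVK: Querying surface formats (base) returned: -3",
    "PrimusVK: CreateDevice done",
]:
    print(annotate(log_line))
```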
felixdoerre commented 3 years ago

Ok, so we know that the call to vkGetPhysicalDeviceSurfaceFormats goes wrong even when we have both graphics devices. If I had access to the hardware, I would interactively try to understand what happens in the loader. So I guess we'd need to try that remotely. Please make sure you have debug symbols for libvulkan installed. I'd suggest we debug vkcube first, as this seems the easier application.

So run:

PRIMUS_VK_DISPLAYID=10de:128b PRIMUS_VK_RENDERID=10de:1c03 pvkrun gdb vkcube

Set a breakpoint here: https://github.com/KhronosGroup/Vulkan-Loader/blob/fa696ca02c7fcd488602a0e0132e26b49cfaa836/loader/wsi.c#L744 (or on the analogous locations for other display server types: terminator_CreateXlibSurfaceKHR... I am not sure which type vkcube will use on your system).

Check the loop over the icds. Make sure that the nvidia icd is one of them and there is only a single instance. Also take a look on other icds in that list and tell me what they are.

Set a breakpoint here: https://github.com/KhronosGroup/Vulkan-Loader/blob/fa696ca02c7fcd488602a0e0132e26b49cfaa836/loader/wsi.c#L322

Make sure that the control flow steps into this if-condition here: https://github.com/KhronosGroup/Vulkan-Loader/blob/fa696ca02c7fcd488602a0e0132e26b49cfaa836/loader/wsi.c#L349

When you're in, check the icd that is invoked there (e.g. by inspecting the value of phys_dev_term). It must be the nvidia icd.

jonpas commented 3 years ago

Check the loop over the icds. Make sure that the nvidia icd is one of them and there is only a single instance. Also take a look on other icds in that list and tell me what they are.

There is Nvidia ICD and only 1 instance. The other one is PrimusVK wrapper.

// loop 1
(gdb) p *icd_term->scanned_icd
$8 = {
  lib_name = 0xda1330 "libGLX_nvidia.so.0",
  handle = 0x6874e0,
  api_version = 4202638,
  interface_version = 5,
  GetInstanceProcAddr = 0x7ffff71607d0 <vk_icdGetInstanceProcAddr>,
  GetPhysicalDeviceProcAddr = 0x7ffff7160760 <vk_icdGetPhysicalDeviceProcAddr>,
  CreateInstance = 0x7ffff3b54c40,
  EnumerateInstanceExtensionProperties = 0x7ffff3b54c30
}

// loop 2
(gdb) p *icd_term->scanned_icd
$9 = {
  lib_name = 0xda11b0 "libnv_vulkan_wrapper.so.1",
  handle = 0x647110,
  api_version = 4198484,
  interface_version = 5,
  GetInstanceProcAddr = 0x7ffff78fd259 <vk_icdGetInstanceProcAddr(VkInstance, char const*)>,
  GetPhysicalDeviceProcAddr = 0x7ffff78fd2ab <vk_icdGetPhysicalDeviceProcAddr(VkInstance, char const*)>,
  CreateInstance = 0x7ffff3b54c40,
  EnumerateInstanceExtensionProperties = 0x7ffff3b54c30
}
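
The api_version fields in the gdb output are packed Vulkan version numbers. A minimal sketch of decoding them, following the bit layout used by the Vulkan version macros (7 bits major, 10 bits minor, 12 bits patch); the function name decode_vk_version is hypothetical.

```python
def decode_vk_version(value):
    """Decode a packed Vulkan version integer into 'major.minor.patch'."""
    major = (value >> 22) & 0x7F
    minor = (value >> 12) & 0x3FF
    patch = value & 0xFFF
    return f"{major}.{minor}.{patch}"


# Values taken from the gdb output and the vulkaninfo excerpt in this thread.
for packed in (4202638, 4198484, 4202629):
    print(packed, "->", decode_vk_version(packed))
```

The second value decodes to the same version string that the nv_vulkan_wrapper.json manifest quoted later in the thread declares.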

Make sure that the control flow steps into this if-condition here: KhronosGroup/Vulkan-Loader@fa696ca/loader/wsi.c#L349 When you're in, check the icd that is invoked there (e.g. by inspecting the value of phys_dev_term). It must be the nvidia icd.

Steps in, here is the ICD.

(gdb) p *phys_dev_term
$11 = {
  disp = 0xd9ecf0,
  this_icd_term = 0xf42180,
  icd_index = 0 '\000',
  phys_dev = 0xf47168
}
felixdoerre commented 3 years ago

There is Nvidia ICD and only 1 instance. The other one is PrimusVK wrapper.

The primus_vk wrapper actually tries to behave identically to (and substitute for) the original nvidia_icd, so I'd have counted that as 2 instances.

Ok, so it seems that the display device is currently obtained through the "directly" installed icd, which could be bad (who really knows). Could you provide the result of these debugging steps when nvidia_icd.json is removed? (And with nv_vulkan_wrapper either setting DISPLAY to garbage or using libEGL_nvidia.so.0.) That way we would be sure that only a single nvidia-driver instance (the wrapped instance) is available.

I am not completely sure that we clarified this, but re-reading some previous posts made me realize that the nvidia driver fails you even if we only have the host GPU, right? So when you run just plain vulkaninfo (without any modifying env variables) while bumblebee has the secondary graphics card physically powered off, you still get vkGetPhysicalDeviceSurfaceFormats2KHR returning ERROR_INITIALIZATION_FAILED? Also, I am not sure whether I asked you about your nvidia driver version. Could you just post that for completeness? I'd assume it is a rather new one that should support version 2 of vkGetPhysicalDeviceSurfaceFormats, but just to be safe...

jonpas commented 3 years ago

that the nvidia driver fails you even if we only have the host GPU, right? So when you run just plain vulkaninfo (without any modifying env variables) while bumblebee has the secondary graphics card physically powered off, you still get vkGetPhysicalDeviceSurfaceFormats2KHR returning ERROR_INITIALIZATION_FAILED.

If I rebind the 1060 back to vfio-pci, then primusrun/pvkrun doesn't work at all and throws the following error (optirun works though).

primus: fatal: Bumblebee daemon reported: error: [XORG] (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA graphics device!

However, that's because I also set my 1060 to be used by Bumblebee by default in /etc/bumblebee/xorg.conf.nvidia. If I also revert that to the original, I do indeed get the same error with only the 710.

$ pvkrun vulkaninfo
PrimusVK: Searching for display GPU:
PrimusVK: 0x562b1db4b860: 
PrimusVK: Searching for render GPU:
PrimusVK: 0x562b1db4b860.
PrimusVK: Got discrete gpu!
PrimusVK: Device: GeForce GT 710
PrimusVK:   Type: 2
PrimusVK: No device for the display GPU found. Are the intel-mesa drivers installed?
PrimusVK: VK_ICD_FILENAMES not set
ERROR at /build/vulkan-tools/src/Vulkan-Tools-1.2.153/vulkaninfo/vulkaninfo.h:667:vkCreateInstance failed with ERROR_INITIALIZATION_FAILED
vulkaninfo: ../nptl/pthread_mutex_lock.c:428: __pthread_mutex_lock_full: Assertion `e != ESRCH || !robust' failed.
Aborted (core dumped)

nvidia driver version

455.28

Could you provide the result of these debugging steps, when nvidia_icd.json is removed?

Same error as above (with only 710 active) after removing the Nvidia ICD. If I reactivate the 1060, also same problem, but it defaults the render to 1060 (as it should and as it was before).

I re-added the Nvidia ICD just to clarify, and now I am getting 4 display GPUs found, until I remove it again. Unsure how that happened; it looks like it is loading them both now.

PrimusVK: Searching for display GPU:
PrimusVK: 0x5595e96ad510: 
PrimusVK: 0x5595e96ad540: 
PrimusVK: 0x5595e9734760: 
PrimusVK: 0x5595e9734790: 

Here are the results of the same debugging steps (without Nvidia ICD).

// this time just one
(gdb) p *icd_term->scanned_icd 
$2 = {
  lib_name = 0xda12b0 "libnv_vulkan_wrapper.so.1",
  handle = 0x647110,
  api_version = 4198484,
  interface_version = 5,
  GetInstanceProcAddr = 0x7ffff78fd259 <vk_icdGetInstanceProcAddr(VkInstance, char const*)>,
  GetPhysicalDeviceProcAddr = 0x7ffff78fd2ab <vk_icdGetPhysicalDeviceProcAddr(VkInstance, char const*)>,
  CreateInstance = 0x7ffff3b54c40,
  EnumerateInstanceExtensionProperties = 0x7ffff3b54c30
}

(gdb) print *phys_dev_term
$3 = {
  disp = 0xda2030,
  this_icd_term = 0xe70b80,
  icd_index = 0 '\000',
  phys_dev = 0xf457d8
}

And nv_vulkan_wrapper either setting DISPLAY to garbage or using libEGL_nvidia.so.0.

Not sure what you mean with that. I did leave NV_BUMBLEBEE_DISPLAY on "hello".

felixdoerre commented 3 years ago

I've asked a friend (thanks @janisstreib) and have gotten access to a system with 2 Nvidia GPUs. I was not able to replicate the setup with vfio-pci that you are describing, maybe you can help me with that. What I've done: The system booted with the two gpus:

$ lspci | grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation GK104 [GeForce GTX 670] (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation GK107 [GeForce GTX 650] (rev a1)

With these patches PRIMUS_VK_DISPLAYID=10de:1189 PRIMUS_VK_RENDERID=10de:0fc6 pvkrun vkcube works flawlessly: https://github.com/felixdoerre/primus_vk/tree/nv_display_fixes Maybe these are enough to fix the behavior for you as well :-)

Otherwise you'd have to help me to get the vfio + bumblebee configuration working. (You don't use bbswitch in that mix, right?) Currently I have added amd_iommu=on to the kernel cmdline (it's an AMD cpu), however I believe that enabling the iommu should not be necessary. I seem to need to enable something in the BIOS which I didn't do (yet):

[    0.123712] AGP: Please enable the IOMMU option in the BIOS setup
[    0.667557] iommu: Default domain type: Translated 
[    2.166554] PCI-DMA: using GART IOMMU.
[    2.166607] PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
[    2.175356] AMD-Vi: AMD IOMMUv2 driver by Joerg Roedel <jroedel@suse.de>
[    2.175412] AMD-Vi: AMD IOMMUv2 functionality not available on this system

Regarding vfio: I have added the vfio modules to /etc/modules:

vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd

and configured vfio:

$ cat /etc/modprobe.d/vfio.conf 
options vfio-pci ids=10de:0fc6

However vfio doesn't seem to be able to get the device:

$ sudo dmesg | grep vfio
[   12.482173] vfio-pci: probe of 0000:02:00.0 failed with error -22
[   12.482238] vfio_pci: add [10de:0fc6[ffffffff:ffffffff]] class 0x000000/00000000

Even when I manually unload the nvidia driver (modprobe -r nvidia) after stopping all GUI output, and re-modprobe vfio-pci manually, it does not work and only reports the same error -22. Can you see what I did wrong? Do you think I need to get the IOMMU working? Can you share a bit more configuration details so I can get bumblebee working on that system like you have it, so I can replicate your problems better? I've never received ERROR_INITIALIZATION_FAILED from vkGetPhysicalDeviceSurfaceFormats2KHR.

jonpas commented 3 years ago

Sadly that didn't fix it. I am starting to think it's something with the way I configured the GPUs and drivers so will try to do that from scratch.

Correct, no bbswitch is in use. But yes, you do need to enable IOMMU groups in your BIOS, otherwise there is no way to decouple the GPU from the rest of the system. Since you have an AMD as well, depending on the board, you'd enable "IOMMU" and "ACS" (in the form of implementation, not kernel patch) in the BIOS.

jonpas commented 3 years ago

I disabled kernel binding the 1060 to vfio-pci on boot, so now they are both listed in nvidia-settings directly. However, still getting the exact same error.

Are you sure your display GPU is actually using the nvidia driver and not llvmpipe? Bumblebee blacklists the Nvidia driver by default in /usr/lib/bumblebee.conf.

felixdoerre commented 3 years ago

Hmm, I didn't remove /etc/modprobe.d/bumblebee.conf where the bumblebee blacklist is configured on debian, but it wasn't a problem:

$ cat /etc/modprobe.d/bumblebee.conf 
...
blacklist nvidia

lsmod shows that the nvidia driver is loaded after boot and the X-server has open file handles on /dev/nvidia0, so I guess that should indicate that the real nvidia driver is in use, right? Additionally, the log of the primary X server indicates that the nvidia driver is used:

[    26.926] (II) Loading /usr/lib/xorg/modules/drivers/nvidia_drv.so
....
[    27.268] (II) Loading /usr/lib/xorg/modules/extensions/libglxserver_nvidia.so

After a BIOS upgrade the vfio-binding now seems to work (still not having removed /etc/modprobe.d/bumblebee.conf yet):

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK104 [GeForce GTX 670] [10de:1189] (rev a1) (prog-if 00 [VGA controller])
        ....
    Kernel driver in use: nvidia
    Kernel modules: nvidia

02:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK107 [GeForce GTX 650] [10de:0fc6] (rev a1) (prog-if 00 [VGA controller])
    Subsystem: eVga.com. Corp. GK107 [GeForce GTX 650] [3842:2650]
        ....
    Kernel driver in use: vfio-pci
    Kernel modules: nvidia

However bumblebee fails to activate the secondary GPU:

[  7448.757] (II) NVIDIA GLX Module  450.66  Wed Aug 12 19:41:37 UTC 2020
[  7448.757] (II) NVIDIA: The X server supports PRIME Render Offload.
[  7448.757] (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA graphics device!
[  7448.757] (EE) NVIDIA(0): Failing initialization of X screen
[  7448.757] (II) UnloadModule: "nvidia"

This error persists even if I rmmod vfio-pci manually before trying to use optirun. I have configured bumblebee's nvidia-xorg-config to explicitly search for the correct GPU:

Section "Device"
    Identifier  "DiscreteNvidia"
    Driver      "nvidia"
    VendorName  "NVIDIA Corporation"

    BusID       "PCI:2:0:0"

    Option "ProbeAllGpus" "false"

    Option "NoLogo" "true"
    Option "UseEDID" "false"
    Option "UseDisplayDevice" "none"
EndSection
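
One detail that is easy to trip over when writing such a BusID: lspci prints bus, device and function in hexadecimal, while the BusID option in xorg.conf expects decimal numbers. A minimal conversion sketch; the helper name xorg_busid is hypothetical, PCI domain 0 is assumed, and the "0a:00.0" address is only an illustrative input, since the lspci addresses of the GT 710 / GTX 1060 system are not shown in this thread.

```python
def xorg_busid(lspci_addr):
    """Convert an lspci address such as '0a:00.0' to an Xorg BusID string.

    An optional leading domain ('0000:0a:00.0') is accepted but ignored,
    i.e. domain 0 is assumed.
    """
    parts = lspci_addr.split(":")
    bus_hex, dev_func = parts[-2], parts[-1]
    dev_hex, func_hex = dev_func.split(".")
    return f"PCI:{int(bus_hex, 16)}:{int(dev_hex, 16)}:{int(func_hex, 16)}"


print(xorg_busid("02:00.0"))
print(xorg_busid("0a:00.0"))
```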

When I do not use vfio-pci to take the graphics card out of the nvidia driver's hands, but configure the primary X server to only use one graphics card:

....
        Option      "AutoAddDevices" "false"
        Option      "AutoAddGPU" "false"
...
Section "Device"
        Identifier  "Card21"
        Driver      "nvidia"
        BusID       "PCI:1:0:0"
        Option "ProbeAllGpus" "false"
EndSection

I can reproduce the problem of the segfaulting nvidia driver in GetPhysicalDeviceSurfaceFormats2KHR. I believe, when we set the DISPLAY environment variable to garbage, we turn the nvidia driver into a "headless" mode where all the windowing-system-related functions like GetPhysicalDeviceSurfaceFormats2KHR don't work at all. For the "normal" primus-operation this is no problem, as we use the intel/mesa driver for all those functions. In this usecase this is a problem.

So from my experimenting I'd say that nvidia's vulkan driver cannot be convinced to detect 2 gpus correctly when they are only available on different X servers. I'd consider that a bug in the nvidia driver that we sadly can't work around.

If you accept binding both nvidia cards to the same X-Server (their names show up in Xorg.0.log), the nvidia driver will successfully detect both graphics cards and operate them correctly. Nvidia's "solution" to gpu offloading still does not work in this setup (when I tried it, the application said that no graphics queue could be found), but primus-vk (with the special branch I provided) will be able to render on the secondary GPU without the need to have a screen attached to it. You do not need to have nv_vulkan_wrapper installed in this case. Bumblebee is also not needed (but also does not hurt).

I was not able to reproduce the setup where one GPU is bound by vfio-pci and then later the nvidia driver takes over. I do not know how you would spawn a "second instance" of the nvidia module which would handle the secondary graphics card, that only is available later. From what I observed, the nvidia driver will only detect graphics cards once on startup and I currently see no way to "add" the graphics card which is initially occupied by vfio-pci to the nvidia driver later.

jonpas commented 3 years ago

so I guess that should indicate that the real nvidia driver is in use, right?

Most likely. I found the most sure way to know is to run glxinfo | grep renderer and if the renderer string is your GPU name, it uses Nvidia driver, otherwise it will say llvmpipe.
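
The check described above could be scripted roughly like this. It is a sketch, assuming the "OpenGL renderer string:" line format shown earlier in the thread; the helper names are hypothetical and the llvmpipe sample string is made up for illustration.

```python
def renderer_of(glxinfo_output):
    """Return the text after 'OpenGL renderer string:' or None if absent."""
    for line in glxinfo_output.splitlines():
        if line.startswith("OpenGL renderer string:"):
            return line.split(":", 1)[1].strip()
    return None


def is_software_rendering(renderer):
    """True if the renderer string names Mesa's llvmpipe software rasterizer."""
    return renderer is not None and "llvmpipe" in renderer.lower()


hw = renderer_of("OpenGL renderer string: GeForce GT 710/PCIe/SSE2")
sw = renderer_of("OpenGL renderer string: llvmpipe (LLVM 10.0.1, 256 bits)")  # illustrative
print(hw, is_software_rendering(hw))
print(sw, is_software_rendering(sw))
```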

However bumblebee fails to activate the secondary GPU: This error persists even if I rmmod vfio-pci manually before trying to use optirun.

You do have to rebind the driver to nvidia. I do that in my VM script (which has an option of just straight up rebinding for my system).
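
The VM script itself is not shown in this thread. As a rough sketch of what rebinding a PCI device usually involves, the generic sysfs interface can be used: unbind from the current driver, set driver_override, then ask the kernel to re-probe the device. The function names are hypothetical, the writes need root, the target driver module must already be loaded, and the driver/unbind path only exists while some driver is bound.

```python
def rebind_plan(pci_addr, new_driver):
    """Return the (sysfs path, value) writes that move pci_addr to new_driver.

    pci_addr uses the full form with domain, e.g. '0000:02:00.0'.
    """
    dev = f"/sys/bus/pci/devices/{pci_addr}"
    return [
        (f"{dev}/driver/unbind", pci_addr),        # detach the current driver
        (f"{dev}/driver_override", new_driver),    # pin the driver to use next
        ("/sys/bus/pci/drivers_probe", pci_addr),  # trigger a new probe
    ]


def apply_plan(plan):
    """Perform the writes. Requires root; not called in this example."""
    for path, value in plan:
        with open(path, "w") as f:
            f.write(value)


plan = rebind_plan("0000:02:00.0", "nvidia")
for path, value in plan:
    print(f"echo {value} > {path}")
```

The example only prints the equivalent shell writes, so it can be inspected before anything touches the system.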

If you accept binding both nvidia cards to the same X-Server

I will try that again tomorrow. Sadly binding to Nvidia at the start then fails to rebind into vfio-pci as the driver keeps a hold on it unless X session is terminated (unwanted). Maybe I can figure out how to unbind it gracefully from that and rebind using the same driver instance.

felixdoerre commented 3 years ago

First of all, thanks for explaining how you rebind the pci-devices between drivers. Works perfectly :-)

After very much experimenting, I believe I've got it running with the 2 gpus on 2 different X-Servers:

$ PRIMUS_VK_DISPLAYID=10de:1189 PRIMUS_VK_RENDERID=10de:0fc6 pvkrun vkcube
0x55f69307a688
PrimusVK: Searching for display GPU:
PrimusVK: 0x55f693595120: 4318;4038
PrimusVK: 0x55f693595150: 4318;4489
PrimusVK: Got device from env!
PrimusVK: Device: GeForce GTX 670
PrimusVK:   Type: 2
PrimusVK: Searching for render GPU:
PrimusVK: 0x55f693595120.
PrimusVK: Got device from env!
PrimusVK: Device: GeForce GTX 650
PrimusVK:   Type: 2
PrimusVK: fetching dispatch for 0x55f693663430
PrimusVK: Creating display device finished!: 0
PrimusVK: fetching dispatch for 0x55f6935956f0
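
The pairs printed after each device pointer (for example 4318;4038) appear to be the vendor and device id in decimal; that reading is an inference, but it is consistent with the hexadecimal ids passed on the command line. A small conversion sketch, with a hypothetical helper name:

```python
def env_id_from_log_pair(pair):
    """Convert a 'vendor;device' decimal pair to the 'vvvv:dddd' hex form."""
    vendor, device = (int(part) for part in pair.split(";"))
    return f"{vendor:04x}:{device:04x}"


for pair in ("4318;4038", "4318;4489"):
    print(pair, "->", env_id_from_log_pair(pair))
```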

I have pushed the necessary hacks to the branch nv_display_fixes that I've already shared. The setup/configuration is a bit different than with "usual" primus-vk. Most importantly: you must not disable the original nvidia_icd.json in /usr/share/vulkan/icd.d, as this is used to load the driver for :0. nv_vulkan_wrapper is used to obtain the instance for :8.

```
$ cat /etc/xorg.conf
Section "ServerLayout"
...
        Option      "AutoAddDevices" "false"
        Option      "AutoAddGPU" "false"
EndSection
...
Section "Device"
        Identifier  "Card21"
        Driver      "nvidia"
        BusID       "PCI:1:0:0"
        Option "ProbeAllGpus" "false"
EndSection

$ cat /etc/bumblebee/xorg.conf.nvidia
Section "ServerLayout"
    Identifier  "Layout0"
    Option      "AutoAddDevices" "false"
    Option      "AutoAddGPU" "false"
EndSection

Section "Device"
    Identifier  "DiscreteNvidia"
    Driver      "nvidia"
    VendorName  "NVIDIA Corporation"
    BusID       "PCI:2:0:0"
    Option "ProbeAllGpus" "false"
    Option "NoLogo" "true"
    Option "UseEDID" "false"
    Option "UseDisplayDevice" "none"
EndSection

$ ls /usr/share/vulkan/icd.d/
intel_icd.x86_64.json  nvidia_icd.json  nv_vulkan_wrapper.json  radeon_icd.x86_64.json

$ cat /usr/share/vulkan/icd.d/nvidia_icd.json
{
    "file_format_version" : "1.0.0",
    "ICD": {
        "library_path": "libGLX_nvidia.so.0",
        "api_version" : "1.2.133"
    }
}

$ cat /usr/share/vulkan/icd.d/nv_vulkan_wrapper.json
{
    "file_format_version" : "1.0.0",
    "ICD": {
        "library_path": "libnv_vulkan_wrapper.so.1",
        "api_version" : "1.1.84"
    }
}
```
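
Since this setup depends on both manifests being present, a quick way to check which libraries the ICD manifests in a directory point to might look like the sketch below. The function name icd_libraries is hypothetical, and the example runs against a temporary directory filled with copies of the two manifests quoted above rather than against /usr/share/vulkan/icd.d.

```python
import json
import pathlib
import tempfile


def icd_libraries(icd_dir):
    """Map manifest file name -> ICD library_path for every *.json in icd_dir."""
    libs = {}
    for manifest in sorted(pathlib.Path(icd_dir).glob("*.json")):
        data = json.loads(manifest.read_text())
        libs[manifest.name] = data.get("ICD", {}).get("library_path")
    return libs


with tempfile.TemporaryDirectory() as tmp:
    for name, lib, api in [
        ("nvidia_icd.json", "libGLX_nvidia.so.0", "1.2.133"),
        ("nv_vulkan_wrapper.json", "libnv_vulkan_wrapper.so.1", "1.1.84"),
    ]:
        manifest = {
            "file_format_version": "1.0.0",
            "ICD": {"library_path": lib, "api_version": api},
        }
        (pathlib.Path(tmp) / name).write_text(json.dumps(manifest))
    libs = icd_libraries(tmp)

for name, lib in libs.items():
    print(f"{name}: {lib}")
```

On a real system the directory argument would be the icd.d path from the listing above.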
jonpas commented 3 years ago

Nice! However, I am getting a segfault with that:

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7aff401 in StaticInitialize::StaticInitialize (this=0x7ffff7b02140 <init>) at nv_vulkan_wrapper.cpp:47
47      GLXFBConfig *fbc = glXChooseFBConfig(dpy, DefaultScreen(dpy), 0, &nelements);

My details are identical to yours (with different IDs).

felixdoerre commented 3 years ago

Interesting, can you provide more details about the segfault? What is null? dpy? Do you still have the (local) patch of changing NV_BUMBLEBEE_DISPLAY to some bogus value? That change should be reverted. You will need NV_BUMBLEBEE_DISPLAY set to :8.

jonpas commented 3 years ago

Ah yes, that was stupid of me to forget! Completely forgot about it due to the new commit.

$ PRIMUS_VK_DISPLAYID=10de:128b PRIMUS_VK_RENDERID=10de:1c03 pvkrun vkcube
0x55a573d413a8
PrimusVK: Searching for display GPU:
PrimusVK: 0x55a573cf7ae0: 4318;7171
PrimusVK: 0x55a573cf7b10: 4318;4747
PrimusVK: Got device from env!
PrimusVK: Device: GeForce GT 710
PrimusVK:   Type: 2
PrimusVK: Searching for render GPU:
PrimusVK: 0x55a573cf7ae0.
PrimusVK: Got device from env!
PrimusVK: Device: GeForce GTX 1060 6GB
PrimusVK:   Type: 2
PrimusVK: fetching dispatch for 0x55a573c5a0b0
PrimusVK: Creating display device finished!: 0
PrimusVK: fetching dispatch for 0x55a573d07b50
PrimusVK: CreateDevice done
PrimusVK: Querying surface formats (base) returned: 0
PrimusVK: Querying surface formats (base) returned: 0
PrimusVK: Querying surface formats: 2

Works wonders now, awesome work, it is rare to see such in-depth debug help!

I will also test any additional changes, if you make any, and again once this reaches master.

felixdoerre commented 3 years ago

Ooh, I just noticed that the bogus display was actually my fault: I blindly rebased onto the "old" nv_display_fixes, not noticing that it still contained that blind try. I am still thinking about how this would best be merged into master. Probably this "alternative" way of obtaining the nvidia driver is more reliable in general, and I'd suggest it for all other users as well. But I guess I'd need to clean up the code and use it myself for a few days to be sure :D.

jonpas commented 3 years ago

Haha, no I've seen it, just didn't think on it, so no worries.

Yeah, let me know here and I'll help testing. :)

felixdoerre commented 3 years ago

I cleaned up the nv-fixes a few days ago and have been using them myself since, to get a feeling for how stable they are. Now I feel confident enough that they are fit for everyone. So could you re-test master?

jonpas commented 3 years ago

Tested vkcube and a Steam title, still works well!

felixdoerre commented 3 years ago

I am closing this issue as we resolved it. Feel free to reopen it or create a new one if you have other issues.