NVIDIA / egl-wayland

The EGLStream-based Wayland external platform
MIT License

vk_icdGetInstanceProcAddr fails to return a valid vkCreateInstance when drivers are not (fully) initialised #97

Closed tim-rex closed 7 months ago

tim-rex commented 7 months ago

This may not be the right project for this issue; please advise if that is the case, as it is very probably one for the driver team. It's quite minor, as things go.

I have two scenarios where vulkaninfo will report

ERROR: [Loader Message] Code 0 : loader_scanned_icd_add: Could not get 'vkCreateInstance' via 'vk_icdGetInstanceProcAddr' for ICD libGLX_nvidia.so.0

Scenario 1) When the nVidia drivers are not loaded at all (example: A dual GPU configuration and booting up with nvidia drivers disabled).

Scenario 2) When the nVidia drivers are enabled but have not been fully initialized. A bit of a deep dive this one. See thread here

The second scenario is perhaps harder to reason about, so I'll focus on the first. I have the nvidia drivers blacklisted, and can confirm they are not loaded. Nothing will attempt to load them and nvidia-modprobe is out of the picture.

Under normal circumstances, a tool such as vulkaninfo would query all available devices via their ICD. In the case where nvidia drivers are not loaded at all, I would expect no such errors and for there simply to not be a device to query.

I'm unfamiliar with what the ICD loader should expect from a driver in a scenario where the drivers are not loaded at all. However, I note that when I am running with nvidia drivers only, no such errors come from the radeon ICD.

I could work around this myself of course (various override mechanisms are defined here)
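As an illustration of one such override, here is a minimal sketch that sets the loader's `VK_LOADER_DRIVERS_DISABLE` filter before launching a program. It assumes a Vulkan loader recent enough to support that variable (it was added around 1.3.234, as far as I know) and that the filter is matched against the manifest file name, so `*nvidia*` should match `nvidia_icd.json`. The helper names `disable_drivers_matching` and `run_without_nvidia_icd` are made up for this example.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Ask the Vulkan loader to skip every driver manifest whose file name
   matches the glob. Returns 0 on success, -1 on failure (as setenv does). */
int disable_drivers_matching(const char *glob)
{
    return setenv("VK_LOADER_DRIVERS_DISABLE", glob, 1);
}

/* Hypothetical wrapper: exec a command with the NVIDIA manifest filtered
   out. Only returns (with -1) if something went wrong. */
int run_without_nvidia_icd(char *const argv[])
{
    if (disable_drivers_matching("*nvidia*") != 0) {
        perror("setenv");
        return -1;
    }
    execvp(argv[0], argv); /* replaces this process on success */
    perror("execvp");
    return -1;
}
```

Calling `run_without_nvidia_icd` with an argument vector such as `{"vulkaninfo", "--summary", NULL}` should keep the loader from scanning the NVIDIA ICD at all. This hides the error message, but does not address why the ICD returns NULL.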

Looking at the documentation for Driver Entry Point Discovery, it states:

VKAPI_ATTR PFN_vkVoidFunction VKAPI_CALL
   vk_icdGetInstanceProcAddr(
      VkInstance instance,
      const char* pName);

This function has very similar semantics to vkGetInstanceProcAddr. vk_icdGetInstanceProcAddr returns valid function pointers for all the global-level and instance-level Vulkan functions, and also for vkGetDeviceProcAddr. Global-level functions are those which contain no dispatchable object as the first parameter, such as vkCreateInstance and vkEnumerateInstanceExtensionProperties.

I suppose that means vkGetInstanceProcAddr should be available from libGLX_nvidia.so.0 via vk_icdGetInstanceProcAddr, regardless of whether the drivers are fully initialised or not. Beyond that, I'm not sure how the device not being available should be negotiated.
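To reproduce the check outside of vulkaninfo, here is a rough sketch of the lookup that the error message describes: open the ICD library, find `vk_icdGetInstanceProcAddr`, and ask it for `vkCreateInstance`. It is not the loader's actual code. The real loader also negotiates an interface version with the ICD first, which this skips. The `probe_icd` function, its result enum and the stand-in typedefs are all invented for this example so that it builds without the Vulkan headers.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Stand-ins for the Vulkan types (assumption: Linux, where VkInstance is a
   pointer-sized dispatchable handle and VKAPI_CALL adds no attribute). */
typedef void (*void_fn)(void);
typedef void_fn (*icd_gipa_fn)(void *instance, const char *name);

enum probe_result {
    PROBE_OK,             /* vkCreateInstance was returned */
    PROBE_NO_LIBRARY,     /* dlopen failed */
    PROBE_NO_ENTRY_POINT, /* vk_icdGetInstanceProcAddr is not exported */
    PROBE_NULL_CREATE     /* entry point returned NULL for vkCreateInstance */
};

enum probe_result probe_icd(const char *library_path)
{
    void *lib = dlopen(library_path, RTLD_NOW | RTLD_LOCAL);
    if (lib == NULL)
        return PROBE_NO_LIBRARY;

    enum probe_result result = PROBE_OK;

    /* POSIX allows converting the void* from dlsym to a function pointer. */
    icd_gipa_fn gipa = (icd_gipa_fn)dlsym(lib, "vk_icdGetInstanceProcAddr");
    if (gipa == NULL)
        result = PROBE_NO_ENTRY_POINT;
    else if (gipa(NULL, "vkCreateInstance") == NULL)
        result = PROBE_NULL_CREATE; /* the case the loader complains about */

    dlclose(lib);
    return result;
}
```

With the nvidia kernel modules blacklisted, I would expect `probe_icd("libGLX_nvidia.so.0")` to come back as `PROBE_NULL_CREATE`, which would match the loader message above. I have not confirmed that. Link with `-ldl` on older glibc versions.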

Here's the full output of vulkaninfo

[~]$ vulkaninfo --summary
ERROR: [Loader Message] Code 0 : loader_scanned_icd_add: Could not get 'vkCreateInstance' via 'vk_icdGetInstanceProcAddr' for ICD libGLX_nvidia.so.0
==========
VULKANINFO
==========

Vulkan Instance Version: 1.3.268

Instance Extensions: count = 23
-------------------------------
VK_EXT_acquire_drm_display             : extension revision 1
VK_EXT_acquire_xlib_display            : extension revision 1
VK_EXT_debug_report                    : extension revision 10
VK_EXT_debug_utils                     : extension revision 2
VK_EXT_direct_mode_display             : extension revision 1
VK_EXT_display_surface_counter         : extension revision 1
VK_EXT_surface_maintenance1            : extension revision 1
VK_EXT_swapchain_colorspace            : extension revision 4
VK_KHR_device_group_creation           : extension revision 1
VK_KHR_display                         : extension revision 23
VK_KHR_external_fence_capabilities     : extension revision 1
VK_KHR_external_memory_capabilities    : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_display_properties2         : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2       : extension revision 1
VK_KHR_portability_enumeration         : extension revision 1
VK_KHR_surface                         : extension revision 25
VK_KHR_surface_protected_capabilities  : extension revision 1
VK_KHR_wayland_surface                 : extension revision 6
VK_KHR_xcb_surface                     : extension revision 6
VK_KHR_xlib_surface                    : extension revision 6
VK_LUNARG_direct_driver_loading        : extension revision 1

Instance Layers: count = 12
---------------------------
VK_LAYER_INTEL_nullhw             INTEL NULL HW                              1.1.73   version 1
VK_LAYER_KHRONOS_profiles         Khronos Profiles layer                     1.3.268  version 1
VK_LAYER_KHRONOS_shader_object    Khronos Shader object layer                1.3.268  version 1
VK_LAYER_KHRONOS_synchronization2 Khronos Synchronization2 layer             1.3.268  version 1
VK_LAYER_KHRONOS_validation       Khronos Validation Layer                   1.3.268  version 1
VK_LAYER_LUNARG_api_dump          LunarG API dump layer                      1.3.268  version 2
VK_LAYER_LUNARG_gfxreconstruct    GFXReconstruct Capture Layer Version 1.0.1 1.3.268  version 4194305
VK_LAYER_LUNARG_monitor           Execution Monitoring Layer                 1.3.268  version 1
VK_LAYER_LUNARG_screenshot        LunarG image capture layer                 1.3.268  version 1
VK_LAYER_MESA_device_select       Linux device selection layer               1.3.211  version 1
VK_LAYER_MESA_overlay             Mesa Overlay layer                         1.3.211  version 1
VK_LAYER_NV_optimus               NVIDIA Optimus layer                       1.3.260  version 1

Devices:
========
GPU0:
    apiVersion         = 1.3.255
    driverVersion      = 23.2.1
    vendorID           = 0x1002
    deviceID           = 0x67df
    deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
    deviceName         = AMD Radeon RX 580 Series (RADV POLARIS10)
    driverID           = DRIVER_ID_MESA_RADV
    driverName         = radv
    driverInfo         = Mesa 23.2.1-arch1.2
    conformanceVersion = 1.2.7.1
    deviceUUID         = 00000000-0200-0000-0000-000000000000
    driverUUID         = 414d442d-4d45-5341-2d44-525600000000
GPU1:
    apiVersion         = 1.3.255
    driverVersion      = 0.0.1
    vendorID           = 0x10005
    deviceID           = 0x0000
    deviceType         = PHYSICAL_DEVICE_TYPE_CPU
    deviceName         = llvmpipe (LLVM 16.0.6, 256 bits)
    driverID           = DRIVER_ID_MESA_LLVMPIPE
    driverName         = llvmpipe
    driverInfo         = Mesa 23.2.1-arch1.2 (LLVM 16.0.6)
    conformanceVersion = 1.3.1.1
    deviceUUID         = 6d657361-3233-2e32-2e31-2d6172636800
    driverUUID         = 6c6c766d-7069-7065-5555-494400000000

Here's the nvidia ICD definition

[~]$ cat /usr/share/vulkan/icd.d/nvidia_icd.json 
{
    "file_format_version" : "1.0.0",
    "ICD": {
        "library_path": "libGLX_nvidia.so.0",
        "api_version" : "1.3.260"
    }
}

erik-kz commented 7 months ago

Would you mind sending this as an email to "vulkan-support@nvidia.com"? That is our mailing list for Vulkan developer support. It's monitored by folks who should be able to answer your question.

tim-rex commented 7 months ago

Thanks @erik-kz, much appreciated. I'll close this issue and report back if anything relevant comes back.

tim-rex commented 6 months ago

Response from nVidia follows

We know what causes this issue, and we're discussing internally whether it should be considered a loader bug or a driver bug (I.e., whether it's valid for an ICD to return NULL when queried for these bootstrap-level function pointers if driver init has failed).