NVIDIA / egl-wayland

The EGLStream-based Wayland external platform
MIT License

Swapchain creation on Wayland not possible #94

Status: Closed (wolfpld closed 2 months ago)

wolfpld commented 7 months ago

Nvidia driver 545.29.02 fails with VK_ERROR_INITIALIZATION_FAILED when trying to create a swapchain. I do not believe there are any conditions listed in the documentation that would allow such a return value.

Please see the minimal (sigh) example below to reproduce the issue. The example follows the minimal path required to print the vkCreateSwapchainKHR return value and then exits. Physical devices are listed when the example is run, and you have to select one of them as the first parameter of the executable.

The example does the following:

  1. Establishes a minimal Wayland server connection, enough to get the wl_compositor and use it to create a wl_surface.
  2. Creates a VkInstance with the VK_KHR_surface and VK_KHR_wayland_surface instance extensions enabled.
  3. Creates a VkSurfaceKHR with vkCreateWaylandSurfaceKHR, using the wl_surface obtained earlier.
  4. Selects a VkPhysicalDevice.
  5. Checks that the physical device supports the VK_KHR_swapchain device extension.
  6. Checks with vkGetPhysicalDeviceSurfaceSupportKHR that at least one queue family can present on the obtained VkSurfaceKHR.
  7. Creates a VkDevice.
  8. Probes the supported surface formats and the minimum image count.
  9. Creates a VkSwapchainKHR.

My machine has two GPUs, an integrated AMD GPU that drives the display, and a dedicated Nvidia GPU.

This is the result of running with the Nvidia GPU:

% ./a.out 0
Found 3 physical devices
  Physical device 0: NVIDIA GeForce RTX 3050 Laptop GPU
  Physical device 1: AMD Radeon Graphics (RADV REMBRANDT)
  Physical device 2: llvmpipe (LLVM 16.0.6, 256 bits)
Using physical device 0
Can present on WaylandSurfaceKHR with queue family 0
Using surface format VK_FORMAT_A2B10G10R10_UNORM_PACK32 / VK_COLOR_SPACE_SRGB_NONLINEAR_KHR
Min image count: 2
vkCreateSwapchainKHR: VK_ERROR_INITIALIZATION_FAILED

And this is with the two remaining devices:

% ./a.out 1
Can present on WaylandSurfaceKHR with queue family 0
Using surface format VK_FORMAT_A2B10G10R10_UNORM_PACK32 / VK_COLOR_SPACE_SRGB_NONLINEAR_KHR
Min image count: 4
vkCreateSwapchainKHR: VK_SUCCESS
% ./a.out 2
Can present on WaylandSurfaceKHR with queue family 0
Using surface format VK_FORMAT_B8G8R8A8_SRGB / VK_COLOR_SPACE_SRGB_NONLINEAR_KHR
Min image count: 4
vkCreateSwapchainKHR: VK_SUCCESS

The example program follows:

// g++ nvidia.cpp `pkg-config --libs --cflags wayland-client vulkan`

#include <algorithm>
#include <array>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <vector>
#include <vulkan/vk_enum_string_helper.h>
#include <vulkan/vulkan.h>
#include <vulkan/vulkan_wayland.h>
#include <wayland-client.h>

wl_compositor* compositor = nullptr;

static void RegistryGlobal( void*, wl_registry* reg, uint32_t name, const char* interface, uint32_t version )
{
    if( strcmp( interface, "wl_compositor" ) == 0 )
    {
        compositor = (wl_compositor*)wl_registry_bind( reg, name, &wl_compositor_interface, 3 );
    }
}

constexpr wl_registry_listener registryListener = {
    .global = RegistryGlobal,
};

static bool IsExtensionAvailable( VkPhysicalDevice physDev, const char* extensionName )
{
    uint32_t count;
    vkEnumerateDeviceExtensionProperties( physDev, nullptr, &count, nullptr );
    std::vector<VkExtensionProperties> extensionProperties( count );
    vkEnumerateDeviceExtensionProperties( physDev, nullptr, &count, extensionProperties.data() );
    std::sort( extensionProperties.begin(), extensionProperties.end(),
        []( const auto& lhs, const auto& rhs ) { return strcmp( lhs.extensionName, rhs.extensionName ) < 0; } );
    auto it = std::lower_bound( extensionProperties.begin(), extensionProperties.end(), extensionName,
        []( const auto& lhs, const auto& rhs ) { return strcmp( lhs.extensionName, rhs ) < 0; } );
    return it != extensionProperties.end() && strcmp( it->extensionName, extensionName ) == 0;
}

int main( int argc, char** argv )
{
    auto dpy = wl_display_connect( nullptr );
    wl_registry_add_listener( wl_display_get_registry( dpy ), &registryListener, nullptr );
    wl_display_roundtrip( dpy );
    auto surface = wl_compositor_create_surface( compositor );

    constexpr std::array extensions = {
        VK_KHR_SURFACE_EXTENSION_NAME,
        VK_KHR_WAYLAND_SURFACE_EXTENSION_NAME,
    };
    VkInstanceCreateInfo instanceInfo = {
        .sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO,
        .enabledExtensionCount = extensions.size(),
        .ppEnabledExtensionNames = extensions.data()
    };
    VkInstance vkInstance;
    vkCreateInstance( &instanceInfo, nullptr, &vkInstance );

    VkWaylandSurfaceCreateInfoKHR createInfo = {
        .sType = VK_STRUCTURE_TYPE_WAYLAND_SURFACE_CREATE_INFO_KHR,
        .display = dpy,
        .surface = surface
    };
    VkSurfaceKHR vkSurface;
    vkCreateWaylandSurfaceKHR( vkInstance, &createInfo, nullptr, &vkSurface );

    uint32_t count;
    vkEnumeratePhysicalDevices( vkInstance, &count, nullptr );
    std::vector<VkPhysicalDevice> devices( count );
    vkEnumeratePhysicalDevices( vkInstance, &count, devices.data() );

    printf( "Found %i physical devices\n", count );
    for( uint32_t i=0; i<count; i++ )
    {
        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties( devices[i], &props );
        printf( "  Physical device %i: %s\n", i, props.deviceName );
    }
    if( argc < 2 )
    {
        printf( "Usage: %s <physical device index>\n", argv[0] );
        return 1;
    }
    const auto devsel = atoi( argv[1] );
    if( devsel >= count )
    {
        printf( "Invalid physical device index\n" );
        return 1;
    }
    printf( "Using physical device %i\n", devsel );

    VkPhysicalDevice physDev = devices[devsel];

    if( !IsExtensionAvailable( physDev, VK_KHR_SWAPCHAIN_EXTENSION_NAME ) )
    {
        printf( "VK_KHR_swapchain extension not supported by physical device\n" );
        return 1;
    }

    vkGetPhysicalDeviceQueueFamilyProperties( physDev, &count, nullptr );
    std::vector<VkQueueFamilyProperties> queueFamilyProperties( count );
    vkGetPhysicalDeviceQueueFamilyProperties( physDev, &count, queueFamilyProperties.data() );

    bool surfaceSupported = false;
    uint32_t queueIndex;
    for( queueIndex=0; queueIndex<count; queueIndex++ )
    {
        VkBool32 supported;
        vkGetPhysicalDeviceSurfaceSupportKHR( physDev, queueIndex, vkSurface, &supported );
        if( supported )
        {
            printf( "Can present on WaylandSurfaceKHR with queue family %i\n", queueIndex );
            surfaceSupported = true;
            break;
        }
    }
    if( !surfaceSupported )
    {
        printf( "Presenting on WaylandSurfaceKHR not supported\n" );
        return 1;
    }

    const float queuePriority = 1.0f;
    VkDeviceQueueCreateInfo queueInfo = {
        .sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO,
        .queueFamilyIndex = queueIndex,
        .queueCount = 1,
        .pQueuePriorities = &queuePriority
    };
    constexpr std::array devExtensions = {
        VK_KHR_SWAPCHAIN_EXTENSION_NAME,
    };
    VkDeviceCreateInfo deviceInfo = {
        .sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
        .queueCreateInfoCount = 1,
        .pQueueCreateInfos = &queueInfo,
        .enabledExtensionCount = devExtensions.size(),
        .ppEnabledExtensionNames = devExtensions.data()
    };
    VkDevice vkDevice;
    vkCreateDevice( physDev, &deviceInfo, nullptr, &vkDevice );

    vkGetPhysicalDeviceSurfaceFormatsKHR( physDev, vkSurface, &count, nullptr );
    std::vector<VkSurfaceFormatKHR> surfaceFormats( count );
    vkGetPhysicalDeviceSurfaceFormatsKHR( physDev, vkSurface, &count, surfaceFormats.data() );
    printf( "Using surface format %s / %s\n", string_VkFormat( surfaceFormats[0].format ), string_VkColorSpaceKHR( surfaceFormats[0].colorSpace ) );

    VkSurfaceCapabilitiesKHR surfaceCaps;
    vkGetPhysicalDeviceSurfaceCapabilitiesKHR( physDev, vkSurface, &surfaceCaps );
    printf( "Min image count: %i\n", surfaceCaps.minImageCount );

    VkSwapchainCreateInfoKHR swapchainInfo = {
        .sType = VK_STRUCTURE_TYPE_SWAPCHAIN_CREATE_INFO_KHR,
        .surface = vkSurface,
        .minImageCount = surfaceCaps.minImageCount,
        .imageFormat = surfaceFormats[0].format,
        .imageColorSpace = surfaceFormats[0].colorSpace,
        .imageExtent = { 640, 480 },
        .imageArrayLayers = 1,
        .imageUsage = VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT,
        .imageSharingMode = VK_SHARING_MODE_EXCLUSIVE,
        .preTransform = VK_SURFACE_TRANSFORM_IDENTITY_BIT_KHR,
        .compositeAlpha = VK_COMPOSITE_ALPHA_OPAQUE_BIT_KHR,
        .presentMode = VK_PRESENT_MODE_FIFO_KHR,
        .clipped = VK_TRUE,
    };

    VkSwapchainKHR vkSwapchain;
    auto res = vkCreateSwapchainKHR( vkDevice, &swapchainInfo, nullptr, &vkSwapchain );
    printf( "vkCreateSwapchainKHR: %s\n", string_VkResult( res ) );
}

erik-kz commented 7 months ago

Obligatory question... do you have the nvidia_drm module loaded with parameter "modeset=1"? You can check by reading /sys/module/nvidia_drm/parameters/modeset.
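
The check described above can be sketched in code. This is a minimal sketch, not part of the driver or of the original example: the function name ReadModesetParam is hypothetical, and it assumes the parameter file contains a single Y or N character, as reported in this thread. The file may not be readable by unprivileged processes, in which case the function returns no value.

```cpp
#include <fstream>
#include <optional>
#include <string>

// Hypothetical helper: reads a boolean kernel module parameter from sysfs.
// Returns true for 'Y', false for 'N', and std::nullopt when the file cannot
// be opened (module not loaded, insufficient permissions) or holds anything else.
std::optional<bool> ReadModesetParam( const std::string& path = "/sys/module/nvidia_drm/parameters/modeset" )
{
    std::ifstream file( path );
    if( !file ) return std::nullopt;

    char value = 0;
    if( !( file >> value ) ) return std::nullopt;

    if( value == 'Y' ) return true;
    if( value == 'N' ) return false;
    return std::nullopt;
}
```

A caller should treat the empty result as "unknown" rather than as "disabled", since a failure to open the file says nothing about the actual setting.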

wolfpld commented 7 months ago

No, the value I get is N. When I enable it (the value becomes Y), then the vkCreateSwapchainKHR call succeeds.

I'd say this is highly unintuitive, as I would assume that in my case the mode setting is done by the AMD driver, not Nvidia's.

erik-kz commented 7 months ago

"I'd say this is highly unintuitive"

Yes, it is. Originally that parameter was just for modesetting functionality, but over the years we've added other things that require it. Eventually it will become the default, but currently it can cause problems for some workstation SLI configurations.

wolfpld commented 7 months ago

Ok, so the driver uses DRM both for mode setting and for transferring images with PRIME, but naming things is hard and changing already established conventions will break things. That's understandable.

The problem is that the vkGetPhysicalDeviceSurfaceSupportKHR call reports that the driver is able to present on the surface, even when it isn't. Applications will typically implement some kind of GPU ranking system to select the best available GPU, and Nvidia will often win in this ranking.

The end result is that applications fail with a cryptic error that the documentation says shouldn't happen. The application could have used the other GPU instead if the Nvidia driver had told it correctly that it could not render on the surface provided.

erik-kz commented 7 months ago

That's a fair point. It should be possible for us to detect whether modeset is enabled during device initialization and only advertise support for Wayland surfaces if so. I've filed an internal bug to implement that.

vorporeal commented 3 months ago

Any update here?

Additionally, is there any way for an application developer to detect that this will occur, so we can skip over the nVidia Vulkan device and pick a different one? We can't simply read /sys/module/nvidia_drm/parameters/modeset, as that requires root privileges. I could check lsmod output to see if nvidia_drm is in the list, but that doesn't tell us anything about whether modesetting is enabled.
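
The module-presence check mentioned above can be sketched without spawning lsmod, which formats the contents of /proc/modules. This is only a sketch under that assumption: the function name IsModuleListed is hypothetical, and, as noted above, a positive result says nothing about whether modesetting is enabled.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: returns true when a module with exactly the given name
// appears in a listing in the /proc/modules format, where the module name is
// the first whitespace-separated field on each line.
bool IsModuleListed( std::istream& modules, const std::string& moduleName )
{
    std::string line;
    while( std::getline( modules, line ) )
    {
        std::istringstream fields( line );
        std::string name;
        if( ( fields >> name ) && name == moduleName ) return true;
    }
    return false;
}
```

In an application, the stream would be a std::ifstream opened on /proc/modules and the name would be "nvidia_drm".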

erik-kz commented 3 months ago

In the next major release, 555, we will not advertise support for Wayland surfaces when nvidia-drm is not loaded with modeset=1.

vorporeal commented 3 months ago

Fantastic; glad to hear it!

Until that goes live and is adopted by distributions (I'm expecting it could take quite a while to come to Ubuntu 20.04, for example), how can we detect and/or work around this? I'd rather not blanket skip over any nVidia device when using Wayland.

erik-kz commented 3 months ago

Could you just have your application try a different device if swapchain creation fails?
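
The fallback suggested above can be sketched independently of Vulkan. This is a minimal sketch under stated assumptions: FirstUsableCandidate is a hypothetical name, the candidates are assumed to be already sorted by the application's own ranking, and the tryCreate callable stands in for whatever the application does to create a device and a swapchain.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper: walks the candidates in ranked order and returns the
// index of the first one for which tryCreate returns true. Returns
// std::nullopt when every candidate fails.
template<typename Candidate, typename TryCreate>
std::optional<std::size_t> FirstUsableCandidate( const std::vector<Candidate>& ranked, TryCreate&& tryCreate )
{
    for( std::size_t i = 0; i < ranked.size(); i++ )
    {
        if( tryCreate( ranked[i] ) ) return i;
    }
    return std::nullopt;
}
```

In a Vulkan application, the candidates would be VkPhysicalDevice handles, and the callable would create a VkDevice, call vkCreateSwapchainKHR, and return whether the result was VK_SUCCESS, destroying the device again on failure.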

vorporeal commented 3 months ago

Yeah, fair - the library we're using on top of Vulkan doesn't make this possible at the moment, but I can address the issue at that level. Thanks!

erik-kz commented 3 months ago

For interest's sake, internally we use a vendor-specific DRM ioctl to determine whether modeset=1 is set. Specifically, we check the supports_alloc field of DRM_IOCTL_NVIDIA_GET_DEV_INFO, whose implementation you can find here: https://github.com/NVIDIA/open-gpu-kernel-modules/blob/476bd34534a9389eedff73464d3f2fa5912f09ae/kernel-open/nvidia-drm/nvidia-drm-drv.c#L744

I would strongly discourage external applications or libraries from using that, though, since the interface is not guaranteed to be stable between driver versions. That's not an issue for us only because our user-space components and our kernel modules are version-locked.