vulkaninfo exits early if instance creation fails for one GPU of several #491

Open brianpaul opened 3 years ago

brianpaul commented 3 years ago

I have a multi-GPU setup (built-in Intel GPU, external GPU enclosure with AMD). If both the Intel and AMD GPUs are available (powered on, kernel modules loaded, etc), vulkaninfo works as expected, printing details of both GPUs.

However, if the external GPU is not available, vulkaninfo exits early with an error:

/build/vulkan-tools- failed with ERROR_OUT_OF_HOST_MEMORY

No info about the available Intel GPU is printed.

The problem is caused by two issues:

  1. I think the AMDVLK driver is erroneously returning VK_ERROR_OUT_OF_HOST_MEMORY instead of VK_ERROR_INITIALIZATION_FAILED. I'll contact AMD about that.
  2. Unlike other error codes, vulkaninfo gives up upon VK_ERROR_OUT_OF_HOST_MEMORY. See

I looked at commit 7fc1edea087f77c165fdfad060bc07481526b39e but it's not clear to me why VK_ERROR_OUT_OF_HOST_MEMORY is handled specially. My issue is fixed if I simply don't check for VK_ERROR_OUT_OF_HOST_MEMORY.

charles-lunarg commented 3 years ago

/build/vulkan-tools- failed with ERROR_OUT_OF_HOST_MEMORY

That implies the vulkaninfo version used is 1.1.130, which is missing some of the changes I've made to vulkaninfo since then. Not to mention running a newer loader might solve the problem as well.

Is the AMD switchable graphics layer present at all? Maybe that's what is originally returning the error.

Vulkaninfo generally throws its hands up in the air if any vulkan function fails, because if it did continue, it might hard crash later or report incorrect information.

brianpaul commented 3 years ago

Sorry, I don't know what the "AMD switchable graphics layer" is.

I'm using the latest vulkan loader and tools trees. The error message is an example. It's the same with the latest code.

I understand vulkaninfo throwing up its hands if some things fail, but I've hacked the code so that the VK_ERROR_OUT_OF_HOST_MEMORY I described above is not special-cased by the loader, and then it works as I'd expect.

I guess I should probably have filed this issue with the loader and not tools.

charles-lunarg commented 3 years ago

Ah now that I see you were referring to loader code, rather than vulkaninfo code, the changes you made make sense.

Looking at the loader logic there, I think the 'bail on OUT_OF_HOST_MEMORY' (OOHM) behavior is intended, as that error is used to signal that malloc has failed, and if the driver can't do what it needs to and returns OOHM, then neither can the loader. A driver returning INITIALIZATION_FAILED (INIT_FAILED) and then being skipped over is consistent, since it means that specific driver didn't succeed (and we should remove it from the list of enabled drivers) and the loader should then try to load the other drivers on the system.
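The policy described in this comment can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the loader's actual code: the `sketch_` names, the driver record, and the value returned when no driver loads are all hypothetical, and the enum only mirrors the numeric VkResult values so that the sketch builds without the Vulkan headers.

```c
#include <stddef.h>

/* Stand-ins mirroring the numeric VkResult values in vulkan_core.h, so the
 * sketch compiles without the Vulkan headers. */
typedef enum {
    SKETCH_SUCCESS = 0,
    SKETCH_ERROR_OUT_OF_HOST_MEMORY = -1,
    SKETCH_ERROR_INITIALIZATION_FAILED = -3
} sketch_result;

/* Hypothetical driver record: create_result is what that driver's
 * vkCreateInstance would return; enabled is filled in by the loop below. */
typedef struct {
    const char *name;
    sketch_result create_result;
    int enabled;
} sketch_driver;

/* Applies the policy from the comment above: OOHM aborts instance creation
 * for every driver, any other failure only disables the failing driver. */
sketch_result sketch_create_instance(sketch_driver *drivers, size_t count,
                                     size_t *enabled_count)
{
    size_t enabled = 0;

    for (size_t i = 0; i < count; ++i) {
        drivers[i].enabled = 0;
    }

    for (size_t i = 0; i < count; ++i) {
        if (drivers[i].create_result == SKETCH_ERROR_OUT_OF_HOST_MEMORY) {
            /* OOHM is treated as fatal: drop everything enabled so far. */
            for (size_t j = 0; j < i; ++j) {
                drivers[j].enabled = 0;
            }
            *enabled_count = 0;
            return SKETCH_ERROR_OUT_OF_HOST_MEMORY;
        }
        if (drivers[i].create_result != SKETCH_SUCCESS) {
            continue; /* this driver is skipped, the others are still tried */
        }
        drivers[i].enabled = 1;
        ++enabled;
    }

    *enabled_count = enabled;
    /* Returning INITIALIZATION_FAILED when nothing loaded is this sketch's
     * own choice, not a statement about the real loader. */
    return enabled > 0 ? SKETCH_SUCCESS : SKETCH_ERROR_INITIALIZATION_FAILED;
}
```

With a working Intel entry and an AMD entry that returns OOHM, the function reports OOHM and leaves zero drivers enabled, which matches the behavior reported at the top of this issue. If the AMD entry returns INIT_FAILED instead, the Intel entry stays enabled.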

If AMDVLK is indeed returning OOHM when it should be returning INIT_FAILED, then vulkaninfo shouldn't be affected. Though, if AMDVLK returns INIT_FAILED but the intel drivers aren't reported, then something else is amiss.

brianpaul commented 3 years ago

Yeah, I think the root bug may be in the AMDVLK driver and I've reported it to them. But I have a hunch they're going to say that it's a loader bug.

IMHO, it seems very unlikely that the driver would really run out of host memory during vkCreateInstance. "Host memory" here means ordinary heap memory in the process, right?

charles-lunarg commented 3 years ago

Yes, host memory should refer to regular malloc'd memory. Also yes, the driver really shouldn't be returning OOHM, as it's pretty darn rare in practice, especially with virtual memory in the mix. BUT this wouldn't be the first time drivers or the loader returned the wrong error code, so it wouldn't surprise me if that did happen.

H5117 commented 3 years ago

I have a similar issue, but with a different error.

$ lspci|grep VGA
00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 630 (Desktop)
01:00.0 VGA compatible controller: NVIDIA Corporation GK208B [GeForce GT 730] (rev a1)

$ vulkaninfo
MESA-INTEL: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0

ERROR at /build/vulkan-tools/src/Vulkan-Tools-1.2.172/vulkaninfo/vulkaninfo.h:248:vkGetPhysicalDeviceSurfaceFormats2KHR failed with ERROR_INITIALIZATION_FAILED

$ VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/nvidia_icd.json vulkaninfo
ERROR at /build/vulkan-tools/src/Vulkan-Tools-1.2.172/vulkaninfo/vulkaninfo.h:248:vkGetPhysicalDeviceSurfaceFormats2KHR failed with ERROR_INITIALIZATION_FAILED

Arch Linux: vulkan-icd-loader 1.2.172-1, vulkan-intel 20.3.4-3, nvidia 460.67-2, nvidia-utils 460.67-1, vulkan-tools 1.2.172-1, linux 5.11.8.arch1-1

charles-lunarg commented 3 years ago

@H5117 Correction: I first wrote that the [GF 108] GeForce GT 730 does not support Vulkan, but I was looking at the wrong GPU. The 208 does indeed support Vulkan; the 108 is the one that doesn't. vulkaninfo requires at least one valid GPU to run, but the vulkan-loader is responsible for finding valid Vulkan drivers on the system. It seems that the loader considers the NVIDIA driver to be valid, and the driver then returns a valid VkPhysicalDevice that vulkaninfo can use. vkGetPhysicalDeviceSurfaceFormats2KHR is then failing when called with this physical device.

Can you set the env-var VK_LOADER_DEBUG=all, run vulkaninfo again, and return the output generated?

H5117 commented 3 years ago

@charles-lunarg Here is the output: vulkaninfo.txt. vkcube also works only with explicit selection of the Intel GPU. And segfaults by default:

$ vkcube
MESA-INTEL: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0

Selected GPU 1: NVIDIA GeForce GT 730, type: 2
Can't find our preferred formats... Falling back to first exposed format. Rendering may be incorrect.
Segmentation fault (core dumped)

charles-lunarg commented 3 years ago

This looks more and more like an issue with the driver. The NVIDIA driver should either not be found (because it doesn't support Vulkan) or not report support for any physical devices. However, I cannot rule out the possibility that a loader bug is causing this issue. Generally speaking, only SDK versions of the loader and tooling are validated. Using individual header updates means you are liable to pick up bugs that were introduced and then fixed before the SDK release. Can you update to 1.2.176 and rerun the code?

pdaniell-nv commented 3 years ago

The "NVIDIA Corporation GK208B [GeForce GT 730]" device should support Vulkan. I would be interested in seeing the callstack for the crash.

H5117 commented 3 years ago

The same behavior with vulkan-icd-loader 1.2.176-1 and vulkan-tools 1.2.176-1.

Stack trace with -DCMAKE_BUILD_TYPE=Release
Thread 1 "vkcube" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
#0 0x0000000000000000 in ?? ()
#1 0x00007ffff401766f in ?? () from /usr/lib/
#2 0x00007ffff400adac in ?? () from /usr/lib/
#3 0x00007ffff4017ba3 in ?? () from /usr/lib/

Stack trace with -DCMAKE_BUILD_TYPE=Debug
Thread 1 "vkcube" received signal SIGABRT, Aborted.
#0 0x00007ffff7c33ef5 in raise () from /usr/lib/
#1 0x00007ffff7c1d862 in abort () from /usr/lib/
#2 0x00007ffff7c1d747 in __assert_fail_base.cold () from /usr/lib/
#3 0x00007ffff7c2c646 in __assert_fail () from /usr/lib/
#4 0x0000555555559e4e in demo_init_vk_swapchain (demo=0x7fffffffd500) at /usr/src/debug/Vulkan-Tools-1.2.176/cube/cube.c:3685
#5 main (argc=, argv=) at /usr/src/debug/Vulkan-Tools-1.2.176/cube/cube.c:4202

H5117 commented 3 years ago

It may be worth noting that I don't have a monitor attached to the Nvidia card; it is used as an OpenCL device only. But IMHO vulkaninfo should work in this case, and vkcube should not crash.

charles-lunarg commented 3 years ago

In this case, vkcube is crashing because a call to vkGetPhysicalDeviceSurfaceFormatsKHR is returning a non-success value, which indicates that something related to the surface isn't working. So it's less crashing and more just failing an assert. I do agree that the error reporting could be better, but I assert (heh) that vkcube did what it could to verify that the system can support surfaces (by verifying that VK_KHR_surface and the platform-specific surface extension are present and enabled), and then attempted to query the surface info (formats, support, capabilities, etc.), and that's when it failed.
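As a rough illustration of what "better error reporting" could look like, here is a minimal C sketch that turns a failed surface query into a message and a "skip this surface" decision instead of an assert. It is not vkcube's code: the function, the enum, and the message wording are hypothetical, and the VkResult is passed as a plain int (0 is VK_SUCCESS, negative values are errors) so the sketch builds without the Vulkan headers.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical outcome of inspecting one surface query. */
typedef enum {
    SKETCH_SURFACE_USABLE,
    SKETCH_SURFACE_SKIPPED
} sketch_surface_status;

/* call_name:    name of the Vulkan call that was made, used in the message.
 * vk_result:    the VkResult it returned (0 is VK_SUCCESS, negative values
 *               are errors such as VK_ERROR_INITIALIZATION_FAILED == -3).
 * format_count: how many surface formats the call reported.
 * message:      buffer that receives a human-readable explanation. */
sketch_surface_status sketch_check_surface_query(const char *call_name,
                                                 int vk_result,
                                                 unsigned format_count,
                                                 char *message,
                                                 size_t message_size)
{
    if (vk_result < 0) {
        snprintf(message, message_size,
                 "%s failed with VkResult %d; skipping this surface",
                 call_name, vk_result);
        return SKETCH_SURFACE_SKIPPED;
    }
    if (format_count == 0) {
        snprintf(message, message_size,
                 "%s reported no formats; skipping this surface", call_name);
        return SKETCH_SURFACE_SKIPPED;
    }
    snprintf(message, message_size, "%s reported %u format(s)",
             call_name, format_count);
    return SKETCH_SURFACE_USABLE;
}
```

A caller would pass the result of the real query and print `message` before deciding whether to continue with another surface or GPU, rather than aborting the whole process.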

I am not the vkcube maintainer, so my experience with that codebase is limited; it is quite possible that vkcube could be doing more to ensure that it works.

As for vulkaninfo, that definitely is an issue; vulkaninfo should be more resilient to faults. Though, if there is an issue where the vulkan-loader reports support for surface extensions but crashes in calls to them (i.e. what vkcube could be suffering from), then vulkaninfo has the same limitation of only being able to check for those extensions to determine support.

kkartaltepe commented 3 years ago

The spec declares that "surface must be supported by physicalDevice, as reported by vkGetPhysicalDeviceSurfaceSupportKHR or an equivalent platform-specific mechanism", and at this point vulkaninfo has in fact not called vkGetPhysicalDeviceSurfaceSupportKHR.

If we do attempt to call vkGetPhysicalDeviceSurfaceSupportKHR for every queue when using NVIDIA's drivers, we find that NVIDIA is happy to report that present is not supported on any queue for some surface types. These are the surfaces for which errors are reported where vulkaninfo doesn't expect them.

I would imagine this means that vkcube successfully presenting frames to such a surface on a queue is out of spec, but I don't pretend to know the infinite wisdom of the spec authors and NVIDIA engineers. It seems vkcube uses a surface type chosen at compile time, which NVIDIA does support present for, and it gives up, complaining that it couldn't find appropriate queues, if you change to the troublesome surface type. If we instead give up on querying PhysicalDeviceSurface information in AppSurface when vkGetPhysicalDeviceSurfaceSupportKHR returns false for queue 0 (or maybe for all of them), everything else completes successfully.
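The guard suggested here could look roughly like the following C sketch. It assumes the caller has already called vkGetPhysicalDeviceSurfaceSupportKHR once per queue family and stored each VkBool32 result in an array; the function name, the policy enum, and the plain `unsigned` stand-in for VkBool32 are hypothetical and are not taken from vulkaninfo's source.

```c
#include <stddef.h>

/* The two variants mentioned above: look only at queue family 0, or accept
 * the surface if any queue family can present to it. */
typedef enum {
    SKETCH_CHECK_FIRST_QUEUE,
    SKETCH_CHECK_ANY_QUEUE
} sketch_present_policy;

/* present_supported[i] holds the value that
 * vkGetPhysicalDeviceSurfaceSupportKHR wrote for queue family i
 * (non-zero means presentation is supported).
 * Returns 1 if the surface formats, capabilities, and present modes should
 * be queried, 0 if the surface should be skipped for this physical device. */
int sketch_should_query_surface(const unsigned *present_supported,
                                size_t queue_family_count,
                                sketch_present_policy policy)
{
    if (queue_family_count == 0) {
        return 0;
    }
    if (policy == SKETCH_CHECK_FIRST_QUEUE) {
        return present_supported[0] != 0;
    }
    for (size_t i = 0; i < queue_family_count; ++i) {
        if (present_supported[i] != 0) {
            return 1;
        }
    }
    return 0;
}
```

The "any queue" policy is the more permissive of the two: a device whose queue family 0 cannot present but whose queue family 1 can would still have its surface information reported.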