GPUOpen-Tools / GPU-Reshape

GPU Reshape (GRS) is an API & vendor agnostic instrumentation framework, with instruction level validation.

Vulkan - having Khronos validation layer fails instance creation #52

Closed ColumbusUtrigas closed 7 months ago

ColumbusUtrigas commented 8 months ago

VK_LAYER_KHRONOS_validation

miguel-petersen commented 8 months ago

Hi! Could you describe how you enable the layer? Manually adding it to the instance layers, Vulkan Configurator, etc.?

Additionally, what Vulkan SDK are you on?

ColumbusUtrigas commented 8 months ago

Hi!

Vulkan SDK 1.3.246.1. I manually add "VK_LAYER_KHRONOS_validation" to the enabled layers in VkInstanceCreateInfo, and vkCreateInstance then returns VK_ERROR_LAYER_NOT_PRESENT. I'm not using Vulkan Configurator.
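
For reference, a minimal sketch of a guard around that step, so that a layer is only requested when the loader reports it. It is written without the Vulkan headers so it stands alone; `IsLayerAvailable` and `FilterRequestedLayers` are hypothetical helper names, and the `available` list is assumed to be filled from the `layerName` field of each `VkLayerProperties` returned by `vkEnumerateInstanceLayerProperties`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns true when `wanted` appears in `available`.
// Assumption: in a real application `available` holds the layerName of every
// VkLayerProperties entry returned by vkEnumerateInstanceLayerProperties.
bool IsLayerAvailable(const std::vector<std::string>& available, const char* wanted) {
    for (const std::string& name : available) {
        if (name == wanted) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: keeps only the requested layers that were reported as
// available. The result would be passed as ppEnabledLayerNames, which avoids
// asking vkCreateInstance for a layer the loader did not list.
std::vector<const char*> FilterRequestedLayers(const std::vector<std::string>& available,
                                               const std::vector<const char*>& requested) {
    std::vector<const char*> enabled;
    for (const char* layer : requested) {
        assert(layer != nullptr);
        if (IsLayerAvailable(available, layer)) {
            enabled.push_back(layer);
        }
    }
    return enabled;
}
```

Comparing the filtered list against the requested one also shows, from inside the application, whether the validation layer is visible to the loader at the time of the call.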

miguel-petersen commented 8 months ago

I'm not sure if the Vulkan loader has this feature on by default, but just in case, could you run your app with the environment variable "VK_LOADER_DEBUG" set to "all"?

Reshape currently doesn't let you override env vars through the Launch dialog, but Discovery (the icon top right, looks like a radar) would let you launch your app from VS (or whichever IDE).

If it works, you should see a bunch of logging. It'll appear in the VS Output Window / LLDB window.

If you are comfortable with it, could you paste that text here? If there's any confidential information, try to see if it's rejecting the validation dll for any reason, and if the path it resides in is reported or not.

ColumbusUtrigas commented 8 months ago

Thanks, not sure what needs to be included, so here is a file with the entire output.

output.txt

The last exception seems to be another issue that I've noticed: a crash on vkCmdBindPipeline with VK_PIPELINE_BIND_POINT_RAY_TRACING_KHR. This RT pipeline has 3 shading groups with 1 shader in each, and an SBT with default strides and no custom data packed into it.

miguel-petersen commented 8 months ago

What I find strange is that it seems to find it.

LAYER:            vkCreateInstance layer callstack setup to:
LAYER:               <Application>
LAYER:                 ||
LAYER:               <Loader>
LAYER:                 ||
LAYER:               VK_LAYER_NV_optimus
LAYER:                       Type: Implicit
LAYER:                           Disable Env Var:  DISABLE_LAYER_NV_OPTIMUS_1
LAYER:                       Manifest: C:\WINDOWS\System32\DriverStore\FileRepository\nvquui.inf_amd64_a1f26b3e707cf15b\nv-vk64.json
LAYER:                       Library:  C:\WINDOWS\System32\DriverStore\FileRepository\nvquui.inf_amd64_a1f26b3e707cf15b\.\nvoglv64.dll
LAYER:                 ||
LAYER:               VK_LAYER_GPUOPEN_GRS
LAYER:                       Type: Implicit
LAYER:                           Disable Env Var:  DISABLE_VK_LAYER_GPUOpen_GRS
LAYER:                       Manifest: C:\Users\Columbus\Desktop\GPUReshape\VK_LAYER_GPUOPEN_GRS.json
LAYER:                       Library:  C:\Users\Columbus\Desktop\GPUReshape\.\GRS.Backends.Vulkan.Layer.dll
LAYER:                 ||
LAYER:               VK_LAYER_KHRONOS_validation
LAYER:                       Type: Explicit
LAYER:                       Manifest: C:\VulkanSDK\1.3.246.1\Bin\VkLayer_khronos_validation.json
LAYER:                       Library:  C:\VulkanSDK\1.3.246.1\Bin\.\VkLayer_khronos_validation.dll
LAYER:                 ||
LAYER:               <Drivers>

INFO | LAYER:     Inserted device layer "VK_LAYER_KHRONOS_validation" (C:\VulkanSDK\1.3.246.1\Bin\.\VkLayer_khronos_validation.dll)
INFO | LAYER:     Inserted device layer "VK_LAYER_GPUOPEN_GRS" (C:\Users\Columbus\Desktop\GPUReshape\.\GRS.Backends.Vulkan.Layer.dll)
INFO | LAYER:     Inserted device layer "VK_LAYER_NV_optimus" (C:\WINDOWS\System32\DriverStore\FileRepository\nvquui.inf_amd64_a1f26b3e707cf15b\.\nvoglv64.dll)
DRIVER | LAYER:   vkCreateDevice layer callstack setup to:
DRIVER | LAYER:      <Application>
DRIVER | LAYER:        ||
DRIVER | LAYER:      <Loader>
DRIVER | LAYER:        ||
LAYER:               VK_LAYER_NV_optimus
LAYER:                       Type: Implicit
LAYER:                           Disable Env Var:  DISABLE_LAYER_NV_OPTIMUS_1
LAYER:                       Manifest: C:\WINDOWS\System32\DriverStore\FileRepository\nvquui.inf_amd64_a1f26b3e707cf15b\nv-vk64.json
LAYER:                       Library:  C:\WINDOWS\System32\DriverStore\FileRepository\nvquui.inf_amd64_a1f26b3e707cf15b\.\nvoglv64.dll
LAYER:                 ||
LAYER:               VK_LAYER_GPUOPEN_GRS
LAYER:                       Type: Implicit
LAYER:                           Disable Env Var:  DISABLE_VK_LAYER_GPUOpen_GRS
LAYER:                       Manifest: C:\Users\Columbus\Desktop\GPUReshape\VK_LAYER_GPUOPEN_GRS.json
LAYER:                       Library:  C:\Users\Columbus\Desktop\GPUReshape\.\GRS.Backends.Vulkan.Layer.dll
LAYER:                 ||
LAYER:               VK_LAYER_KHRONOS_validation
LAYER:                       Type: Explicit
LAYER:                       Manifest: C:\VulkanSDK\1.3.246.1\Bin\VkLayer_khronos_validation.json
LAYER:                       Library:  C:\VulkanSDK\1.3.246.1\Bin\.\VkLayer_khronos_validation.dll
LAYER:                 ||
DRIVER | LAYER:      <Device>

miguel-petersen commented 8 months ago

I've tried reproducing your issue, but seem unable to so far.

Is there a sample I could try locally? Ideally something I can build myself?

ColumbusUtrigas commented 8 months ago

Hello, sorry for the late response.

For some reason it works when connecting to a running process. But with Discovery enabled, it produces a validation error on device creation (VUID-VkDeviceCreateInfo-pNext-02830), even though none of the complaints are true (and it doesn't complain with Discovery disabled). Also, as I mentioned above, it produces an exception when binding the RT pipeline.

https://github.com/ColumbusUtrigas/ColumbusEngine/tree/new-render is the "new-render" branch. It requires a bunch of submodules. Build the main project, then build the "Shaders" project (requires a Vulkan SDK with both dxc and glslc) and copy the full "PrecompiledShaders" folder into the directory with the binary. Then run both the "do_always1" and "do_always2" targets (they copy over the Data folder).

In Core/Windows/WindowsMain2.cpp around line 530 point it to some GLTF2.0 scene on your machine.

In Graphics/Vulkan/InstanceVulkan.h:44 I add the layer.

Note: I was able to reproduce it on newer hardware with NVIDIA driver v536.67.

miguel-petersen commented 8 months ago

No worries!

Thanks for the sample, I'll see if I can reproduce the issue locally. 🙂

miguel-petersen commented 8 months ago

Hmm, I built it from source as described, however it seems to add the validation layer on my end.

On the raytracing issues, would you have the time to try building the branch below? https://github.com/GPUOpen-Tools/GPU-Reshape/tree/issue/49-vkQueueSubmit2

ColumbusUtrigas commented 8 months ago

Hmm, strange, I have two machines (both NVIDIA) that still reproduce it, on that branch as well.

I had a few debugging sessions, tell me if I can do or test something else. I'll try running it on AMD in the near future, but here are results for now.

The missing layer problem is strange. I make the program wait for a debugger to attach on start, so that I can launch it from GRS (which is also running under the debugger). I was hoping to intercept Hook_vkCreateInstance, but I can't find the GRS DLL loaded at that time (in either process), so I can't place a breakpoint. Which is strange, because I expected it to inject the DLL before instance creation comes into place. Tell me if I can do something else to debug this problem on my side.

The device creation validation error is also still present. It is produced in Device.cpp:297, in Hook_vkCreateDevice of GRS.Backends.Vulkan.Layer.dll. I see that on lines 240-244 a few features are added to the chain. As I use VkPhysicalDeviceVulkan12Features to initialise the device on my side, it's illegal to add physicalDeviceDescriptorIndexingFeatures. But it doesn't seem to crash anything, just a little annoying.
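
To illustrate the rule behind VUID-VkDeviceCreateInfo-pNext-02830, here is a minimal sketch of the kind of check that avoids the conflict: walk the pNext chain and only append the standalone descriptor indexing struct when the aggregated 1.2 features struct is absent. This is not the GRS code. `StructType`, `BaseIn`, `ChainContains` and `CanAppendDescriptorIndexing` are hypothetical stand-ins so the example compiles without the SDK; real code would walk `VkBaseInStructure` and compare against the `VK_STRUCTURE_TYPE_*` enumerators.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for VkStructureType (hypothetical values, not the real enumerators).
enum class StructType : uint32_t {
    DeviceCreateInfo,
    Vulkan12Features,
    DescriptorIndexingFeatures
};

// Stand-in for VkBaseInStructure: a type tag followed by the next pointer.
struct BaseIn {
    StructType sType;
    const BaseIn* pNext;
};

// Walks the chain starting at `head` and reports whether `type` is present.
bool ChainContains(const BaseIn* head, StructType type) {
    for (const BaseIn* it = head; it != nullptr; it = it->pNext) {
        if (it->sType == type) {
            return true;
        }
    }
    return false;
}

// The standalone descriptor indexing struct may only be appended when the
// application supplied neither the aggregated 1.2 features struct nor the
// descriptor indexing struct itself.
bool CanAppendDescriptorIndexing(const BaseIn* head) {
    return !ChainContains(head, StructType::Vulkan12Features) &&
           !ChainContains(head, StructType::DescriptorIndexingFeatures);
}
```

When the check fails because the 1.2 struct is present, the descriptor indexing features would presumably have to be enabled through that existing struct rather than through a new chain entry.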

I can't reproduce the raytracing issue anymore on this branch, but manually instrumenting any shader at runtime produces validation errors. Here is an example when trying to instrument Decals.hlsl for resource bounds:

[ERROR]: VALIDATION {VUID-VkShaderModuleCreateInfo-pCode-01379} Validation Error: [ VUID-VkShaderModuleCreateInfo-pCode-01379 ] Object 0: handle = 0x17fdb02de70, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0x2a1bf17f | SPIR-V module not valid: Structure id 106 decorated as Block must be explicitly laid out with MatrixStride decorations.
  %_struct_106 = OpTypeStruct %mat4v4float %mat4v4float %mat4v4float %uint %uint %uint %uint
 The Vulkan spec states: If pCode is a pointer to GLSL code, it must be valid GLSL code written to the GL_KHR_vulkan_glsl GLSL extension specification (https://vulkan.lunarg.com/doc/view/1.3.250.1/windows/1.3-extensions/vkspec.html#VUID-VkShaderModuleCreateInfo-pCode-01379)

(This error doesn't produce crashes.) A similar error is produced with the Export Stability, Descriptor and Concurrency instrumentation. They don't crash my program.

Loop instrumentation seems to work fine for it.

Initialisation instrumentation leads to this error being produced constantly:

[ERROR]: VALIDATION {VUID-vkCmdDispatch-None-02699} Validation Error: [ VUID-vkCmdDispatch-None-02699 ] Object 0: handle = 0xa5710400000451ea, type = VK_OBJECT_TYPE_DESCRIPTOR_SET; | MessageID = 0xe5d1743c | Descriptor set VkDescriptorSet 0xa5710400000451ea[] encountered the following validation error at vkCmdDispatch time: Descriptor in binding #8 index 0 is being used in draw but has never been updated via vkUpdateDescriptorSets() or a similar call. The Vulkan spec states: Descriptors in each bound descriptor set, specified via vkCmdBindDescriptorSets, must be valid as described by descriptor validity if they are statically used by a bound shader (https://vulkan.lunarg.com/doc/view/1.3.250.1/windows/1.3-extensions/vkspec.html#VUID-vkCmdDispatch-None-02699)

Callstack screenshot: image

ColumbusUtrigas commented 8 months ago

^^ Also, some type of instrumentation on my decal shader (presumably the vertex shader, as it had a lower number in the list) seemed to change the logic, so that it behaves incorrectly. It looks like a matrix operation is no longer correct, so I can see noticeable clipping.

ColumbusUtrigas commented 8 months ago

Also, when switching to path tracing ("r.Render 1" command) and instrumenting one of the shaders, it produces a similar push constants validation error:

[ERROR]: VALIDATION {VUID-vkCmdDispatch-None-02699} Validation Error: [ VUID-vkCmdDispatch-None-02699 ] Object 0: handle = 0xb7fb3800000d91c3, type = VK_OBJECT_TYPE_DESCRIPTOR_SET; | MessageID = 0xe5d1743c | Descriptor set VkDescriptorSet 0xb7fb3800000d91c3[] encountered the following validation error at vkCmdDispatch time: Descriptor in binding #8 index 0 is being used in draw but has never been updated via vkUpdateDescriptorSets() or a similar call. The Vulkan spec states: Descriptors in each bound descriptor set, specified via vkCmdBindDescriptorSets, must be valid as described by descriptor validity if they are statically used by a bound shader (https://vulkan.lunarg.com/doc/view/1.3.250.1/windows/1.3-extensions/vkspec.html#VUID-vkCmdDispatch-None-02699)

In UserCommandBuffer.cpp:75, bindState.pipeline->layout->userPushConstantLength equals 0.

miguel-petersen commented 8 months ago

Thanks for the wealth of information! 🙂

"Which is strange, because I expected it to inject the DLL before instance creation comes into place"

On Vulkan, the loader (Khronos) is responsible for loading any DLLs for interception, so it may happen inside of that call. DX12 hooking is radically different: bootstrapping has to happen almost first thing in the application. Vulkan has a little more freedom.

I'm not really sure what's happening right now (regarding the loader), I'll mull over it a bit.

"it's illegal to add physicalDeviceDescriptorIndexingFeatures. But it doesn't seem to crash anything, just a little annoying."

I'll keep you posted on everything.

miguel-petersen commented 8 months ago

Fixed the descriptor issue, push constant is next.

miguel-petersen commented 8 months ago

Fixed the push constant issue, I was not properly migrating all the member decorations.

miguel-petersen commented 8 months ago

Indexing validation error should be fixed now.

miguel-petersen commented 7 months ago

@ColumbusUtrigas Would it be possible to check if the issues are fixed on your end? Thanks! 🙂

miguel-petersen commented 7 months ago

I'm closing this as completed. Feel free to reopen it if the issue persists.