KhronosGroup / Vulkan-ValidationLayers

Vulkan Validation Layers (VVL)
https://vulkan.lunarg.com/doc/sdk/latest/linux/khronos_validation_layer.html
Other
777 stars 407 forks source link

RuntimeSpirv validation is sometimes too broad #6961

Open Rua opened 1 year ago

Rua commented 1 year ago

The current code that validates against VUID-RuntimeSpirv-Workgroup-06530, VUID-RuntimeSpirv-maxMeshSharedMemorySize-08754 and VUID-RuntimeSpirv-maxTaskSharedMemorySize-08759 is too broad and sometimes generates false positives. The CalculateWorkgroupSharedMemory function uses a list of all global variable instructions to calculate the memory size, but it doesn't take the selected entry point and its interface into account.

For example, a shader module could contain both a task and a mesh shader entry point. This function will sum the sizes of the variables used for both entry points, and then compare that grand total to the maxMeshSharedMemorySize or maxTaskSharedMemorySize limit. It should really only be counting the variables that are actually used by the selected entry point (for a SPIR-V 1.4+ shader, variables that appear in the entry point's interface).

This is a problem that exists more generally with the RuntimeSpirv VUIDs, and I opened https://github.com/KhronosGroup/Vulkan-Docs/issues/2272 to bring attention to it. In the validation layer, the problem exists with at least these VUIDs but there's probably more.

spencer-lunarg commented 1 year ago

The main issue is a driver is fully legal to do what ever it wants with a shader at vkCreateShaderModule time, regardless of which entrypoints you may use later in the pipeline. There have been real driver crashes from not having some global limits validated at vkCreateShaderModule time

My goal was to split the logic into "things at vkCreateShaderModule" vs "things at pipeline time" (as we do everything at pipeline time currently)... this got tricky with VK_EXT_shader_object and GPU-AV needing to instrument shaders

As for your maxMeshSharedMemorySize/maxTaskSharedMemorySize example, it says

is the maximum total storage size, in bytes, available for variables declared with the Workgroup storage class in shader modules in the mesh shader stage.

so this says nothing about the entrypoint of the mesh shader, just that the VkShaderModule for VK_SHADER_STAGE_MESH_BIT_EXT has this limit, so from that, this is behaving as expected.

sometimes generates false positives.

I disagree this is a false positive, and not really willing to change this unless there are proper CTS test (and spec clarification) for making sure all drivers would correctly parse the entrypoint and do full static analysis to know how much workgroup size is being used

spencer-lunarg commented 1 year ago

created https://gitlab.khronos.org/vulkan/vulkan/-/issues/3696 to discuss in WG

pixelcluster commented 1 year ago

There have been real driver crashes from not having some global limits validated at vkCreateShaderModule time

Was this ever confirmed? vkCreateShaderModule on RADV is nothing more than a memcpy (plus hashing) of the SPIR-V, and I'm pretty sure that was always the case. vkCreateShaderModule crashing the GPU sounds highly unlikely to me.

spencer-lunarg commented 1 year ago

Was this ever confirmed?

I know I saw it crash and then it was updated and doesn't crash now (I guess could grab an old commit and see). But I also was surprised it crashed in vkCreateShaderModule... if this is something we can confirm is not a VVL issue, happy to just close that issue then

created https://gitlab.khronos.org/vulkan/vulkan/-/issues/3696 to discuss in WG

so we said we thinks this is indeed to be intended for per-Entrypoint and really this is just an ambiguity in the spec. I am tasked with making the spec change, confirming with everyone it is what we wanted and also get a CTS test for it.

I will try to find time soon(ish) to write a test and update this to be at an entrypoint level