Invalid "Out of bounds" BDA errors with GPUAV

spnda commented 5 months ago

Environment:

OS: Windows 11
GPU and driver version: RTX 3060 Ti, 552.31
SDK or header version if building from repo: Built freshly from commit b8422fa42c7fe50c831e16ea5a58971bc70c30f4, also present in at least SDK 1.3.283.
Options enabled (synchronization, best practices, etc.): GPUAV (just the "GPU-Assisted Preset")

Describe the Issue

Without the validation layers my application works as expected. However, when I enable GPU-Assisted through vkconfig I get Device address out of bounds errors, and the rendering stops working correctly. I suspect this to be caused by the VVL since this is the message that is printed:

Validation Error: [ UNASSIGNED-Device address out of bounds ] Object 0: handle = 0x21803220290, name = Graphics queue, type = VK_OBJECT_TYPE_QUEUE; | MessageID = 0x1a898625 | vkCmdDrawMeshTasksEXT():  Out of bounds access: 336 bytes read at buffer device address 0x159012c0. Command buffer (0x218062d74d0). Draw Index 0. Pipeline (Visibility buffer mesh pipeline)(0xb6bee80000000073). Shader Module (0x44695a0000000071). Shader Instruction Index = 520.  Stage = MeshEXT. Global invocation ID (x, y, z) = (1985, 0, 0 ) Shader validation error occurred in file E:/git/spnda/vk_gltf_viewer/shaders/visbuffer/visbuffer.mesh.glsl at line 42.Unable to find suitable #line directive in SPIR-V OpSource.
    Objects: 1
        [0] 0x21803220290, VK_OBJECT_TYPE_QUEUE, Graphics queue

Note the 336 bytes read at buffer device address 0x159012c0. I have confirmed that that address is correct, however the struct in that buffer is only 160 bytes long. I managed to get around the issue by changing this:

    restrict const RealCamera camera = pushConstants.cameraBuffer.camera;
    mat4 mvp = camera.viewProjection * transformMatrix;

to just this:

    mat4 mvp = pushConstants.cameraBuffer.camera.viewProjection * transformMatrix;

With that code change the validation layers report nothing and rendering works as expected.

SPIR-V with no errors reported:

        %148 = OpAccessChain %_ptr_PushConstant__ptr_PhysicalStorageBuffer_CameraBuffer %_ %int_0 %int_4
        %149 = OpLoad %_ptr_PhysicalStorageBuffer_CameraBuffer %148
        %150 = OpAccessChain %_ptr_PhysicalStorageBuffer_mat4v4float %149 %int_0 %int_0
        %151 = OpLoad %mat4v4float %150 Aligned 4

SPIR-V with errors reported:

        %150 = OpAccessChain %_ptr_PushConstant__ptr_PhysicalStorageBuffer_CameraBuffer %_ %int_0 %int_4
        %151 = OpLoad %_ptr_PhysicalStorageBuffer_CameraBuffer %150
        %153 = OpAccessChain %_ptr_PhysicalStorageBuffer_RealCamera %151 %int_0
        %154 = OpLoad %RealCamera %153 Aligned 4
        %155 = OpCopyLogical %RealCamera_0 %154
               OpStore %camera %155
        %157 = OpAccessChain %_ptr_Function_mat4v4float %camera %int_0
        %158 = OpLoad %mat4v4float %157

A SPIR-V wizard on Discord suggested that this might be caused by the OpCopyLogical, since it seems like there's a difference with the copy, but I can't confirm this since my shader uses VK_EXT_mesh_shader which requires SPIR-V 1.4. The RealCamera struct looks like this:

struct RealCamera {
    mat4 viewProjection;
    vec4 frustum[6];
};
layout(buffer_reference, scalar, buffer_reference_align = 4) restrict readonly buffer CameraBuffer {
    RealCamera camera;
};

I've attached both SPIR-V files to this issue: spirv_files.zip

Expected behavior

Since the SPIR-V generated by glslang looks fine, I would expect the validation layers to report nothing in this case.

spencer-lunarg commented 5 months ago

first, thanks for all this detail and reporting this!!

doing a quick look I can see the issue i we do our shader instrumentation pass, it is giving the value of 336 instead of 64

Looking at the glsl I see the issue

    restrict const RealCamera camera = pushConstants.cameraBuffer.camera;
    mat4 mvp = camera.viewProjection * transformMatrix;

this first does a OpLoad that dereference the pointer, then it does another OpLoad on the it to get the viewProjection and we are mixing this up

I should be able to get a fix in today, mainly will just be adding more tests to catch these (and related) cases

spencer-lunarg commented 5 months ago

Update (and future notes for myself)

The core issue the SPIR-V ends up being

%camera = OpVariable %51 Function

%56 = OpAccessChain %TypeStruct %address
%57 = OpLoad %RealCamera %56 Aligned 4
OpStore %camera %57

%60 = OpAccessChain %mat4 %camera 0
%61 = OpLoad %11 %60 // camera.viewProjection

and the shader instrumentation is calling on TypeManager::TypeLength() on the OpTypeStruct it sees in the first load

... but I looked what RADV and Intel NIR produced and it shows that they never "load" the struct, but only the single field

so even if this example was

struct RealCamera {
    vec4 frustum[6];
    mat4 viewProjection;
};

it will calculate the offset of vec4 frustum[6] and do a @load_global on the offset pointer

tl;dr - for sure a bug on our end and I will have to account for "load of struct" actually just being a proxy for the "real" load

Hugobros3 commented 5 months ago

I don't think this partial-load behavior is a bug on the validation side, drivers optimizing away loads is platform-specific behavior that can't reliably be guessed, and loading something out-of-bounds even if unused is still UB.

What concerns me here is the fact the size of load (336 bytes) as stated in the OP is completely wrong, earlier on Discord it was 304 bytes. The RealCamera struct itself is only 160 bytes...

Hugobros3 commented 5 months ago

I might have the problem (certainly a problem):

case spv::OpTypeArray: {
    const Type* element_type = FindTypeById(type.inst_.Operand(0));
    const uint32_t count = element_type->inst_.Operand(0);
    return count * TypeLength(*element_type);
}

The computation for count seems wrong. The count is the second operand of the type, and it's not a literal but a constant/spec constant. The spirv file @spnda showed me used ID %19 for OpConstant %int 6, and well, 16 * 19 = 304 ... I think that's it!

For reference:

Hugobros3 commented 5 months ago

It appears this:

case spv::OpTypeArray: {
    const Type* element_type = FindTypeById(type.inst_.Operand(0));
    const Constant* count = FindConstantById(type.inst_.Operand(1));
    return count->inst_.Operand(0) * TypeLength(*element_type);
}

works.

A proper fix would include logic for pulling out the constant that also works with 64-bit values. I suppose spec constants are disallowed at this stage right?

spencer-lunarg commented 5 months ago

@Hugobros3 wow that is clearly a bug in the OpTypeArray logic and tried it and now prints (correctly) 160 instead of 336 (which I also realize now is a garbage value)

A proper fix would include logic for pulling out the constant that also works with 64-bit values

Yes, there is a TODO to handle 64-bit constant values elsewhere

I suppose spec constants are disallowed at this stage right?

You are allowed to use spec constant on this array, but not obvious, will adjust your logic to handle that case

spencer-lunarg commented 5 months ago

@spnda and @Hugobros3 thanks again for all this help and closing this issue as its scope is complete

So I went the side of caution and in #8075 I hopefully fixed the issue by suppressing checks on the edge cases, but I have a test (and issue tracker at #8089) to better understand why I am still getting some false positive with my original solution, but will be a bit until I get time on it.

KhronosGroup / Vulkan-ValidationLayers

Invalid "Out of bounds" BDA errors with GPUAV #8073