Closed spnda closed 5 months ago
first, thanks for all this detail and reporting this!!
doing a quick look I can see the issue i we do our shader instrumentation pass, it is giving the value of 336
instead of 64
Looking at the glsl I see the issue
restrict const RealCamera camera = pushConstants.cameraBuffer.camera;
mat4 mvp = camera.viewProjection * transformMatrix;
this first does a OpLoad
that dereference the pointer, then it does another OpLoad
on the it to get the viewProjection
and we are mixing this up
I should be able to get a fix in today, mainly will just be adding more tests to catch these (and related) cases
Update (and future notes for myself)
The core issue the SPIR-V ends up being
%camera = OpVariable %51 Function
%56 = OpAccessChain %TypeStruct %address
%57 = OpLoad %RealCamera %56 Aligned 4
OpStore %camera %57
%60 = OpAccessChain %mat4 %camera 0
%61 = OpLoad %11 %60 // camera.viewProjection
and the shader instrumentation is calling on TypeManager::TypeLength()
on the OpTypeStruct
it sees in the first load
... but I looked what RADV and Intel NIR produced and it shows that they never "load" the struct, but only the single field
so even if this example was
struct RealCamera {
vec4 frustum[6];
mat4 viewProjection;
};
it will calculate the offset of vec4 frustum[6]
and do a @load_global
on the offset pointer
tl;dr - for sure a bug on our end and I will have to account for "load of struct" actually just being a proxy for the "real" load
I don't think this partial-load behavior is a bug on the validation side, drivers optimizing away loads is platform-specific behavior that can't reliably be guessed, and loading something out-of-bounds even if unused is still UB.
What concerns me here is the fact the size of load (336 bytes) as stated in the OP is completely wrong, earlier on Discord it was 304 bytes. The RealCamera
struct itself is only 160 bytes...
I might have the problem (certainly a problem):
case spv::OpTypeArray: {
const Type* element_type = FindTypeById(type.inst_.Operand(0));
const uint32_t count = element_type->inst_.Operand(0);
return count * TypeLength(*element_type);
}
The computation for count
seems wrong. The count is the second operand of the type, and it's not a literal but a constant/spec constant. The spirv file @spnda showed me used ID %19 for OpConstant %int 6
, and well, 16 * 19 = 304 ... I think that's it!
For reference:
It appears this:
case spv::OpTypeArray: {
const Type* element_type = FindTypeById(type.inst_.Operand(0));
const Constant* count = FindConstantById(type.inst_.Operand(1));
return count->inst_.Operand(0) * TypeLength(*element_type);
}
works.
A proper fix would include logic for pulling out the constant that also works with 64-bit values. I suppose spec constants are disallowed at this stage right?
@Hugobros3 wow that is clearly a bug in the OpTypeArray
logic and tried it and now prints (correctly) 160
instead of 336
(which I also realize now is a garbage value)
A proper fix would include logic for pulling out the constant that also works with 64-bit values
Yes, there is a TODO to handle 64-bit constant values elsewhere
I suppose spec constants are disallowed at this stage right?
You are allowed to use spec constant on this array, but not obvious, will adjust your logic to handle that case
@spnda and @Hugobros3 thanks again for all this help and closing this issue as its scope is complete
So I went the side of caution and in #8075 I hopefully fixed the issue by suppressing checks on the edge cases, but I have a test (and issue tracker at #8089) to better understand why I am still getting some false positive with my original solution, but will be a bit until I get time on it.
Environment:
Describe the Issue
Without the validation layers my application works as expected. However, when I enable
GPU-Assisted
through vkconfig I getDevice address out of bounds
errors, and the rendering stops working correctly. I suspect this to be caused by the VVL since this is the message that is printed:Note the
336 bytes read at buffer device address 0x159012c0
. I have confirmed that that address is correct, however the struct in that buffer is only 160 bytes long. I managed to get around the issue by changing this:to just this:
With that code change the validation layers report nothing and rendering works as expected.
SPIR-V with no errors reported:
SPIR-V with errors reported:
A SPIR-V wizard on Discord suggested that this might be caused by the
OpCopyLogical
, since it seems like there's a difference with the copy, but I can't confirm this since my shader usesVK_EXT_mesh_shader
which requires SPIR-V 1.4. TheRealCamera
struct looks like this:I've attached both SPIR-V files to this issue: spirv_files.zip
Expected behavior
Since the SPIR-V generated by glslang looks fine, I would expect the validation layers to report nothing in this case.