Add VkPhysicalDevicePortabilitySubsetFeaturesKHR:: shaderTessellationInputOutputBarrier capability

billhollings commented 2 years ago

Some CTS tests fail when using barriers between tessellation stages.

This might be fixable with effort within the implementations.

In the meantime, these use cases can be disabled in CTS and reported as Not Supported, by adding a VkPhysicalDevicePortabilitySubsetFeaturesKHR:: shaderTessellationInputOutputBarrier capability.

This affects 1 CTS tests:

dEQP-VK.tessellation.shader_input_output.barrier
  Fail (Failure)

cdavis5e commented 1 year ago

I think I've figured out what's wrong. As I initially thought when I first investigated this, it is a miscompilation. In fact, unlike the other ones we've seen so far, this one is actually at the AIR level!

The shader in question has this code sequence:

in_te_attr[gl_InvocationID] = in_tc_attr[gl_InvocationID]
in_te_patchAddr = 0.0f;

barrier();

if (gl_InvocationID == 5)
        in_te_patchAttr = float(gl_InvocationID)*0.1;

barrier();

highp float temp = in_te_patchAttr + in_te_attr[gl_InvocationID];

This ultimately gets compiled down to this AIR with optimization; I've annotated the compiler-generated IR with the corresponding lines of the original GLSL shader:

; in_te_attr[gl_InvocationID] = in_tc_attr[gl_InvocationID]
%22 = zext i32 %8 to i64
%23 = getelementptr inbounds %struct.main0_in, %struct.main0_in addrspace(1)* %21, i64 %22
%24 = bitcast %struct.main0_in addrspace(1)* %23 to i32 addrspace(1)*
%25 = load i32, i32 addrspace(1)* %24, align 4, !tbaa !32
%26 = getelementptr inbounds %struct.main0_out, %struct.main0_out addrspace(1)* %11, i64 %22
%27 = getelementptr inbounds %struct.main0_out, %struct.main0_out addrspace(1)* %26, i64 0, i32 0
%28 = bitcast %struct.main0_out addrspace(1)* %26 to i32 addrspace(1)*
store i32 %25, i32 addrspace(1)* %28, align 4, !tbaa !35
; in_te_patchAddr = 0.0f;
%29 = getelementptr inbounds %struct.main0_out, %struct.main0_out addrspace(1)* %26, i64 0, i32 0
store float 0.000000e+00, float addrspace(1)* %29, align 4, !tbaa !37
; barrier();
tail call void @air.wg.barrier(i32 3, i32 1) #2
; if (gl_InvocationID == 5)
%30 = icmp eq i32 %8, 5
%31 = bitcast i32 %25 to float ; !!
br i1 %30, label %32, label %33

; <label>:32:
;         in_te_patchAttr = float(gl_InvocationID)*0.1;
store float 5.000000e-01, float addrspace(1)* %29, align 4, !tbaa !37
br label %33

; <label>:33:
%34 = phi float [ 5.000000e-01, %32 ], [ 0.000000e+00, %6 ] ; Copy propagated from above!!
; barrier();
tail call void @air.wg.barrier(i32 3, i32 1) #2
; highp float temp = in_te_patchAttr + in_te_attr[gl_InvocationID];
%35 = fadd fast float %34, %31 ; Copy propagated from above->BOOM!!

I've called out three lines of particular interest:

%31 = bitcast i32 %25 to float
%34 = phi float [ 5.000000e-01, %32 ], [ 0.000000e+00, %6 ]
%35 = fadd fast float %34, %31

Notice how, instead of going back to memory, the compiler propagates the stored expressions above to the add expression. It further doesn't realize that the output buffers are in fact shared with several other threads, so that value %34 could be replaced with a constant float 5.000000e-01; it thinks instead that each thread will only see the value that it wrote to the buffer. (Since this is difficult to get right in the general case, it's probably best to avoid copy propagation across barriers entirely.) The end result is that some of the tessellation control invocations wind up using the wrong values, which propagate down the pipeline.

Fortunately, this suggests a workaround: if we declare the output buffers as volatile, the compiler will not be able to copy-propagate them at all. This may inhibit opportunities for real compiler optimizations that aren't miscompilations, but it will make code relying on synchronization work correctly.

billhollings commented 1 year ago

Wow! Wonderful work! Thanks for such an in-depth level of research and reporting! I tip my hat to you and your detective skills! 🫡

I guess the %31 (in_te_attr) propagation is fine, because the buffer write is strictly dependent on gl_InvocationID, which means it won't be affected by other invocations. But definitely the %34 is going to propagate incorrectly as 0.0 in all but one invocation.

The volatile qualifier sounds like a good idea. To cover the cases we need covered, when would it be declared? On all writeable buffer variables? Or only if they appear as an LHS? Or only on variables that appear on both sides of a barrier()?, etc?

cdavis5e commented 1 year ago

The volatile qualifier sounds like a good idea. To cover the cases we need covered, when would it be declared? On all writeable buffer variables? Or only if they appear as an LHS? Or only on variables that appear on both sides of a barrier()?, etc?

To be honest, now that I've tried actually implementing this, I'm not so sure this is a good idea. The problem is that most structs, including POD structs, don't support being copied to or from volatile memory, which would necessitate elementwise copying. Of course, we'd need something like that anyway to properly implement OpCopyLogical, but still, this is starting to seem like more trouble than it's worth...

billhollings commented 9 months ago

Fixed.

KhronosGroup / Vulkan-Portability

Add VkPhysicalDevicePortabilitySubsetFeaturesKHR:: shaderTessellationInputOutputBarrier capability #50