GPUOpen-Drivers / llpc

LLVM-Based Pipeline Compiler
MIT License
165 stars 115 forks source link

Fragment shader not writing to depth buffer correctly #314

Closed s-perron closed 4 years ago

s-perron commented 4 years ago

I have noticed a shader that does not write to the depth buffer correctly. It seems like two OpKills with a load of an image are needed. I find it very odd that the load of the image makes a difference.

I will attached a reduced version of the spir-v, and the gcn that was generated. In the gcn, there is a sample trace through the code that shows the problem. Please let me know if I have misunderstood any of the instructions. I'm still learning gcn.

s-perron commented 4 years ago

reduced.trace.gcn.txt reduced.spvasm.txt

s-perron commented 4 years ago

I need to go back to the drawing board. The first instructions in BB0_4 is an OR, but I treated it like an AND.

linqun commented 4 years ago

What will happen if change the sample function from texture to textureLod? use textureLod can avoid set whole quad mode (s_wqm_b64), and simplify the generated gcn code.

s-perron commented 4 years ago

After looking at it more, I believe the implicit lod is required for to see the problem. The s_wqm_b64 instruction is done at the start of the function. Presumebly because the implicit lod needs all threads in the quad to be active.

However, after the code related to the first OpKill, some threads will get turned off. When the image_sameple_b is reached, not every thread in quads with active threads will be active.

Does this sound right?

s-perron commented 4 years ago

After looking at it further, I believe that the "SI Whole Quad Mode" is not complete.

Non uniform control flow will cause a work group that is in whole quad mode to no longer be in whole quad mode because part of a quad could go one way and the rest another. However, there is no instruction to put it back into whole quad mode.

I can take a look at it next week if nobody is looking at it yet.

Here is another example:

               OpCapability Shader
               OpCapability SampledBuffer
               OpCapability StorageImageExtendedFormats
          %1 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel Logical GLSL450
               OpEntryPoint Fragment %ShadePixel "main" %gl_FragCoord
               OpExecutionMode %ShadePixel OriginUpperLeft
               OpSource HLSL 600
               OpName %ShadePixel "ShadePixel"
               OpDecorate %gl_FragCoord BuiltIn FragCoord
      %float = OpTypeFloat 32
    %v4float = OpTypeVector %float 4
       %bool = OpTypeBool
    %float_0 = OpConstant %float 0
%_ptr_Input_v4float = OpTypePointer Input %v4float
       %void = OpTypeVoid
         %26 = OpTypeFunction %void
%gl_FragCoord = OpVariable %_ptr_Input_v4float Input
 %ShadePixel = OpFunction %void None %26
         %27 = OpLabel
         %28 = OpLoad %v4float %gl_FragCoord
         %29 = OpCompositeExtract %float %28 0
         %30 = OpFOrdLessThan %bool %29 %float_0
               OpSelectionMerge %42 None
               OpBranchConditional %30 %42 %31
         %31 = OpLabel
         %37 = OpDPdx %v4float %28
         %38 = OpCompositeExtract %float %37 0
         %39 = OpFOrdLessThan %bool %38 %float_0
               OpSelectionMerge %40 None
               OpBranchConditional %39 %41 %40
         %41 = OpLabel
               OpKill
         %40 = OpLabel
               OpBranch %42
         %42 = OpLabel
               OpReturn
               OpFunctionEnd
s-perron commented 4 years ago

Here is the code that llpc generates for the new example

Disassembly of section .text:
_amdgpu_ps_main:
        s_mov_b64 s[0:1], exec                                     // 000000000000: BE80017E 
        s_wqm_b64 exec, exec                                       // 000000000004: BEFE077E 
        v_cmp_ngt_f32_e32 vcc, 0, v2                               // 000000000008: 7C960480 
        s_and_saveexec_b64 s[2:3], vcc                             // 00000000000C: BE82206A 
        s_xor_b64 s[2:3], exec, s[2:3]                             // 000000000010: 8882027E 
        s_cbranch_execz BB0_4                                      // 000000000014: BF88000D 

BB0_1:
        v_mov_b32_e32 v0, v2                                       // 000000000018: 7E000302 
        s_nop 1                                                    // 00000000001C: BF800001 
        v_mov_b32_dpp v0, v0  quad_perm:[1,1,1,1] row_mask:0xf bank_mask:0xf bound_ctrl:0// 000000000020: 7E0002FA FF085500 
        s_nop 1                                                    // 000000000028: BF800001 
        v_subrev_f32_dpp v0, v2, v0  quad_perm:[0,0,0,0] row_mask:0xf bank_mask:0xf bound_ctrl:0// 00000000002C: 060000FA FF080002 
        v_mov_b32_e32 v0, v0                                       // 000000000034: 7E000300 
        v_cmp_gt_f32_e32 vcc, 0, v0                                // 000000000038: 7C880080 
        s_and_saveexec_b64 s[4:5], vcc                             // 00000000003C: BE84206A 
        s_xor_b64 s[4:5], exec, s[4:5]                             // 000000000040: 8884047E 

BB0_2:
        s_mov_b64 exec, 0                                          // 000000000044: BEFE0180 

BB0_3:
        s_or_b64 exec, exec, s[4:5]                                // 000000000048: 87FE047E 

BB0_4:
        s_or_b64 exec, exec, s[2:3]                                // 00000000004C: 87FE027E 
        s_and_b64 exec, exec, s[0:1]                               // 000000000050: 86FE007E 
        v_mov_b32_e32 v0, 0                                        // 000000000054: 7E000280 
        exp mrt0 v0, off, off, off done vm                         // 000000000058: C4001801 00000000 
        s_endpgm                                                   // 000000000060: BF810000 
perlfu commented 4 years ago

The "SI Whole Quad Mode" implementation currently only supports Vulkan's model of implicit derivatives. The behaviour of these is only defined for uniform control flow.

From SPIRV 1.5 spec.:

2.20 Code Motion Texturing instructions in the Fragment Execution Model that rely on an implicit derivative cannot be moved into control flow that is not known to be uniform control flow within each derivative group.

I haven't stared hard at the shader here, but I suspect this limitation might be what you are running into.

perlfu commented 4 years ago

If this is the case, then you probably need to be using the DemoteToHelperInvocationEXT extension, substituting OpDemoteToHelperInvocationEXT for OpKill instructions. Note that OpDemoteToHelperInvocationEXT is not a terminator, so it's not always a simple substitution as branches must converge. I've been looking at adding a pass to do this automatically for shaders which rely on this kind of behaviour.

s-perron commented 4 years ago

Thanks for the explanation. That sound exactly like what is going on. This was an hlsl shader being compiled for Vulkan instead of directx. I now think this should be fixed in DXC. It should generate the new instruction instead of an OpKill.

Also the pass you are looking at writing might be better in spirv-tools. It could be used by anybody instead of being specific to AMD hardware. I'll see if it will be needed on our side.