Try / Tempest

API abstraction layer for 3D graphics, UI and sound. Written in C++17 with Vulkan, DX12 and Metal support.
MIT License

GPU driven rendering #33

Closed Try closed 2 weeks ago

Try commented 2 years ago

This ticket tracks ideas and known solutions for GPU-driven rendering.

Vulkan-Extensions

  1. VK_NV_mesh_shader https://www.geeks3d.com/20200519/introduction-to-mesh-shaders-opengl-and-vulkan/ https://on-demand.gputechconf.com/siggraph/2018/video/sig1811-3-christoph-kubisch-mesh-shaders.html http://vbomesh.blogspot.com/2018/09/meshlets.html
  2. vkCmdDrawIndexedIndirect https://vkguide.dev/docs/gpudriven/draw_indirect/
  3. VK_EXT_conditional_rendering https://www.saschawillems.de/blog/2018/09/05/vulkan-conditional-rendering/

Known production solutions

  1. Assassin's Creed https://advances.realtimerendering.com/s2015/aaltonenhaar_siggraph2015_combined_final_footer_220dpi.pdf
  2. AMD: March of the Froblins https://developer.amd.com/wordpress/media/2013/01/Chapter03-SBOT-March_of_The_Froblins.pdf
  3. Frostbite https://frostbite-wp-prd.s3.amazonaws.com/wp-content/uploads/2016/03/29204330/GDC_2016_Compute.pdf
  4. Nanite https://www.elopezr.com/a-macro-view-of-nanite/ https://advances.realtimerendering.com/s2021/Karis_Nanite_SIGGRAPH_Advances_2021_final.pdf

Current idea

Use VK_NV_mesh_shader as the starting point, and build an emulation layer to enable mesh shaders on a wider range of hardware.

Try commented 2 years ago

Now that mesh shading is released for OpenGothic, we can start thinking about the next steps.

With VK_NV_mesh_shader everything fits the engine well; mesh shaders just need to be emulated on other platforms.

Idea for emulation workflow:

Spirv patching notes:

OpDecorate %1234 BuiltIn PrimitiveCountNV    <-- should be noped/removed
%gl_PrimitiveCountNV = OpVariable %_ptr_Output_uint Output  <-- should be mutated to shared-variable
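
The first patch above can be sketched on the raw SPIR-V word stream. This is only an illustration, not the engine's code: the function name is hypothetical, the numeric values are the ones I recall from the SPIR-V specification and SPV_NV_mesh_shader, and the second patch (changing the storage class of the variable and of its pointer type) is not covered.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Enum values as recalled from the SPIR-V spec / SPV_NV_mesh_shader (assumption: verify before use).
constexpr uint32_t kOpNop                   = 0;
constexpr uint32_t kOpDecorate              = 71;
constexpr uint32_t kDecorationBuiltIn       = 11;
constexpr uint32_t kBuiltInPrimitiveCountNV = 5275;
constexpr size_t   kHeaderWords             = 5; // magic, version, generator, bound, schema

// Hypothetical helper: overwrites every `OpDecorate %id BuiltIn PrimitiveCountNV`
// with OpNop words, in place. Returns the number of decorations patched.
inline size_t nopPrimitiveCountDecoration(std::vector<uint32_t>& code) {
  assert(code.size() >= kHeaderWords && "not a SPIR-V module");
  size_t patched = 0;
  size_t i       = kHeaderWords;
  while(i < code.size()) {
    const uint32_t opcode    = code[i] & 0xFFFFu; // low 16 bits: opcode
    const uint32_t wordCount = code[i] >> 16;     // high 16 bits: instruction length
    if(wordCount == 0 || i + wordCount > code.size())
      break; // malformed stream, stop instead of looping forever
    if(opcode == kOpDecorate && wordCount == 4 &&
       code[i+2] == kDecorationBuiltIn && code[i+3] == kBuiltInPrimitiveCountNV) {
      // OpNop is a one-word instruction, so every word becomes its own OpNop.
      for(uint32_t w = 0; w < wordCount; ++w)
        code[i+w] = (1u << 16) | kOpNop;
      ++patched;
    }
    i += wordCount;
  }
  return patched;
}
```

Overwriting with OpNop keeps all word offsets stable, which is convenient while other patches are still being applied; a real tool may prefer to drop the words entirely.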

Counting shader

// upfront. Using set=1 is ideal, since engine doesn't work with multiple descriptor sets
layout(set = 1, binding = 0) buffer EngineInternal
{
    uint countersCount;
    uint counters[];
} engine;
---
// tail of the main function
  if(_gl_PrimitiveCountNV!=0) {
    uint pos = atomicAdd(engine.countersCount, 1);
    engine.counters[pos] = _gl_PrimitiveCountNV;
    }

Once the counters are written, an internal shader has to build the multi-draw-indirect buffer, using prefix-summed counts.

// recap note about indirect commands
struct VkDrawIndexedIndirectCommand {
   uint32_t    indexCount;
   uint32_t    instanceCount;
   uint32_t    firstIndex; // prefix sum
   int32_t     vertexOffset; // can be abused to offset into var_buffer
   uint32_t    firstInstance; // caps: should be zero
   };
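
As a rough CPU-side illustration of that step, the sketch below turns the per-workgroup counters into indirect commands with an exclusive prefix sum. The struct and function names are hypothetical, and it assumes triangles (three indices per primitive); on the GPU this would be a compute shader rather than a loop.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Field-for-field copy of the struct recapped above (hypothetical name, no Vulkan headers needed).
struct DrawIndexedIndirectCommand {
  uint32_t indexCount;
  uint32_t instanceCount;
  uint32_t firstIndex;
  int32_t  vertexOffset;
  uint32_t firstInstance;
};

// counters[i] is the primitive count that one workgroup appended to engine.counters.
// Assumption: every primitive is a triangle, i.e. three indices.
inline std::vector<DrawIndexedIndirectCommand>
buildIndirect(const std::vector<uint32_t>& counters) {
  std::vector<DrawIndexedIndirectCommand> cmd;
  cmd.reserve(counters.size());
  uint32_t running = 0; // exclusive prefix sum of the index counts so far
  for(uint32_t primitives : counters) {
    const uint32_t indexCount = primitives * 3u;
    cmd.push_back({indexCount, 1u, running, 0, 0u});
    running += indexCount;
  }
  return cmd;
}
```

For counters {2, 1, 4} this gives index counts 6, 3, 12 and first indices 0, 6, 9, so the draws address disjoint, tightly packed ranges of one index buffer.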

Final draw

Each vkCmdDrawMeshTasksNV call is replaced by vkCmdDrawIndexedIndirect, which consumes var_buffer and passes its contents on to the fragment shader.

Multiple renderpasses

For now, a VkEvent should be enough to synchronize with the execution of the preceding set of compute shaders.

Split command-buffers

Generating extra compute shaders requires a way to insert vkCmdDispatch commands before the beginning of the render pass. This can be done with deferred command recording, or by splitting one engine-level command buffer into multiple Vulkan command buffers. Cons:

Issues

Try commented 2 years ago

Some experiments:

  1. Added libspirv, an internal utility library for SPIR-V tooling
  2. First attempts to convert .mesh to .comp
; SPIR-V
; Version: 1.0
; Generator: Khronos Glslang Reference Front End; 10
; Bound: 82
; Schema: 0
               OpCapability Shader
          %1 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel Logical GLSL450
               OpEntryPoint GLCompute %main "main"
               OpExecutionMode %main LocalSize 1 1 1
               OpSource GLSL 450
               OpSourceExtension "GL_NV_mesh_shader"
               OpName %main "main"
               OpName %g1_MeshPerVertexNV "g1_MeshPerVertexNV"
               OpMemberName %g1_MeshPerVertexNV 0 "g1_Position"
               OpMemberName %g1_MeshPerVertexNV 1 "g1_PointSize"
               OpMemberName %g1_MeshPerVertexNV 2 "g1_ClipDistance"
               OpMemberName %g1_MeshPerVertexNV 3 "g1_CullDistance"
               OpMemberName %g1_MeshPerVertexNV 4 "g1_PositionPerViewNV"
               OpMemberName %g1_MeshPerVertexNV 5 "gl_ClipDistancePerViewNV"
               OpMemberName %g1_MeshPerVertexNV 6 "gl_CullDistancePerViewNV"
               OpName %g1_MeshVerticesNV "g1_MeshVerticesNV"
               OpName %Vbo "Vbo"
               OpMemberName %Vbo 0 "vertices"
               OpName %_ ""
               OpName %PerVertexData "PerVertexData"
               OpMemberName %PerVertexData 0 "color"
               OpName %v_out "v_out"
               OpName %g1_PrimitiveIndicesNV "g1_PrimitiveIndicesNV"
               OpName %g1_PrimitiveCountNV "g1_PrimitiveCountNV"
               OpName %VkDrawIndexedIndirectCommand "VkDrawIndexedIndirectCommand"
               OpMemberName %VkDrawIndexedIndirectCommand 0 "indexCount"
               OpMemberName %VkDrawIndexedIndirectCommand 1 "instanceCount"
               OpMemberName %VkDrawIndexedIndirectCommand 2 "firstIndex"
               OpMemberName %VkDrawIndexedIndirectCommand 3 "vertexOffset"
               OpMemberName %VkDrawIndexedIndirectCommand 4 "firstInstance"
               OpDecorate %_runtimearr_v2float ArrayStride 8
               OpMemberDecorate %Vbo 0 NonWritable
               OpMemberDecorate %Vbo 0 Offset 0
               OpDecorate %Vbo BufferBlock
               OpDecorate %_ DescriptorSet 0
               OpDecorate %_ Binding 0
               OpDecorate %v_out Location 0
               OpDecorate %gl_WorkGroupSize BuiltIn WorkgroupSize
               OpDecorate %VkDrawIndexedIndirectCommand BufferBlock
               OpDecorate %80 DescriptorSet 1
               OpDecorate %80 Binding 0
               OpMemberDecorate %VkDrawIndexedIndirectCommand 0 Offset 0
               OpMemberDecorate %VkDrawIndexedIndirectCommand 1 Offset 4
               OpMemberDecorate %VkDrawIndexedIndirectCommand 2 Offset 8
               OpMemberDecorate %VkDrawIndexedIndirectCommand 3 Offset 12
               OpMemberDecorate %VkDrawIndexedIndirectCommand 4 Offset 16
       %void = OpTypeVoid
          %3 = OpTypeFunction %void
      %float = OpTypeFloat 32
    %v4float = OpTypeVector %float 4
       %uint = OpTypeInt 32 0
     %uint_1 = OpConstant %uint 1
%_arr_float_uint_1 = OpTypeArray %float %uint_1
     %uint_4 = OpConstant %uint 4
%_arr_v4float_uint_4 = OpTypeArray %v4float %uint_4
%_arr__arr_float_uint_1_uint_4 = OpTypeArray %_arr_float_uint_1 %uint_4
%g1_MeshPerVertexNV = OpTypeStruct %v4float %float %_arr_float_uint_1 %_arr_float_uint_1 %_arr_v4float_uint_4 %_arr__arr_float_uint_1_uint_4 %_arr__arr_float_uint_1_uint_4
     %uint_3 = OpConstant %uint 3
%_arr_g1_MeshPerVertexNV_uint_3 = OpTypeArray %g1_MeshPerVertexNV %uint_3
%_ptr_Workgroup__arr_g1_MeshPerVertexNV_uint_3 = OpTypePointer Workgroup %_arr_g1_MeshPerVertexNV_uint_3
%g1_MeshVerticesNV = OpVariable %_ptr_Workgroup__arr_g1_MeshPerVertexNV_uint_3 Workgroup
        %int = OpTypeInt 32 1
      %int_0 = OpConstant %int 0
    %v2float = OpTypeVector %float 2
%_runtimearr_v2float = OpTypeRuntimeArray %v2float
        %Vbo = OpTypeStruct %_runtimearr_v2float
%_ptr_Uniform_Vbo = OpTypePointer Uniform %Vbo
          %_ = OpVariable %_ptr_Uniform_Vbo Uniform
%_ptr_Uniform_v2float = OpTypePointer Uniform %v2float
    %float_0 = OpConstant %float 0
    %float_1 = OpConstant %float 1
%_ptr_Workgroup_v4float = OpTypePointer Workgroup %v4float
      %int_1 = OpConstant %int 1
      %int_2 = OpConstant %int 2
%PerVertexData = OpTypeStruct %v4float
%_arr_PerVertexData_uint_3 = OpTypeArray %PerVertexData %uint_3
%_ptr_Workgroup__arr_PerVertexData_uint_3 = OpTypePointer Workgroup %_arr_PerVertexData_uint_3
      %v_out = OpVariable %_ptr_Workgroup__arr_PerVertexData_uint_3 Workgroup
         %54 = OpConstantComposite %v4float %float_1 %float_0 %float_0 %float_1
         %56 = OpConstantComposite %v4float %float_0 %float_1 %float_0 %float_1
         %58 = OpConstantComposite %v4float %float_0 %float_0 %float_1 %float_1
%_arr_uint_uint_3 = OpTypeArray %uint %uint_3
%_ptr_Workgroup__arr_uint_uint_3 = OpTypePointer Workgroup %_arr_uint_uint_3
%g1_PrimitiveIndicesNV = OpVariable %_ptr_Workgroup__arr_uint_uint_3 Workgroup
     %uint_0 = OpConstant %uint 0
%_ptr_Workgroup_uint = OpTypePointer Workgroup %uint
     %uint_2 = OpConstant %uint 2
%g1_PrimitiveCountNV = OpVariable %_ptr_Workgroup_uint Workgroup
     %v3uint = OpTypeVector %uint 3
%gl_WorkGroupSize = OpConstantComposite %v3uint %uint_1 %uint_1 %uint_1
    %v3float = OpTypeVector %float 3
%_arr_v3float_uint_3 = OpTypeArray %v3float %uint_3
         %74 = OpConstantComposite %v3float %float_1 %float_0 %float_0
         %75 = OpConstantComposite %v3float %float_0 %float_1 %float_0
         %76 = OpConstantComposite %v3float %float_0 %float_0 %float_1
         %77 = OpConstantComposite %_arr_v3float_uint_3 %74 %75 %76
%VkDrawIndexedIndirectCommand = OpTypeStruct %uint %uint %uint %int %uint
%_ptr_Uniform_VkDrawIndexedIndirectCommand = OpTypePointer Uniform %VkDrawIndexedIndirectCommand
         %80 = OpVariable %_ptr_Uniform_VkDrawIndexedIndirectCommand Uniform
       %main = OpFunction %void None %3
          %5 = OpLabel
         %27 = OpAccessChain %_ptr_Uniform_v2float %_ %int_0 %int_0
         %28 = OpLoad %v2float %27
         %31 = OpCompositeExtract %float %28 0
         %32 = OpCompositeExtract %float %28 1
         %33 = OpCompositeConstruct %v4float %31 %32 %float_0 %float_1
         %35 = OpAccessChain %_ptr_Workgroup_v4float %g1_MeshVerticesNV %int_0 %int_0
               OpStore %35 %33
         %37 = OpAccessChain %_ptr_Uniform_v2float %_ %int_0 %int_1
         %38 = OpLoad %v2float %37
         %39 = OpCompositeExtract %float %38 0
         %40 = OpCompositeExtract %float %38 1
         %41 = OpCompositeConstruct %v4float %39 %40 %float_0 %float_1
         %42 = OpAccessChain %_ptr_Workgroup_v4float %g1_MeshVerticesNV %int_1 %int_0
               OpStore %42 %41
         %44 = OpAccessChain %_ptr_Uniform_v2float %_ %int_0 %int_2
         %45 = OpLoad %v2float %44
         %46 = OpCompositeExtract %float %45 0
         %47 = OpCompositeExtract %float %45 1
         %48 = OpCompositeConstruct %v4float %46 %47 %float_0 %float_1
         %49 = OpAccessChain %_ptr_Workgroup_v4float %g1_MeshVerticesNV %int_2 %int_0
               OpStore %49 %48
         %55 = OpAccessChain %_ptr_Workgroup_v4float %v_out %int_0 %int_0
               OpStore %55 %54
         %57 = OpAccessChain %_ptr_Workgroup_v4float %v_out %int_1 %int_0
               OpStore %57 %56
         %59 = OpAccessChain %_ptr_Workgroup_v4float %v_out %int_2 %int_0
               OpStore %59 %58
         %65 = OpAccessChain %_ptr_Workgroup_uint %g1_PrimitiveIndicesNV %int_0
               OpStore %65 %uint_0
         %66 = OpAccessChain %_ptr_Workgroup_uint %g1_PrimitiveIndicesNV %int_1
               OpStore %66 %uint_1
         %68 = OpAccessChain %_ptr_Workgroup_uint %g1_PrimitiveIndicesNV %int_2
               OpStore %68 %uint_2
               OpStore %g1_PrimitiveCountNV %uint_1
               OpReturn
               OpFunctionEnd

In here:

Try commented 2 years ago

Strategy update for the compute-driven workflow:

Extra descriptor set:

struct IndirectCmd { // 32 bytes
  uint    indexCount;
  uint    instanceCount;
  uint    firstIndex;    // prefix sum
  int     vertexOffset;  // can be abused to offset into var_buffer
  uint    firstInstance; // caps: should be zero

  uint    self;  // sequential id of dispatchMesh class, in render-pass
  uint    padd0;
  uint    padd1;
  }; // 32 bytes

layout(set = 1, binding = 0, std430) buffer EngineInternal0 {
  IndirectCmd cmd[];
  } indirect; // indirect buffer, mostly set by CPU, except for indexCount, firstIndex

layout(set = 1, binding = 1, std430) buffer EngineInternal1 {
  uint    grow;
  uint    ibo[];
  } ind;

layout(set = 1, binding = 2, std430) buffer EngineInternal2 {
  uint    grow;
  uint    vbo[];
  } var;

layout(set = 1, binding = 3, std430) buffer EngineInternal3 {
  uint    grow; // and dispatchX
  uint    dispatchY; // =1
  uint    dispatchZ; // =1
  uint    desc[];
  } mesh;

layout(set = 1, binding = 4, std430) buffer EngineInternal4 {
  uint    ibo[];
  } indFlat;
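
A host-side mirror of IndirectCmd makes the layout assumptions checkable at compile time. This is a sketch: the C++ struct and the commandOffset helper are hypothetical and only restate the GLSL declaration above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical C++ mirror of the GLSL IndirectCmd struct (std430, all members 4 bytes wide).
struct IndirectCmd {
  uint32_t indexCount;
  uint32_t instanceCount;
  uint32_t firstIndex;
  int32_t  vertexOffset;
  uint32_t firstInstance;
  uint32_t self;  // engine-only payload, placed after the draw-command fields
  uint32_t padd0;
  uint32_t padd1;
};

static_assert(sizeof(IndirectCmd) == 32, "must match the 32-byte GLSL layout");
static_assert(offsetof(IndirectCmd, firstInstance) == 16, "first five fields form the draw command");
static_assert(offsetof(IndirectCmd, self) == 20, "engine data starts right after the 20-byte draw command");

// Byte offset of command i inside the indirect buffer.
constexpr uint64_t commandOffset(uint32_t i) {
  return uint64_t(i) * sizeof(IndirectCmd);
}
```

Because the first 20 bytes have the same layout as VkDrawIndexedIndirectCommand, the buffer can be fed to vkCmdDrawIndexedIndirect with a stride of 32; the trailing self/padd fields are simply skipped by the draw.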

Workflow by example:

      enc.setFramebuffer({{fbo,Vec4(0,0,1,1),Tempest::Preserve}});
      enc.setUniforms(pso,ubo);
      enc.dispatchMesh(0,3);
      enc.dispatchMesh(3,2);

Will be translated as:

      enc.setUniforms(pso_compute_ms,ubo);
      // vkCmdBindDescriptorSets(internalSet, dynOffset = 0);
      enc.dispatch(3, 1,1);
      // vkCmdBindDescriptorSets(internalSet, dynOffset = commandId);
      // TODO: pass base taskID somehow
      enc.dispatch(2, 1,1);
     ....
      VkBufferMemoryBarrier(comp -> comp, indirect.ind);
      // after all 'dispatchMesh' are done
      // the prefix-sum pass actually does 2 jobs:
      // indirect.cmd[i].firstIndex = prefixSum(indexCount);
      // indirect.cmd[i].indexCount = 0; <-- will be re-accumulated in the compaction pass
      enc.setUniforms(psoSum,uboSum);
      enc.dispatch(1,1,1); // 1 group with 256 threads
      // should be dispatch-indirect
      VkBufferMemoryBarrier(comp -> comp, all helper buffers, except var);
      enc.setUniforms(psoCompactage,uboCompactage);
      enc.dispatchIndirect(mesh.grow,1,1);
      VkBufferMemoryBarrier(comp -> vert);

      // main rendering, as drawIndirect
      enc.setFramebuffer({{fbo,Vec4(0,0,1,1),Tempest::Preserve}});
      enc.setUniforms(pso,ubo);
      enc.drawIndirect(indirect.cmd[0]);
      enc.drawIndirect(indirect.cmd[1]);
      // vert -> comp barrier at end of render-pass
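
One possible reading of the prefix-sum and compaction passes, modelled on the CPU. Everything here is an assumption made for illustration: Cmd is a cut-down IndirectCmd, MeshletOut stands in for whatever mesh.desc[] really stores, and the sequential loop replaces what would be an atomicAdd per workgroup on the GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Cmd { uint32_t indexCount; uint32_t firstIndex; }; // cut-down IndirectCmd

// Hypothetical stand-in for one mesh.desc[] entry: which draw the meshlet
// belongs to and the indices it produced.
struct MeshletOut { uint32_t cmdId; std::vector<uint32_t> indices; };

// Pass 1: firstIndex becomes the exclusive prefix sum of indexCount,
// and indexCount is reset so that pass 2 can re-accumulate it.
inline void prefixSumPass(std::vector<Cmd>& cmd) {
  uint32_t running = 0;
  for(Cmd& c : cmd) {
    c.firstIndex  = running;
    running      += c.indexCount;
    c.indexCount  = 0;
  }
}

// Pass 2: every meshlet copies its indices into the range of its draw.
inline std::vector<uint32_t> compactionPass(std::vector<Cmd>& cmd,
                                            const std::vector<MeshletOut>& meshlets) {
  size_t total = 0;
  for(const MeshletOut& m : meshlets)
    total += m.indices.size();
  std::vector<uint32_t> flat(total);
  for(const MeshletOut& m : meshlets) {
    Cmd& c = cmd[m.cmdId];
    const uint32_t dst = c.firstIndex + c.indexCount; // atomicAdd(c.indexCount, n) on the GPU
    std::copy(m.indices.begin(), m.indices.end(), flat.begin() + dst);
    c.indexCount += uint32_t(m.indices.size());
  }
  return flat;
}
```

With two draws whose meshlets finish interleaved (draw 0, draw 1, draw 0), the flat buffer still ends up grouped per draw, and each indexCount is restored to its original value once compaction is done.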

Try commented 2 years ago

Current implementation: [image]

  1. Each dispatch-mesh call works as a pair: compute shader + draw-indirect
  2. Both the compute shader and the vertex pass-through shader are generated from a single mesh shader: cc326ee
  3. Once all compute passes related to the draw calls have finished, the output should be sorted (only in the prototype, not in the engine) and forwarded to vkCmdDrawIndexedIndirect

TODO:

  1. Add VMeshShaderEmulated as a special case in the related parts of the engine
  2. Take care of pipeline-memory allocation and scheduling in general

Try commented 2 years ago

First proof-of-concept triangle: [image]

TODO: Need a way to pass firstTask and selfId to the compute shader

Try commented 2 years ago

Current idea for passing firstTask and selfId:

Use the base-workgroup inputs of vkCmdDispatchBase, for example vkCmdDispatchBase(impl, firstTask, self, 0, taskCount, 1,1): firstTask goes into baseGroupX and self into baseGroupY. This will break some builtin variables.

// workgroup dimensions
in uvec3 gl_NumWorkGroups; // not sure how this interacts with vkCmdDispatchBase
const uvec3 gl_WorkGroupSize;  // unaffected

// workgroup and invocation IDs
in uvec3 gl_WorkGroupID;  // Y is polluted
in uvec3 gl_LocalInvocationID; // unaffected

// derived variables
in uvec3 gl_GlobalInvocationID; // polluted, since it is byproduct of gl_WorkGroupID
in uint gl_LocalInvocationIndex; // unaffected
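
A small CPU model of the IDs such a dispatch produces. It relies on the Vulkan rule that vkCmdDispatchBase yields workgroup IDs in [baseGroup, baseGroup + groupCount) per component; the helper names and the Decoded struct are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

using UVec3 = std::array<uint32_t, 3>;

// Enumerates the gl_WorkGroupID values seen by a dispatch with the given base and count.
inline std::vector<UVec3> workGroupIds(UVec3 base, UVec3 count) {
  std::vector<UVec3> ids;
  for(uint32_t z = 0; z < count[2]; ++z)
    for(uint32_t y = 0; y < count[1]; ++y)
      for(uint32_t x = 0; x < count[0]; ++x)
        ids.push_back(UVec3{base[0] + x, base[1] + y, base[2] + z});
  return ids;
}

// What the emulated mesh shader would read back (hypothetical decoding):
// X carries firstTask + task index, Y carries the self id.
struct Decoded { uint32_t taskId; uint32_t selfId; };

inline Decoded decode(const UVec3& workGroupId) {
  return {workGroupId[0], workGroupId[1]};
}
```

For base (5, 2, 0) and count (3, 1, 1) the IDs are (5,2,0), (6,2,0), (7,2,0): the task index arrives already offset by firstTask, while Y no longer means what user code expects. Note that a non-zero base also requires the pipeline to be created with VK_PIPELINE_CREATE_DISPATCH_BASE.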

Try commented 2 years ago

Almost there: [image]

Normals are broken, because the translator can't handle arrayed varyings

Try commented 2 years ago

Running stable on OpenGothic: [image]

Try commented 2 years ago

New idea on how to avoid scratch-buffer traffic problems (and make the solution more Intel-friendly): decouple .mesh into separate index and vertex shaders. In most cases this can be done if the vertex computation is a uniform-function.

By uniform-function I mean a function that uses only constants, locals, uniforms, read-only SSBOs and push constants, in any combination, and has no side effects. It is similar to a pure function, but less restricted. This would allow moving most of the computation to the vertex shader.

The only problem is gl_WorkGroupID.x, which is used all over the place.
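
To make the decoupling idea concrete, here is a CPU sketch under stated assumptions: shadeVertex plays the role of a uniform-function, Scene stands in for uniforms and read-only SSBOs, and packing the workgroup id into the upper bits of the index is just one hypothetical way to carry gl_WorkGroupID.x over to the vertex stage.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec2 { float x, y; };

// Read-only inputs, standing in for uniforms and read-only SSBOs.
struct Scene {
  std::vector<Vec2> vbo;    // source positions, 3 per workgroup in this toy setup
  Vec2              offset; // a "uniform"
};

// A uniform-function: the result depends only on its arguments and read-only data.
inline Vec2 shadeVertex(const Scene& s, uint32_t workGroupId, uint32_t localVertex) {
  const Vec2 p = s.vbo[workGroupId * 3u + localVertex];
  return {p.x + s.offset.x, p.y + s.offset.y};
}

// Emulation as before: compute every vertex up front and store it in a scratch buffer.
inline std::vector<Vec2> viaScratch(const Scene& s, uint32_t groups) {
  std::vector<Vec2> scratch;
  for(uint32_t g = 0; g < groups; ++g)
    for(uint32_t v = 0; v < 3; ++v)
      scratch.push_back(shadeVertex(s, g, v));
  return scratch;
}

// Decoupled: the index stage only emits packed (workgroup, local vertex) ids...
inline std::vector<uint32_t> indexStage(uint32_t groups) {
  std::vector<uint32_t> idx;
  for(uint32_t g = 0; g < groups; ++g)
    for(uint32_t v = 0; v < 3; ++v)
      idx.push_back((g << 8) | v); // hypothetical packing: 8 bits for the local vertex
  return idx;
}

// ...and the vertex stage re-evaluates the same function from the packed id.
inline Vec2 vertexStage(const Scene& s, uint32_t packed) {
  return shadeVertex(s, packed >> 8, packed & 0xFFu);
}
```

Both paths yield identical vertices, but the decoupled one writes only indices to memory; the price is that the workgroup id has to be recovered from the index, which is exactly where the gl_WorkGroupID.x dependency bites.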

Try commented 1 year ago

GL_EXT_spirv_intrinsics is out. Surprisingly, it allows bypassing some of the compiler: https://shader-playground.timjones.io/626ea18db0663c9ef7d1257940b7a195

Try commented 2 weeks ago

Closing: indirect is mostly implemented in the engine (except for builtins, which are ignored for now)