Now that mesh shading is released for OpenGothic, we can start thinking about next steps.
With VK_NV_mesh_shader everything fits the engine fine; we just need to emulate mesh shaders on other platforms.
Idea for emulation workflow:
OpDecorate %1234 BuiltIn PrimitiveCountNV <-- should be no-op'd/removed
%gl_PrimitiveCountNV = OpVariable %_ptr_Output_uint Output <-- should be mutated to a shared variable
// Declared upfront. Using set = 1 is ideal, since the engine doesn't work with multiple descriptor sets
layout(set = 1, binding = 0) buffer EngineInternal
{
uint countersCount;
uint counters[];
} engine;
---
// tail of the main function
if(_gl_PrimitiveCountNV!=0) {
uint pos = atomicAdd(engine.countersCount, 1);
engine.counters[pos] = _gl_PrimitiveCountNV;
}
Once the counters are done, an internal shader has to build a multi-draw-indirect buffer, with prefix-summed counts.
// recap note about indirect commands
struct VkDrawIndexedIndirectCommand {
uint32_t indexCount;
uint32_t instanceCount;
uint32_t firstIndex; // prefix sum
int32_t vertexOffset; // can be abused to offset into var_buffer
uint32_t firstInstance; // caps: should be zero
};
Each vkCmdDrawMeshTasks gets replaced by vkCmdDrawIndexedIndirect, which consumes var_buffer and passes it to the fragment shader.
VkEvent should be fine for synchronizing execution of the previous set of compute shaders, for now.
Generating extra compute shaders will require a way to insert vkCmdDispatch commands at the beginning of a render pass.
This can be done by deferred command recording, or by splitting one engine-level command buffer into multiple Vulkan command buffers.
Cons:
Some experiments:
libspirv - internal utility library for SPIR-V tooling.
Translated .mesh to .comp:
; SPIR-V
; Version: 1.0
; Generator: Khronos Glslang Reference Front End; 10
; Bound: 82
; Schema: 0
OpCapability Shader
%1 = OpExtInstImport "GLSL.std.450"
OpMemoryModel Logical GLSL450
OpEntryPoint GLCompute %main "main"
OpExecutionMode %main LocalSize 1 1 1
OpSource GLSL 450
OpSourceExtension "GL_NV_mesh_shader"
OpName %main "main"
OpName %g1_MeshPerVertexNV "g1_MeshPerVertexNV"
OpMemberName %g1_MeshPerVertexNV 0 "g1_Position"
OpMemberName %g1_MeshPerVertexNV 1 "g1_PointSize"
OpMemberName %g1_MeshPerVertexNV 2 "g1_ClipDistance"
OpMemberName %g1_MeshPerVertexNV 3 "g1_CullDistance"
OpMemberName %g1_MeshPerVertexNV 4 "g1_PositionPerViewNV"
OpMemberName %g1_MeshPerVertexNV 5 "gl_ClipDistancePerViewNV"
OpMemberName %g1_MeshPerVertexNV 6 "gl_CullDistancePerViewNV"
OpName %g1_MeshVerticesNV "g1_MeshVerticesNV"
OpName %Vbo "Vbo"
OpMemberName %Vbo 0 "vertices"
OpName %_ ""
OpName %PerVertexData "PerVertexData"
OpMemberName %PerVertexData 0 "color"
OpName %v_out "v_out"
OpName %g1_PrimitiveIndicesNV "g1_PrimitiveIndicesNV"
OpName %g1_PrimitiveCountNV "g1_PrimitiveCountNV"
OpName %VkDrawIndexedIndirectCommand "VkDrawIndexedIndirectCommand"
OpMemberName %VkDrawIndexedIndirectCommand 0 "indexCount"
OpMemberName %VkDrawIndexedIndirectCommand 1 "instanceCount"
OpMemberName %VkDrawIndexedIndirectCommand 2 "firstIndex"
OpMemberName %VkDrawIndexedIndirectCommand 3 "vertexOffset"
OpMemberName %VkDrawIndexedIndirectCommand 4 "firstInstance"
OpDecorate %_runtimearr_v2float ArrayStride 8
OpMemberDecorate %Vbo 0 NonWritable
OpMemberDecorate %Vbo 0 Offset 0
OpDecorate %Vbo BufferBlock
OpDecorate %_ DescriptorSet 0
OpDecorate %_ Binding 0
OpDecorate %v_out Location 0
OpDecorate %gl_WorkGroupSize BuiltIn WorkgroupSize
OpDecorate %VkDrawIndexedIndirectCommand BufferBlock
OpDecorate %80 DescriptorSet 1
OpDecorate %80 Binding 0
OpMemberDecorate %VkDrawIndexedIndirectCommand 0 Offset 0
OpMemberDecorate %VkDrawIndexedIndirectCommand 1 Offset 4
OpMemberDecorate %VkDrawIndexedIndirectCommand 2 Offset 8
OpMemberDecorate %VkDrawIndexedIndirectCommand 3 Offset 12
OpMemberDecorate %VkDrawIndexedIndirectCommand 4 Offset 16
%void = OpTypeVoid
%3 = OpTypeFunction %void
%float = OpTypeFloat 32
%v4float = OpTypeVector %float 4
%uint = OpTypeInt 32 0
%uint_1 = OpConstant %uint 1
%_arr_float_uint_1 = OpTypeArray %float %uint_1
%uint_4 = OpConstant %uint 4
%_arr_v4float_uint_4 = OpTypeArray %v4float %uint_4
%_arr__arr_float_uint_1_uint_4 = OpTypeArray %_arr_float_uint_1 %uint_4
%g1_MeshPerVertexNV = OpTypeStruct %v4float %float %_arr_float_uint_1 %_arr_float_uint_1 %_arr_v4float_uint_4 %_arr__arr_float_uint_1_uint_4 %_arr__arr_float_uint_1_uint_4
%uint_3 = OpConstant %uint 3
%_arr_g1_MeshPerVertexNV_uint_3 = OpTypeArray %g1_MeshPerVertexNV %uint_3
%_ptr_Workgroup__arr_g1_MeshPerVertexNV_uint_3 = OpTypePointer Workgroup %_arr_g1_MeshPerVertexNV_uint_3
%g1_MeshVerticesNV = OpVariable %_ptr_Workgroup__arr_g1_MeshPerVertexNV_uint_3 Workgroup
%int = OpTypeInt 32 1
%int_0 = OpConstant %int 0
%v2float = OpTypeVector %float 2
%_runtimearr_v2float = OpTypeRuntimeArray %v2float
%Vbo = OpTypeStruct %_runtimearr_v2float
%_ptr_Uniform_Vbo = OpTypePointer Uniform %Vbo
%_ = OpVariable %_ptr_Uniform_Vbo Uniform
%_ptr_Uniform_v2float = OpTypePointer Uniform %v2float
%float_0 = OpConstant %float 0
%float_1 = OpConstant %float 1
%_ptr_Workgroup_v4float = OpTypePointer Workgroup %v4float
%int_1 = OpConstant %int 1
%int_2 = OpConstant %int 2
%PerVertexData = OpTypeStruct %v4float
%_arr_PerVertexData_uint_3 = OpTypeArray %PerVertexData %uint_3
%_ptr_Workgroup__arr_PerVertexData_uint_3 = OpTypePointer Workgroup %_arr_PerVertexData_uint_3
%v_out = OpVariable %_ptr_Workgroup__arr_PerVertexData_uint_3 Workgroup
%54 = OpConstantComposite %v4float %float_1 %float_0 %float_0 %float_1
%56 = OpConstantComposite %v4float %float_0 %float_1 %float_0 %float_1
%58 = OpConstantComposite %v4float %float_0 %float_0 %float_1 %float_1
%_arr_uint_uint_3 = OpTypeArray %uint %uint_3
%_ptr_Workgroup__arr_uint_uint_3 = OpTypePointer Workgroup %_arr_uint_uint_3
%g1_PrimitiveIndicesNV = OpVariable %_ptr_Workgroup__arr_uint_uint_3 Workgroup
%uint_0 = OpConstant %uint 0
%_ptr_Workgroup_uint = OpTypePointer Workgroup %uint
%uint_2 = OpConstant %uint 2
%g1_PrimitiveCountNV = OpVariable %_ptr_Workgroup_uint Workgroup
%v3uint = OpTypeVector %uint 3
%gl_WorkGroupSize = OpConstantComposite %v3uint %uint_1 %uint_1 %uint_1
%v3float = OpTypeVector %float 3
%_arr_v3float_uint_3 = OpTypeArray %v3float %uint_3
%74 = OpConstantComposite %v3float %float_1 %float_0 %float_0
%75 = OpConstantComposite %v3float %float_0 %float_1 %float_0
%76 = OpConstantComposite %v3float %float_0 %float_0 %float_1
%77 = OpConstantComposite %_arr_v3float_uint_3 %74 %75 %76
%VkDrawIndexedIndirectCommand = OpTypeStruct %uint %uint %uint %int %uint
%_ptr_Uniform_VkDrawIndexedIndirectCommand = OpTypePointer Uniform %VkDrawIndexedIndirectCommand
%80 = OpVariable %_ptr_Uniform_VkDrawIndexedIndirectCommand Uniform
%main = OpFunction %void None %3
%5 = OpLabel
%27 = OpAccessChain %_ptr_Uniform_v2float %_ %int_0 %int_0
%28 = OpLoad %v2float %27
%31 = OpCompositeExtract %float %28 0
%32 = OpCompositeExtract %float %28 1
%33 = OpCompositeConstruct %v4float %31 %32 %float_0 %float_1
%35 = OpAccessChain %_ptr_Workgroup_v4float %g1_MeshVerticesNV %int_0 %int_0
OpStore %35 %33
%37 = OpAccessChain %_ptr_Uniform_v2float %_ %int_0 %int_1
%38 = OpLoad %v2float %37
%39 = OpCompositeExtract %float %38 0
%40 = OpCompositeExtract %float %38 1
%41 = OpCompositeConstruct %v4float %39 %40 %float_0 %float_1
%42 = OpAccessChain %_ptr_Workgroup_v4float %g1_MeshVerticesNV %int_1 %int_0
OpStore %42 %41
%44 = OpAccessChain %_ptr_Uniform_v2float %_ %int_0 %int_2
%45 = OpLoad %v2float %44
%46 = OpCompositeExtract %float %45 0
%47 = OpCompositeExtract %float %45 1
%48 = OpCompositeConstruct %v4float %46 %47 %float_0 %float_1
%49 = OpAccessChain %_ptr_Workgroup_v4float %g1_MeshVerticesNV %int_2 %int_0
OpStore %49 %48
%55 = OpAccessChain %_ptr_Workgroup_v4float %v_out %int_0 %int_0
OpStore %55 %54
%57 = OpAccessChain %_ptr_Workgroup_v4float %v_out %int_1 %int_0
OpStore %57 %56
%59 = OpAccessChain %_ptr_Workgroup_v4float %v_out %int_2 %int_0
OpStore %59 %58
%65 = OpAccessChain %_ptr_Workgroup_uint %g1_PrimitiveIndicesNV %int_0
OpStore %65 %uint_0
%66 = OpAccessChain %_ptr_Workgroup_uint %g1_PrimitiveIndicesNV %int_1
OpStore %66 %uint_1
%68 = OpAccessChain %_ptr_Workgroup_uint %g1_PrimitiveIndicesNV %int_2
OpStore %68 %uint_2
OpStore %g1_PrimitiveCountNV %uint_1
OpReturn
OpFunctionEnd
In here: out variables, for spirv<1.4.

Strategy update, for compute-driven workflow: translate .mesh to .comp - this will simplify code-gen and the C++ workflow.
Extra descriptor set:
struct IndirectCmd { // 32 bytes
uint indexCount;
uint instanceCount;
uint firstIndex; // prefix sum
int vertexOffset; // can be abused to offset into var_buffer
uint firstInstance; // caps: should be zero
uint self; // sequential id of the dispatchMesh call, in render-pass
uint padd0;
uint padd1;
}; // 32 bytes
layout(set = 1, binding = 0, std430) buffer EngineInternal0 {
IndirectCmd cmd[];
} indirect; // indirect buffer, mostly set by CPU, except for indexCount, firstIndex
layout(set = 1, binding = 1, std430) buffer EngineInternal1 {
uint grow;
uint ibo[];
} ind;
layout(set = 1, binding = 2, std430) buffer EngineInternal2 {
uint grow;
uint vbo[];
} var;
layout(set = 1, binding = 3, std430) buffer EngineInternal3 {
uint grow; // and dispatchX
uint dispatchY; // =1
uint dispatchZ; // =1
uint desc[];
} mesh;
layout(set = 1, binding = 4, std430) buffer EngineInternal4 {
uint ibo[];
} indFlat;
Workflow by example:
enc.setFramebuffer({{fbo,Vec4(0,0,1,1),Tempest::Preserve}});
enc.setUniforms(pso,ubo);
enc.dispatchMesh(0,3);
enc.dispatchMesh(3,2);
Will be translated as:
enc.setUniforms(pso_compute_ms,ubo);
// vkCmdBindDescriptorSets(internalSet, dynOffset = 0);
enc.dispatch(3, 1,1);
// vkCmdBindDescriptorSets(internalSet, dynOffset = commandId);
// TODO: pass base taskID somehow
enc.dispatch(2, 1,1);
....
VkBufferMemoryBarrier(comp -> comp, indirect.cmd);
// after all 'dispatchMesh' are done
// the prefix-sum pass does 2 jobs actually:
// indirect.cmd[i].firstIndex = prefixSum(indexCount);
// indirect.cmd[i].indexCount = 0; <-- will be re-accumulated in the compactage pass
enc.setUniforms(psoSum,uboSum);
enc.dispatch(1,1,1); // 1 group with 256 threads
// should be dispatch-indirect
VkBufferMemoryBarrier(comp -> comp, all helper buffers, except var);
enc.setUniforms(psoCompactage,uboCompactage);
enc.dispatchIndirect(mesh.grow,1,1);
VkBufferMemoryBarrier(comp -> vert);
// main rendering, as drawIndirect
enc.setFramebuffer({{fbo,Vec4(0,0,1,1),Tempest::Preserve}});
enc.setUniforms(pso,ubo);
enc.drawIndirect(indirect.cmd[0]);
enc.drawIndirect(indirect.cmd[1]);
// vert -> comp barrier at end of render-pass
Current implementation: vkCmdDrawIndexedIndirect.
TODO: handle VMeshShaderEmulated as a special case in the related pieces of the engine.
First proof-of-concept triangle:
TODO: need to somehow pass firstTask and selfId to the compute shader.
Current idea for passing firstTask and selfId: use the Y/Z inputs of vkCmdDispatchBase.
Use case: vkCmdDispatchBase(impl, firstTask, self, 0, taskCount, 1, 1). This will break some builtin variables.
// workgroup dimensions
in uvec3 gl_NumWorkGroups; // not sure how this interacts with vkCmdDispatchBase
const uvec3 gl_WorkGroupSize; // unaffected
// workgroup and invocation IDs
in uvec3 gl_WorkGroupID; // Y is polluted
in uvec3 gl_LocalInvocationID; // unaffected
// derived variables
in uvec3 gl_GlobalInvocationID; // polluted, since it is byproduct of gl_WorkGroupID
in uint gl_LocalInvocationIndex; // unaffected
Almost there:
Normals are bugged out, because the translator can't handle arrayed varyings
Running stable on OpenGothic:
New idea on how to avoid scratch-buffer traffic problems (and make the solution more Intel-friendly):
Decouple .mesh into separate index and vertex shaders. This can be done, in most cases, if the vertex computation is a uniform-function.
A uniform-function, to me, is:
a function that can use only constants, locals, uniforms, read-only SSBOs and push-constants, in various combinations, and has no side effects.
Similar to a pure function in a way, but less restricted. This will allow moving most of the computation to the vertex shader.
The only problem is gl_WorkGroupID.x, which is used all over the place.
GL_EXT_spirv_intrinsics is out. Surprisingly, it allows bypassing some of the compiler's restrictions: https://shader-playground.timjones.io/626ea18db0663c9ef7d1257940b7a195
Closing: indirect is mostly implemented in the engine (except builtins - ignore them for now).
This ticket is to track ideas/known solutions for GPU-driven rendering.
Vulkan-Extensions
Known production solutions
Current idea
Use VK_NV_mesh_shader as a starting point, and build an emulation layer to enable mesh shaders on a wider range of hardware.