Try / Tempest

3D graphics engine
MIT License

Mesh shader emulation over draw-indirect #38

Open · Try opened 1 year ago

Try commented 1 year ago

Based on #33

The initial implementation is mostly working; this ticket is to track technical debt and profiling work.

TODO:

ERR (won't fix):

Try commented 1 year ago

Profiler view on NV (native mesh-shader is disabled): (image)

Try commented 1 year ago

New idea on how to avoid scratch-buffer traffic problems (and make the solution more Intel-friendly): decouple .mesh into separate index and vertex shaders. This can be done, in most cases, if the vertex computation is a uniform-function.

A uniform-function, to me, is a function that uses only constants, locals, uniforms, read-only SSBOs and push-constants in various combinations, and has no side-effects. Similar to a pure function in a way, but less restricted. This allows moving most of the computation to the vertex shader.

The only problem is gl_WorkGroupID.x, which is used all over the place.
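
For illustration, a minimal sketch (all names hypothetical, not engine code) of the decoupled vertex shader such a uniform-function could lower to:

```glsl
#version 460

// Hypothetical decoupled vertex shader. Everything it reads is a
// push-constant or a read-only SSBO and it has no side-effects, so the
// position math qualifies as a uniform-function in the sense above.
layout(push_constant) uniform Push { mat4 mvp; };
layout(binding = 0, std430) readonly buffer Vbo { vec4 positions[]; };

void main()
{
  // gl_VertexIndex stands in for the (workgroup, thread) pair that the
  // original mesh shader used to address its output vertex; recovering
  // gl_WorkGroupID.x from it is exactly the open problem noted above.
  gl_Position = mvp * positions[gl_VertexIndex];
}
```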

Update to strategy:

Cross-thread semantics are a big issue: different threads can populate different parts of gl_Position/varyings.

Initial thoughts: for any write-out store `out[exp1] = exp2;`, analyze exp1 and exp2:

- exp2, as said above, must be a constant/uniform/input expression
- exp1 must be a simple function of gl_LocalInvocationID
- all other parts of the left-hand side (the `.x` and so on) must be plain constants

For all possible values of gl_LocalInvocationID in [0..31), the engine should populate a mapping table for each varying: gl_LocalInvocationID <--> id in var[id].

If all varyings are written from the same thread, then that gl_LocalInvocationID can be written out to the index buffer.
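
A minimal sketch of a mesh shader that fits these constraints (hypothetical names; shown with the EXT extension for brevity):

```glsl
#version 460
#extension GL_EXT_mesh_shader : require

layout(local_size_x = 32) in;
layout(triangles, max_vertices = 32, max_primitives = 32) out;

layout(push_constant) uniform Push { mat4 mvp; };
layout(binding = 0, std430) readonly buffer Vbo { vec4 positions[]; };
layout(binding = 1, std430) readonly buffer Ibo { uint  indices[];  };

void main()
{
  SetMeshOutputsEXT(32u, 32u);
  uint t = gl_LocalInvocationID.x;
  // Left side: exp1 == t, a trivial function of gl_LocalInvocationID.
  // Right side: exp2 reads only push-constants and read-only SSBOs.
  // The mapping table is the identity (vertex t <-> thread t), so t can
  // be emitted straight into the index buffer.
  gl_MeshVerticesEXT[t].gl_Position = mvp * positions[t];
  gl_PrimitiveTriangleIndicesEXT[t] =
      uvec3(indices[3u*t+0u], indices[3u*t+1u], indices[3u*t+2u]);
}
```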

Try commented 1 year ago

Initial work on ShaderAnalyzer in 203ab2d.

It can roughly estimate the thread-mapping for vertices/varyings in simple (OpenGothic) cases. TODO: handle more advanced control-flow instructions.

Try commented 1 year ago

Mesh emulation is still slower than draw-call spam in the OpenGothic case. The Hi-Z pass has become surprisingly expensive: ~1.4 ms on NVIDIA, with 162k triangles total.

Current ideas:

  1. Render only the closest pieces into HiZ (possible false holes)
  2. Cull HiZ against itself:
     a. Draw the closest pieces normally, cull the rest against the previous HiZ (see the sketch after this list)
     b. Since HiZ is 64x64 at most, compute-driven rasterization is possible
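
For option 2a, a minimal sketch of the per-object HiZ test as a compute pass (assumed: a max-depth pyramid, standard 0-near/1-far depth, screen-space AABBs in UV space; all names hypothetical):

```glsl
#version 460

layout(local_size_x = 64) in;

layout(binding = 0) uniform sampler2D hiZ; // max-depth pyramid, 64x64 at mip 0
layout(binding = 1, std430) readonly buffer Boxes  { vec4  aabb[]; }; // xy=min, zw=max UV
layout(binding = 2, std430) readonly buffer Depths { float zMin[]; }; // nearest depth per box
layout(binding = 3, std430) buffer Out { uint visible[]; };

void main()
{
  uint i = gl_GlobalInvocationID.x;
  if(i >= visible.length())
    return;
  vec4 b = aabb[i];
  // Pick the mip where the box covers roughly one texel, then take 4 corner taps.
  vec2  sz  = (b.zw - b.xy) * vec2(textureSize(hiZ, 0));
  float mip = ceil(log2(max(max(sz.x, sz.y), 1.0)));
  float z   = textureLod(hiZ, b.xy, mip).x;
  z = max(z, textureLod(hiZ, b.zw, mip).x);
  z = max(z, textureLod(hiZ, vec2(b.x, b.w), mip).x);
  z = max(z, textureLod(hiZ, vec2(b.z, b.y), mip).x);
  // Visible if the box can be at least as close as the farthest occluder there.
  visible[i] = (zMin[i] <= z) ? 1u : 0u;
}
```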
Try commented 1 year ago

Testing FPS on Intel UHD:

|                | SM | NoSM |
|----------------|----|------|
| DrawIndexed    | 38 | 53   |
| Compute        | 24 | 32   |
| Compute+Vertex | 29 | 40   |


Try commented 1 year ago

More numbers (times in ms):

|                   | HiZ(M) | HiZ(sort) | HiZ(draw) | HiZ  | SM0  | SM1  | BasePass | Transparent | FPS   | Notes |
|-------------------|--------|-----------|-----------|------|------|------|----------|-------------|-------|-------|
| Nvidia native(MS) |        |           |           | 0.14 |      |      |          |             | 222   |       |
| Intel native(DI)  |        |           |           |      |      |      |          |             | 30.05 |       |
| Nvidia MS0        | 0.46   | 0.53      | 0.10      | 1.66 | 0.36 | 1.21 | 5.13     | 0.8         | 75.00 | «Cake» shading |
| Intel MS0         |        |           |           |      |      |      |          |             | 14.2  |       |
| Nvidia MS-VS      | 0.21   | 0.17      | 0.67      | 1.05 | 0.24 | 0.47 | 1.23     | 0.09        | 86    | VRAM, L1 pressure in draw |
| Intel MS-VS       |        |           |           |      |      |      |          |             | 19.2  |       |
Try commented 1 year ago

Mesh stage is updated to EXT: (image)

Since neither NVidia nor Intel supports compute-to-graphics overlap in the same command buffer, the new take on the runtime is:

  1. Mesh-compute populates the desc-buffer, index-length and scratch buffers. Scratch holds indices + varying data
  2. Prefix pass computes the first index for each draw-call and pre-initializes the indirect buffer (see the sketch after this list)
  3. Sort pass copies only the indices to the back of the scratch buffer
  4. The scratch buffer is then used directly by the indirect draw-pass
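
A minimal sketch of the prefix pass (step 2), under an assumed buffer layout; a real implementation would use a parallel scan instead of one serial thread:

```glsl
#version 460

layout(local_size_x = 1) in;

layout(binding = 0, std430) readonly buffer Counts { uint indexCount[]; };

// Matches VkDrawIndexedIndirectCommand.
struct IndirectCmd { uint indexCount, instanceCount, firstIndex; int vertexOffset; uint firstInstance; };
layout(binding = 1, std430) buffer Indirect { IndirectCmd cmd[]; };

void main()
{
  uint first = 0u;
  for(uint i = 0u; i < uint(cmd.length()); ++i) {
    cmd[i].indexCount    = indexCount[i];
    cmd[i].instanceCount = 1u;
    cmd[i].firstIndex    = first; // running offset into the shared index range
    cmd[i].vertexOffset  = 0;
    cmd[i].firstInstance = 0u;
    first += indexCount[i];
  }
}
```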

TODO: task stage / task payload; nested structs in outputs

Try commented 1 year ago

Hm, the task shader appears to be a way bigger problem than I expected. In a straight indirect-based workflow, one dispatchMeshTask can emit up to gl_NumWorkGroups follow-up mesh grids. That means gl_NumWorkGroups compute-indirect commands. Not to mention there is no clean way to pass the payload to those dispatches.

Some ideas:

  1. Inline mesh stage into task shader:

    void main()
    {
      // run the task stage on the first max_task_threads lanes
      if(gl_LocalInvocationID.x < max_task_threads)
        task_main();
      barrier(); // make sure that the task stage is done
      // then replay each emitted mesh workgroup within the same dispatch
      for(int i=0; i<mesh_groups; ++i)
      {
        if(gl_LocalInvocationID.x < max_mesh_threads)
          mesh_main();
      }
    }

    Cons: won't work reliably with inner barriers; won't be fast with a large expansion factor

  2. Mega-dispatch-indirect: issue only a single vkCmdDispatchIndirect, with the total group count = the sum of workgroups emitted by each task group. Supplement it with some sort of LUT, so a mesh invocation can recover its 'real' gl_GlobalInvocationID and fetch its payload memory (see the sketch below).
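
A minimal sketch of how a workgroup inside the mega-dispatch could recover its task index and 'real' local workgroup id via a prefix-sum LUT (all names and the payload layout are assumptions):

```glsl
#version 460

layout(local_size_x = 32) in;

// firstGroup[t] = exclusive prefix sum of mesh groups emitted by task t.
layout(binding = 0, std430) readonly buffer Lut      { uint firstGroup[]; };
layout(binding = 1, std430) readonly buffer Payloads { uint payload[];    };

void main()
{
  uint g = gl_WorkGroupID.x; // flat id inside the single mega-dispatch
  // Binary search for the last task t with firstGroup[t] <= g.
  uint lo = 0u, hi = uint(firstGroup.length()) - 1u;
  while(lo < hi) {
    uint mid = (lo + hi + 1u) / 2u;
    if(firstGroup[mid] <= g)
      lo = mid;
    else
      hi = mid - 1u;
  }
  uint task       = lo;
  uint localGroup = g - firstGroup[task]; // the 'real' mesh gl_WorkGroupID.x
  uint payloadAt  = task * 16u;           // assumed fixed per-task payload stride
  // ... run the emulated mesh stage with (task, localGroup, payloadAt) ...
}
```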

Try commented 6 months ago

Task + Mesh, all emulated via GPU compute, on Intel UHD: (image)

Try commented 6 months ago

Recent test on Intel, with the timestamp-based profiler (times in ms).

|              | Task | Task-lut | Mesh | Mesh-sort | Draw |
|--------------|------|----------|------|-----------|------|
| HiZ occluder | 0.03 | 0.03     | 0.33 | 0.23      | 0.49 |
| GBuf all     | 0.89 | 0.04     | 2.77 | 0.57      | 4.54 |
| Shadow0      | 0.11 | 0.02     | 0.19 | 0.11      | 0.83 |
| Shadow1      | 0.33 | 0.03     | 1.49 | 0.46      | 3.06 |