Open Try opened 1 year ago
Profiler view on NV (native mesh-shader is disabled):
New idea on how to avoid scratch-buffer traffic problems (and make the solution more Intel-friendly): decouple `.mesh` into separate index and vertex shaders. This can be done, in most cases, if the vertex computation is a uniform-function.
`uniform-function`, to me, is: a function that can use only constants, locals, uniforms, read-only SSBOs and push-constants in various combinations, and has no side effects. It is similar to a pure function in a way, but less restricted. This will allow moving most of the computation to the vertex shader. The only problem is `gl_WorkGroupID.x`, which is used all over the place.
Update to strategy:
Cross-thread semantics is a big issue: different threads can populate different parts of gl_Positions/varyings.
Initial thoughts: for any write-out store `out[exp1] = exp2;` analyze `exp1` and `exp2`:
- `exp2`: as said above, a constant-uniform-input expression
- `exp1`: a simple function of `gl_LocalInvocationID`
- all other parts of the left side (`.x`): straight constants
- for all possible values of `gl_LocalInvocationID` in [0..31), the engine should populate a mapping table for each varying: `gl_LocalInvocationID` <--> `id` in `var[id]`
- if all varyings are written from the same thread, then that `gl_LocalInvocationID` can be written out to the index buffer
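The mapping-table step above can be sketched on the CPU roughly as follows. This is only an illustration under stated assumptions, not the engine's `ShaderAnalyzer`: the names `IndexFn`, `ThreadMap`, `buildThreadMap`, `sameThreadForAll` and the workgroup size `kLocalSize = 32` are hypothetical, and `exp1` is modelled as a plain callable instead of an analyzed SPIR-V expression.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical: number of threads in the mesh workgroup.
constexpr uint32_t kLocalSize = 32;

// Hypothetical model of `exp1`: maps a thread id to the element it writes,
// or std::nullopt when that thread does not write the varying at all.
using IndexFn = std::function<std::optional<uint32_t>(uint32_t)>;

// map[id] = thread that writes var[id]; std::nullopt if no thread writes it.
using ThreadMap = std::vector<std::optional<uint32_t>>;

// Evaluate `exp1` for every thread id and record who writes which element.
// If two threads write the same element, the later thread id wins here.
ThreadMap buildThreadMap(const IndexFn& exp1, uint32_t elementCount) {
  ThreadMap map(elementCount);
  for(uint32_t thread=0; thread<kLocalSize; ++thread) {
    const std::optional<uint32_t> id = exp1(thread);
    if(!id)
      continue;                 // this thread does not write the varying
    assert(*id < elementCount); // sketch only: a real analyzer would bail out instead
    map[*id] = thread;
  }
  return map;
}

// True when every varying is written with the same thread <-> id mapping,
// i.e. the thread id can stand in for the vertex in the index buffer.
bool sameThreadForAll(const std::vector<ThreadMap>& varyings) {
  for(size_t i=1; i<varyings.size(); ++i)
    if(varyings[i]!=varyings[0])
      return false;
  return true;
}
```

For example, if position and color are both written as `out[gl_LocalInvocationID.x]`, the two maps are identical and the check passes; if one of them is written in reverse order, the maps differ and the decoupling would need a fallback.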
Initial work on `ShaderAnalyzer` in 203ab2d.
It can roughly estimate the thread mapping for vertices/varyings in simple (OpenGothic) cases. TODO: handle more advanced control-flow instructions.
Mesh emulation is still slower than draw-call spam in the OpenGothic case. The Hi-Z pass has become surprisingly expensive: ~1.4ms on NVIDIA, with 162k triangles in total.
Current ideas:
Testing FPS on Intel UHD:
| | SM | NoSM |
|---|---|---|
| DrawIndexed | 38 | 53 |
| Compute | 24 | 32 |
| Compute+Vertex | 29 | 40 |
More numbers:
| | | HiZ(M) | HiZ(sort) | HiZ(draw) | HiZ | SM0 | SM1 | BasePass | Transparent | FPS | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Nvidia | native(MS) | | | | 0.14 | | | | | 222 | |
| Intel | native(DI) | | | | | | | | | 30.05 | |
| Nvidia | MS0 | 0.46 | 0.53 | 0.10 | 1.66 | 0.36 | 1.21 | 5.13 | 0.8 | 75.00 | «Cake» shading |
| Intel | | | | | | | | | | 14.2 | |
| Nvidia | MS-VS | 0.21 | 0.17 | 0.67 | 1.05 | 0.24 | 0.47 | 1.23 | 0.09 | 86 | VRAM, L1 pressure in draw |
| Intel | | | | | | | | | | 19.2 | |
Mesh stage is updated to EXT:
Since neither NVIDIA nor Intel supports compute-to-graphics overlap within the same command buffer, the new take on the runtime is:
TODO: task stage/task payload nested structs in outputs
Hm, the task shader appears to be a way bigger problem than I expected. In a straight indirect-based workflow, one `dispatchMeshTask` can emit up to `gl_NumWorkGroups` follow-up mesh grids.
That means `gl_NumWorkGroups` compute-indirect commands. Not to mention: there is no clean way to pass the payload to those dispatches.
Some ideas:
Inline mesh stage into task shader:
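```glsl
void main()
{
  // gl_LocalInvocationID is a uvec3, so compare its .x component
  if(gl_LocalInvocationID.x < max_task_threads)
    task_main();
  barrier(); // make sure that task stage is done
  // run every mesh workgroup emitted by the task stage, one after another
  for(uint i=0; i<mesh_groups; ++i)
  {
    if(gl_LocalInvocationID.x < max_mesh_threads)
      mesh_main();
  }
}
```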
Cons: won't work reliably with inner barriers; won't be fast with a large expansion factor.
Mega-dispatch-indirect:
issue only a single `vkCmdDispatchIndirect`, with total group count = sum of the workgroups emitted by each task group.
Supplement it with some sort of LUT, so each mesh call can work out its 'real' `gl_GlobalInvocationID` and fetch payload memory.
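One possible shape for such a LUT, sketched on the CPU: an exclusive prefix sum over the per-task emitted group counts, searched with a binary search. The thread does not specify the LUT layout, so this layout is an assumption, and `MeshGroupRef`, `buildLut` and `resolve` are hypothetical names; on the GPU the same logic would live in compute shaders.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical result: where a flat workgroup of the mega-dispatch really belongs.
struct MeshGroupRef {
  uint32_t taskGroup; // task workgroup that emitted this mesh workgroup (payload owner)
  uint32_t meshGroup; // mesh workgroup index inside that task's own grid
};

// lut[i] = first flat group id owned by task i; lut.back() = total group count,
// i.e. the x size for the single indirect dispatch.
std::vector<uint32_t> buildLut(const std::vector<uint32_t>& emitted) {
  std::vector<uint32_t> lut(emitted.size()+1, 0);
  for(size_t i=0; i<emitted.size(); ++i)
    lut[i+1] = lut[i] + emitted[i];
  return lut;
}

// Map the flat workgroup id of the mega-dispatch back to (task, mesh group).
MeshGroupRef resolve(const std::vector<uint32_t>& lut, uint32_t flatGroup) {
  assert(lut.size()>=2 && flatGroup<lut.back());
  // First entry strictly greater than flatGroup; its predecessor is the owner.
  // This also skips tasks that emitted zero groups.
  const auto it = std::upper_bound(lut.begin(), lut.end(), flatGroup);
  const uint32_t task = uint32_t(it - lut.begin()) - 1;
  return MeshGroupRef{task, flatGroup - lut[task]};
}
```

With emitted counts `{2, 0, 3}` the LUT is `{0, 2, 2, 5}`: flat group 2 resolves to task 2, mesh group 0, and the payload lookup can be keyed by `taskGroup`.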
Task + Mesh, all emulated via GPU compute, Intel UHD:
`max_primitives` upfront.
Recent test on Intel, with a timestamp-based profiler:
| | Task | Task-lut | Mesh | Mesh-sort | Draw |
|---|---|---|---|---|---|
| HiZ occluder | 0.03 | 0.03 | 0.33 | 0.23 | 0.49 |
| GBuf all | 0.89 | 0.04 | 2.77 | 0.57 | 4.54 |
| Shadow0 | 0.11 | 0.02 | 0.19 | 0.11 | 0.83 |
| Shadow1 | 0.33 | 0.03 | 1.49 | 0.46 | 3.06 |
Based on #33
The initial implementation is practically working; this ticket is to track technical debt and profiling work.
TODO:
- `flat` and other interpolators
- `in uvec3 gl_WorkGroupID` pollution
- `in uvec3 gl_NumWorkGroups` - polluted due to dispatch indirect
- `in uvec3 gl_GlobalInvocationID` // polluted, since it is a byproduct of `gl_WorkGroupID`

ERR (won't fix):