Try / Tempest

3D graphics engine
MIT License

Mesh shader emulation over draw-indirect #38

Open · Try opened 1 year ago

Try commented 1 year ago

Based on #33

The initial implementation is mostly working; this ticket is to track technical debt and profiling work.

TODO:

ERR (won't fix):

Try commented 1 year ago

Profiler view on NV (native mesh-shader is disabled): (image)

Try commented 1 year ago

New idea on how to avoid scratch-buffer traffic problems (and make the solution more Intel-friendly): decouple .mesh into separate index and vertex shaders. This can be done, in most cases, if the vertex computation is a uniform-function.

A uniform-function, to me, is a function that uses only constants, locals, uniforms, read-only SSBOs and push-constants in various combinations, and has no side-effects. Similar to a pure function in a way, but less restricted. This allows moving most of the computation to the vertex shader.

The only problem is gl_WorkGroupID.x, which is used all over the place.
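
For illustration, a minimal sketch (all names hypothetical, not engine code) of the decoupled vertex shader such a uniform-function could lower to:

```glsl
#version 460

// Hypothetical decoupled vertex shader. Everything it reads is a
// push-constant or a read-only SSBO and it has no side-effects, so the
// position math qualifies as a uniform-function in the sense above.
layout(push_constant) uniform Push { mat4 mvp; };
layout(binding = 0, std430) readonly buffer Vbo { vec4 positions[]; };

void main()
{
  // gl_VertexIndex stands in for the (workgroup, thread) pair that the
  // original mesh shader used to address its output vertex; recovering
  // gl_WorkGroupID.x from it is exactly the open problem noted above.
  gl_Position = mvp * positions[gl_VertexIndex];
}
```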

Update to strategy:

Cross-thread semantics are a big issue: different threads can populate different parts of gl_Position/varyings.

Initial thoughts: for any write-out store `out[exp1] = exp2;`, analyze exp1 and exp2:

- exp2, as said above, must be a constant/uniform/input expression
- exp1 must be a simple function of gl_LocalInvocationID
- all other parts of the left-hand side (the `.x` and so on) must be plain constants

For all possible values of gl_LocalInvocationID in [0..31), the engine should populate a mapping table for each varying: gl_LocalInvocationID <--> id in var[id].

If all varyings are written from the same thread, then that gl_LocalInvocationID can be written out to the index buffer.
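
A minimal sketch of a mesh shader that fits these constraints (hypothetical names; shown with the EXT extension for brevity):

```glsl
#version 460
#extension GL_EXT_mesh_shader : require

layout(local_size_x = 32) in;
layout(triangles, max_vertices = 32, max_primitives = 32) out;

layout(push_constant) uniform Push { mat4 mvp; };
layout(binding = 0, std430) readonly buffer Vbo { vec4 positions[]; };
layout(binding = 1, std430) readonly buffer Ibo { uint  indices[];  };

void main()
{
  SetMeshOutputsEXT(32u, 32u);
  uint t = gl_LocalInvocationID.x;
  // Left side: exp1 == t, a trivial function of gl_LocalInvocationID.
  // Right side: exp2 reads only push-constants and read-only SSBOs.
  // The mapping table is the identity (vertex t <-> thread t), so t can
  // be emitted straight into the index buffer.
  gl_MeshVerticesEXT[t].gl_Position = mvp * positions[t];
  gl_PrimitiveTriangleIndicesEXT[t] =
      uvec3(indices[3u*t+0u], indices[3u*t+1u], indices[3u*t+2u]);
}
```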

Try commented 1 year ago

Initial work on ShaderAnalyzer in 203ab2d.

It can roughly estimate the thread-mapping for vertices/varyings in simple (OpenGothic) cases. TODO: handle more advanced control-flow instructions.

Try commented 1 year ago

Mesh emulation is still slower than draw-call spam in the OpenGothic case. The Hi-Z pass has become surprisingly expensive: ~1.4 ms on NVIDIA, with 162k triangles total.

Current ideas:

  1. Render only the closest pieces into HiZ (possible false holes)
  2. Cull HiZ against itself:
     a. Draw the closest pieces normally, cull the rest against the previous HiZ (see the sketch after this list)
     b. Since HiZ is 64x64 at most, compute-driven rasterization is possible
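
For option 2a, a minimal sketch of the per-object HiZ test as a compute pass (assumed: a max-depth pyramid, standard 0-near/1-far depth, screen-space AABBs in UV space; all names hypothetical):

```glsl
#version 460

layout(local_size_x = 64) in;

layout(binding = 0) uniform sampler2D hiZ; // max-depth pyramid, 64x64 at mip 0
layout(binding = 1, std430) readonly buffer Boxes  { vec4  aabb[]; }; // xy=min, zw=max UV
layout(binding = 2, std430) readonly buffer Depths { float zMin[]; }; // nearest depth per box
layout(binding = 3, std430) buffer Out { uint visible[]; };

void main()
{
  uint i = gl_GlobalInvocationID.x;
  if(i >= visible.length())
    return;
  vec4 b = aabb[i];
  // Pick the mip where the box covers roughly one texel, then take 4 corner taps.
  vec2  sz  = (b.zw - b.xy) * vec2(textureSize(hiZ, 0));
  float mip = ceil(log2(max(max(sz.x, sz.y), 1.0)));
  float z   = textureLod(hiZ, b.xy, mip).x;
  z = max(z, textureLod(hiZ, b.zw, mip).x);
  z = max(z, textureLod(hiZ, vec2(b.x, b.w), mip).x);
  z = max(z, textureLod(hiZ, vec2(b.z, b.y), mip).x);
  // Visible if the box can be at least as close as the farthest occluder there.
  visible[i] = (zMin[i] <= z) ? 1u : 0u;
}
```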
Try commented 1 year ago

Testing FPS on Intel UHD:

|                | SM | NoSM |
|----------------|----|------|
| DrawIndexed    | 38 | 53   |
| Compute        | 24 | 32   |
| Compute+Vertex | 29 | 40   |


Try commented 1 year ago

More numbers (times in ms):

|                   | HiZ(M) | HiZ(sort) | HiZ(draw) | HiZ  | SM0  | SM1  | BasePass | Transparent | FPS   | Notes |
|-------------------|--------|-----------|-----------|------|------|------|----------|-------------|-------|-------|
| Nvidia native(MS) |        |           |           | 0.14 |      |      |          |             | 222   |       |
| Intel native(DI)  |        |           |           |      |      |      |          |             | 30.05 |       |
| Nvidia MS0        | 0.46   | 0.53      | 0.10      | 1.66 | 0.36 | 1.21 | 5.13     | 0.8         | 75.00 | «Cake» shading |
| Intel MS0         |        |           |           |      |      |      |          |             | 14.2  |       |
| Nvidia MS-VS      | 0.21   | 0.17      | 0.67      | 1.05 | 0.24 | 0.47 | 1.23     | 0.09        | 86    | VRAM, L1 pressure in draw |
| Intel MS-VS       |        |           |           |      |      |      |          |             | 19.2  |       |
Try commented 1 year ago

Mesh stage is updated to EXT: (image)

Since neither NVidia nor Intel supports compute-to-graphics overlap in the same command buffer, the new take on the runtime is:

  1. Mesh-compute populates the desc-buffer, index-length and scratch buffers. Scratch holds indices + varying data
  2. Prefix pass computes the first index for each draw-call and pre-initializes the indirect buffer (see the sketch after this list)
  3. Sort pass copies only the indices to the back of the scratch buffer
  4. The scratch buffer is then used directly by the indirect draw-pass
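
A minimal sketch of the prefix pass (step 2), under an assumed buffer layout; a real implementation would use a parallel scan instead of one serial thread:

```glsl
#version 460

layout(local_size_x = 1) in;

layout(binding = 0, std430) readonly buffer Counts { uint indexCount[]; };

// Matches VkDrawIndexedIndirectCommand.
struct IndirectCmd { uint indexCount, instanceCount, firstIndex; int vertexOffset; uint firstInstance; };
layout(binding = 1, std430) buffer Indirect { IndirectCmd cmd[]; };

void main()
{
  uint first = 0u;
  for(uint i = 0u; i < uint(cmd.length()); ++i) {
    cmd[i].indexCount    = indexCount[i];
    cmd[i].instanceCount = 1u;
    cmd[i].firstIndex    = first; // running offset into the shared index range
    cmd[i].vertexOffset  = 0;
    cmd[i].firstInstance = 0u;
    first += indexCount[i];
  }
}
```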

TODO: task stage / task payload; nested structs in outputs

Try commented 1 year ago

Hm, the task shader appears to be a way bigger problem than I expected. In a straight indirect-based workflow, one dispatchMeshTask can emit up to gl_NumWorkGroups follow-up mesh grids. That means gl_NumWorkGroups compute-indirect commands. Not to mention there is no clean way to pass the payload to those dispatches.

Some ideas:

  1. Inline mesh stage into task shader:

    void main()
    {
      // run the task stage on the first max_task_threads lanes
      if(gl_LocalInvocationID.x < max_task_threads)
        task_main();
      barrier(); // make sure that the task stage is done
      // then replay each emitted mesh workgroup within the same dispatch
      for(int i=0; i<mesh_groups; ++i)
      {
        if(gl_LocalInvocationID.x < max_mesh_threads)
          mesh_main();
      }
    }

    Cons: won't work reliably with inner barriers; won't be fast with a large expansion factor

  2. Mega-dispatch-indirect: issue only a single vkCmdDispatchIndirect, with the total group count = the sum of workgroups emitted by each task group. Supplement it with some sort of LUT, so a mesh invocation can recover its 'real' gl_GlobalInvocationID and fetch its payload memory (see the sketch below).
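
A minimal sketch of how a workgroup inside the mega-dispatch could recover its task index and 'real' local workgroup id via a prefix-sum LUT (all names and the payload layout are assumptions):

```glsl
#version 460

layout(local_size_x = 32) in;

// firstGroup[t] = exclusive prefix sum of mesh groups emitted by task t.
layout(binding = 0, std430) readonly buffer Lut      { uint firstGroup[]; };
layout(binding = 1, std430) readonly buffer Payloads { uint payload[];    };

void main()
{
  uint g = gl_WorkGroupID.x; // flat id inside the single mega-dispatch
  // Binary search for the last task t with firstGroup[t] <= g.
  uint lo = 0u, hi = uint(firstGroup.length()) - 1u;
  while(lo < hi) {
    uint mid = (lo + hi + 1u) / 2u;
    if(firstGroup[mid] <= g)
      lo = mid;
    else
      hi = mid - 1u;
  }
  uint task       = lo;
  uint localGroup = g - firstGroup[task]; // the 'real' mesh gl_WorkGroupID.x
  uint payloadAt  = task * 16u;           // assumed fixed per-task payload stride
  // ... run the emulated mesh stage with (task, localGroup, payloadAt) ...
}
```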

Try commented 6 months ago

Task + Mesh, all emulated via GPU compute, on Intel UHD: (image)

Try commented 6 months ago

Recent test on Intel, with the timestamp-based profiler (times in ms).

|              | Task | Task-lut | Mesh | Mesh-sort | Draw |
|--------------|------|----------|------|-----------|------|
| HiZ occluder | 0.03 | 0.03     | 0.33 | 0.23      | 0.49 |
| GBuf all     | 0.89 | 0.04     | 2.77 | 0.57      | 4.54 |
| Shadow0      | 0.11 | 0.02     | 0.19 | 0.11      | 0.83 |
| Shadow1      | 0.33 | 0.03     | 1.49 | 0.46      | 3.06 |