More optimized bbox mesh shaders

pixeljetstream commented 1 year ago

Cool project! You might want to look at https://github.com/nvpro-samples/gl_occlusion_culling, which has a few variants for speeding up bbox rasterization; a mesh shader variant is among them.

The key difference is that it processes multiple boxes per mesh shader workgroup.
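To make the "multiple boxes per workgroup" idea concrete, here is a rough Python sketch of the index math only. It assumes a 32-thread workgroup with one thread per box corner (so 4 boxes per workgroup); the linked sample may well pick a different layout, and all names here are hypothetical.

```python
BOX_VERTICES = 8      # corners of a bounding box
WORKGROUP_SIZE = 32   # assumption: one workgroup = 32 threads
BOXES_PER_GROUP = WORKGROUP_SIZE // BOX_VERTICES  # 4 under this assumption


def workgroups_needed(num_boxes):
    """Number of workgroups when each one handles BOXES_PER_GROUP boxes."""
    return -(-num_boxes // BOXES_PER_GROUP)  # ceiling division


def thread_assignment(group_id, thread_id, num_boxes):
    """Map a thread to (box index, corner index), or None if it has no box."""
    box = group_id * BOXES_PER_GROUP + thread_id // BOX_VERTICES
    corner = thread_id % BOX_VERTICES
    if box >= num_boxes:
        return None  # tail workgroup: this thread idles
    return box, corner


def corner_position(box_min, box_max, corner):
    """Bit a of the corner index selects min or max along axis a."""
    return tuple(
        box_max[a] if (corner >> a) & 1 else box_min[a] for a in range(3)
    )
```

With 10 boxes this gives 3 workgroups instead of 10, and only the last workgroup has idle threads.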

MCRcortex commented 1 month ago

The limiting performance factor is not occlusion testing; it is the VTG ISBE memory that holds the data passed between the mesh and fragment shaders, and it heavily bottlenecks the render pipeline. Reducing the triangle count is the only feasible way to address this, which can be done by implementing proper meshlet culling or, better, tighter bounding boxes.

pixeljetstream commented 1 month ago

Okay, sounds good to cull more! You currently have a lot of spare threads in the task shader, and subgroup intrinsics are great for that.
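As an illustration of why subgroup intrinsics fit culling well, the following Python sketch emulates a ballot-based compaction on the CPU: each lane votes "visible" or not, and each visible lane finds its output slot by counting the set bits below it. The helper names are hypothetical stand-ins for what GLSL's `subgroupBallot` and `subgroupBallotExclusiveBitCount` provide; this is not code from the project.

```python
def subgroup_ballot(flags):
    """Emulate a ballot: bit i of the mask is set if lane i voted True."""
    mask = 0
    for lane, flag in enumerate(flags):
        if flag:
            mask |= 1 << lane
    return mask


def exclusive_bit_count(mask, lane):
    """Count the set bits strictly below `lane` (an exclusive prefix sum)."""
    return bin(mask & ((1 << lane) - 1)).count("1")


def compact_visible(visible):
    """Return the visible lane indices, tightly packed, as a task shader
    might write surviving meshlet IDs into its payload."""
    mask = subgroup_ballot(visible)
    out = [None] * bin(mask).count("1")
    for lane, is_visible in enumerate(visible):
        if is_visible:
            out[exclusive_bit_count(mask, lane)] = lane
    return out
```

For example, `compact_visible([True, False, False, True, True])` yields `[0, 3, 4]`, and the length of the result is the number of mesh workgroups that would need to be launched.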

For lowering ISBE usage: f16vec2 will still expand to fp32 when used in interpolants on NV hardware. To save ISBE you'd have to go with uint and do the unpacking and interpolation yourself using the barycentrics extension (though maybe you already tried?). In general, moving more work to the fragment shader could be a way to balance the ISBE bottleneck. You can even use quad shuffle to distribute some per-vertex work across 3 of the 4 threads in fragment shader quads.
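A minimal Python sketch of the "pack into uint, unpack and interpolate manually" approach, assuming an RGBA8 colour per vertex and a barycentric weight triple like the one the extension exposes in the fragment shader. The packing mirrors the layout of GLSL's `packUnorm4x8` (first component in the lowest byte); the function names are hypothetical.

```python
def pack_rgba8(r, g, b, a):
    """Pack four floats in [0, 1] into one 32-bit uint, r in the low byte."""
    def to_byte(x):
        return int(min(max(x, 0.0), 1.0) * 255.0 + 0.5)
    return to_byte(r) | (to_byte(g) << 8) | (to_byte(b) << 16) | (to_byte(a) << 24)


def unpack_rgba8(packed):
    """Inverse of pack_rgba8: return four floats in [0, 1]."""
    return tuple(((packed >> shift) & 0xFF) / 255.0 for shift in (0, 8, 16, 24))


def interpolate_colour(packed_verts, bary):
    """Fragment-side work: unpack the three per-vertex uints and blend them
    with the barycentric weights instead of relying on a float interpolant."""
    colours = [unpack_rgba8(p) for p in packed_verts]
    return tuple(
        sum(w * c[i] for w, c in zip(bary, colours)) for i in range(4)
    )
```

The per-vertex output shrinks from four 32-bit floats to a single uint, at the cost of three unpacks and a weighted sum per fragment.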

Just as an FYI, the task shader output also counts as ISBE. There is an allocation granularity of 128 bytes, and GL reserves some bytes to store gl_TaskCount (which you can access via gl_NumWorkGroups.x in the mesh shader, by the way, so in theory you can drop the task's quadCount). Its payload input/output should respect non-32-bit sized datatypes properly.
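The effect of the 128-byte granularity can be sketched with a little arithmetic. The exact number of bytes GL reserves is not stated here, so it is a parameter below, and the value 4 used in the example is purely an assumption for illustration.

```python
ALLOCATION_GRANULARITY = 128  # bytes, as mentioned above


def allocated_bytes(payload_bytes, reserved_bytes):
    """Round payload + reserved bytes up to the next multiple of 128.

    `reserved_bytes` stands in for whatever GL reserves for gl_TaskCount;
    the real figure is not given in this thread.
    """
    total = payload_bytes + reserved_bytes
    blocks = max(1, -(-total // ALLOCATION_GRANULARITY))  # ceiling division
    return blocks * ALLOCATION_GRANULARITY
```

Under the assumed 4 reserved bytes, a 124-byte payload fits in one 128-byte block, while a 128-byte payload spills into a second one, so trimming a single field can halve the allocation.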

MCRcortex commented 1 month ago

Hmm yeah, the issue is the lack of proper culling, because when testing, just passing the UV to the frag shader (so no tint/addin/anything else) resulted in double the fps, so it's definitely bottlenecked there. Whether it is bottlenecked there due to the sheer number of primitives it's rendering (of which the majority, I'd say, are subpixel?), I am not sure. When testing a while ago it was pulling 700-800 GB/s on my 3080 Ti (according to the sensor monitor, in some specific scenes) and slamming against the power limit on both my laptop and desktop. Unsure if per-primitive culling is worth it? I do know per-meshlet culling would help to reduce the overdraw and the number of mesh threads/workers spawned, which consume ISBE and power. I haven't tried using the barycentrics extension, no, but the issue is more so interpolating the colours, since those are per-vertex data. I did notice, however, that the Ampere architecture has a warp size of 48 compared to 32 (according to Nsight), which I haven't played around with yet. Thanks for the tips! If you have any more ideas or experiments to try, let me know <3. I was thinking about the subgroup intrinsics, but they seem to be finicky to get right xD; probably going to use them in my other project, however.

Thank you for looking at this project. It's probably not great by your standards, but it was great fun to tinker with mesh shaders! (And especially bindless sparse buffers, which are so cursed they don't even work on Linux.)

pixeljetstream commented 1 month ago

Not sure where you would see that it's 48 threads; that would be a bug in Nsight. The NV architecture has been 32 threads since almost forever, with no indications that it would change.

Especially if a ton is subpixel, moving work to the fragment shader can help, since it's evaluated after small triangles are rejected. I work mostly on optimizations for CAD datasets, which also suffer from this heavy geometry load. Some experiments here: https://github.com/nvpro-samples/gl_vk_meshlet_cadscene

MCRcortex commented 1 month ago

Ah, checking back, it's warp slots, not threads; my apologies.

Thanks! I'll experiment with moving colour, fog and light sampling to be per fragment. Hopefully that yields some performance gains without having to deep-dive into meshoptimizer meshlet generation + Hi-Z culling 😆

pixeljetstream commented 1 month ago

Okay, if you do want to dive into that, here is some code to build the Hi-Z texture using a compute shader: https://github.com/nvpro-samples/vk_displacement_micromaps/blob/main/nvhiz-update.comp.glsl and, while it's Vulkan, this still shouldn't be too hard to extract: https://github.com/nvpro-samples/vk_displacement_micromaps/blob/main/nvhiz_vk.cpp
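For readers unfamiliar with Hi-Z, here is a CPU-side Python sketch of what such a build pass computes. It is not the linked shader. It assumes a depth convention where larger values are farther away, so each coarser level keeps the farthest (maximum) depth of the 2x2 texels below it, and it rounds odd dimensions up with clamped reads to stay conservative.

```python
def downsample_max(level):
    """Build the next coarser level: each texel is the max of a 2x2 block."""
    h, w = len(level), len(level[0])
    new_h, new_w = (h + 1) // 2, (w + 1) // 2
    out = []
    for y in range(new_h):
        row = []
        for x in range(new_w):
            ys = (2 * y, min(2 * y + 1, h - 1))  # clamp at the border
            xs = (2 * x, min(2 * x + 1, w - 1))
            row.append(max(level[yy][xx] for yy in ys for xx in xs))
        out.append(row)
    return out


def build_hiz(depth):
    """Return the full chain of levels, from full resolution down to 1x1."""
    levels = [depth]
    while len(levels[-1]) > 1 or len(levels[-1][0]) > 1:
        levels.append(downsample_max(levels[-1]))
    return levels
```

A 4x4 depth buffer produces three levels (4x4, 2x2, 1x1), and the single texel of the last level holds the farthest depth in the whole buffer.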

And some testing code is found here: https://github.com/nvpro-samples/vk_displacement_micromaps/blob/main/draw_culling.glsl
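The test side of Hi-Z culling can be sketched in Python as follows. This is a simplified illustration, not the linked code. It assumes the Hi-Z levels store the farthest depth per texel (larger = farther), that the bounding box has already been projected to an inclusive pixel rectangle, and that `nearest_depth` is the box's closest depth; all names are hypothetical.

```python
import math


def is_occluded(levels, rect, nearest_depth):
    """Conservative visibility test of a screen rectangle against Hi-Z.

    levels: list of 2D lists, level 0 at full resolution, each next level
            half the size, storing the farthest depth of the texels below.
    rect:   (x0, y0, x1, y1) inclusive pixel bounds at level 0.
    """
    x0, y0, x1, y1 = rect
    size = max(x1 - x0 + 1, y1 - y0 + 1)
    # Pick the level where one texel is at least as large as the rectangle,
    # so the rectangle touches at most 2x2 texels.
    lvl = min(math.ceil(math.log2(size)), len(levels) - 1)
    tex = levels[lvl]
    h, w = len(tex), len(tex[0])
    farthest = max(
        tex[min(y >> lvl, h - 1)][min(x >> lvl, w - 1)]
        for y in (y0, y1)
        for x in (x0, x1)
    )
    # Occluded only if even the box's nearest point is behind everything
    # already drawn in that region.
    return nearest_depth > farthest
```

Because only four texels are read regardless of the box's screen size, the test costs the same for large and small boxes, which is the main appeal of the hierarchy.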

MCRcortex commented 1 month ago

Thank you very much for your time and advice!

MCRcortex / nvidium

More optimized bbox mesh shaders #24