Open random-yang opened 4 months ago
Device Info: Apple M1 Pro
In this case we're not really comparing apples to apples and you're absolutely right that it's about the sorting.
A fairer comparison would isolate the rasterizing and exclude the sorting. In that case I'm getting far more than 60 fps on my machine with this implementation.
The project you're referring to sorts all splats asynchronously in a CPU worker, leaving only the rasterizing to the GPU. This leads to artifacts when you change the view angle: you can see the splats being sorted after your view has already changed.
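To make the CPU-side approach concrete, here is a minimal sketch of the kind of depth sort such a worker might run. It is not taken from either project: the function name `sortSplatsByDepth`, the flat xyz layout of `positions`, and the farthest-first order are all assumptions made for illustration.

```typescript
// Hypothetical helper, not code from either project.
// Returns splat indices ordered farthest-first along the camera's view direction.
type Vec3 = [number, number, number];

function sortSplatsByDepth(
  positions: Float32Array, // assumed layout: x0, y0, z0, x1, y1, z1, ...
  cameraPosition: Vec3,
  cameraForward: Vec3, // assumed to be normalized
): Uint32Array {
  const count = Math.floor(positions.length / 3);

  // Depth of each splat = projection of (splat - camera) onto the forward vector.
  const depths = new Float32Array(count);
  for (let i = 0; i < count; i++) {
    const dx = positions[3 * i] - cameraPosition[0];
    const dy = positions[3 * i + 1] - cameraPosition[1];
    const dz = positions[3 * i + 2] - cameraPosition[2];
    depths[i] = dx * cameraForward[0] + dy * cameraForward[1] + dz * cameraForward[2];
  }

  // Sort an index array instead of moving the splat data itself.
  const order = new Uint32Array(count);
  for (let i = 0; i < count; i++) order[i] = i;
  order.sort((a, b) => depths[b] - depths[a]); // farthest first (assumed order)
  return order;
}
```

If something like this runs in a worker, the renderer keeps drawing with the previous `order` until the new one arrives, which would explain the visible re-sorting after a camera move.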
This implementation relies on synchronous sorting every frame and does so with my own shady implementation of a radix sort which leaves a lot to be desired regarding performance.
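For readers unfamiliar with radix sort, the following is a rough serial sketch of the least-significant-digit variant in TypeScript. It is not the GPU implementation from this repository; the function name `radixSortIndices` and the choice of 8-bit digits are assumptions for illustration. A GPU version runs the same three steps per pass (histogram, prefix sum, scatter) but has to parallelize them across workgroups.

```typescript
// Hypothetical CPU sketch of an LSD radix sort over 32-bit unsigned keys.
// Returns the indices of `keys` in ascending key order (stable).
function radixSortIndices(keys: Uint32Array): Uint32Array {
  const n = keys.length;
  let src = new Uint32Array(n);
  let dst = new Uint32Array(n);
  for (let i = 0; i < n; i++) src[i] = i;

  const counts = new Uint32Array(256);

  // Four passes, one per 8-bit digit, starting from the least significant.
  for (let shift = 0; shift < 32; shift += 8) {
    // 1. Histogram: count how many keys fall into each of the 256 buckets.
    counts.fill(0);
    for (let i = 0; i < n; i++) {
      counts[(keys[src[i]] >>> shift) & 0xff]++;
    }

    // 2. Exclusive prefix sum: turn counts into each bucket's start offset.
    let sum = 0;
    for (let b = 0; b < 256; b++) {
      const c = counts[b];
      counts[b] = sum;
      sum += c;
    }

    // 3. Scatter: place each index at its bucket's next free slot.
    for (let i = 0; i < n; i++) {
      const idx = src[i];
      const bucket = (keys[idx] >>> shift) & 0xff;
      dst[counts[bucket]++] = idx;
    }

    // The output of this pass is the input of the next one.
    const tmp = src;
    src = dst;
    dst = tmp;
  }
  return src;
}
```

Sorting indices rather than the keys themselves keeps the splat data in place, and the stability of each pass is what makes the digit-by-digit approach correct.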
There are many things that can be improved with my implementation, but I would say there are 2 main hurdles to think about.
Apple hardware has a disadvantage compared to e.g. Nvidia in that it lacks the ability to await execution between blocks in a single pass, which prevents us from implementing a state-of-the-art radix sort. github.com/fynv has a faster version of the sort on his GitHub, but unfortunately it doesn't work on Apple hardware.
The WebGPU standard has no easy way of accessing subgroups/warps, which would optimize memory access patterns and hence open up better performance. wgpu_sort solves this by guessing the subgroup size, among other things, and their implementation works very well, giving me rendering at >60 fps. I thought about porting that version to TypeScript, but I didn't get to it.
@MarcusAndreasSvensson Thank you very much for your patient explanation. Your insights are very enlightening. I will also delve into some in-depth research, and I look forward to sharing any new findings with you.
Compared to the WebGL implementation at https://github.com/antimatter15/splat?tab=readme-ov-file, my understanding is that WebGPU performance should be better. I guess the difference is because the radix sort is implemented on the GPU, or am I missing something?