gfx-rs / wgpu

A cross-platform, safe, pure-Rust graphics API.
https://wgpu.rs
Apache License 2.0

[naga] Naga does not generate `for` loops #6521

Open b0nes164 opened 3 days ago

b0nes164 commented 3 days ago

Currently, naga exclusively generates `while` loops when translating code for downstream compilers. This:

To measure the performance impact of this, I implemented my prefix sum demo in wgpu + naga and in dawn + tint, compiling the shaders from the exact same WGSL code. The results on an Apple M1 Pro (8 + 2 version):

[Figure: WebGPU implementation comparison]

I can't be certain that the loop unrolling is 100% of the performance discrepancy but here's what I can say:

//wgpu+naga 23.0
Decoupled Fallback
Estimated Occupancy across GPU: 80
Thread Blocks Launched: 8192
Average Total Spins per Pass: 5527.18
Average Fallbacks Initiated per Pass: 1026.484
Average Successful Fallback Insertions per Pass: 0.082

//dawn+tint
Decoupled Fallback
Estimated Occupancy across GPU: 80
Thread Blocks Launched: 8192
Average Total Spins per Pass: 3625.62
Average Fallbacks Initiated per Pass: 362.902
Average Successful Fallback Insertions per Pass: 0.116



The uptick in performance between wgpu 22.0 and 23.0 suggests that #4972 was also causing slowdowns. However, the fix may be introducing new problems that also contribute to the performance discrepancy; see #6518.

b0nes164 commented 2 days ago

The first thing I forgot to mention is that the data gathered here was already collected on unchecked shaders. If this were purely a performance degradation caused by #6285, I would expect a performance decrease from 22.0 -> 23.0, not the increase we are seeing here.

I also did not put forward what a solution would be. I don't think all `for` loops need to be translated, though that would be nice. Instead, a guarantee to translate `for` loops that meet certain conditions (a compiler-visible constant in the loop condition, no complicated control flow) would be enough to solve these issues.
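To make the proposed guarantee concrete, here is a minimal Rust sketch of the kind of eligibility check I have in mind. Everything in it (`Bound`, `LoopInfo`, `LoopForm`, `choose_form`) is a hypothetical name invented for illustration; none of it is naga's actual IR or API, and the two conditions are just my reading of "compiler-visible constant, no complicated control flow".

```rust
/// Hypothetical description of a loop bound as a translator might see it.
/// This is NOT naga's IR; it is a toy model for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Bound {
    /// Bound is a compile-time constant, e.g. `i < 4u`.
    Constant(u32),
    /// Bound is only known at run time, e.g. `i < params.count`.
    Runtime,
}

/// Hypothetical summary of one source-level `for` loop.
#[derive(Debug, Clone)]
struct LoopInfo {
    bound: Bound,
    /// Assumed to be true when the body contains `break`, `continue`,
    /// an early `return`, or similar non-trivial control flow.
    has_complex_control_flow: bool,
}

/// Which loop shape the backend would emit.
#[derive(Debug, Clone, Copy, PartialEq)]
enum LoopForm {
    /// Emit a real `for` loop that downstream compilers can easily unroll.
    For,
    /// Fall back to the generic `while`-style loop.
    While,
}

/// Emit a `for` loop only when both proposed conditions hold:
/// a constant bound and no complicated control flow in the body.
fn choose_form(info: &LoopInfo) -> LoopForm {
    let constant_bound = matches!(info.bound, Bound::Constant(_));
    if constant_bound && !info.has_complex_control_flow {
        LoopForm::For
    } else {
        LoopForm::While
    }
}
```

Under this sketch, a loop with `Bound::Constant(4)` and a straight-line body would be emitted as a `for` loop, while a loop with a runtime bound or a `break` in the body would keep the existing `while` form. The point is only that the guarantee can be narrow and conservative; how the conditions are actually detected would depend on naga's internals.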