Open b0nes164 opened 3 days ago
The first thing I forgot to mention is that the data gathered here was already collected on unchecked shaders. If this were purely performance degradation caused by #6285, I would expect a performance decrease from 22.0 -> 23.0, not the increase we are seeing here.
I also did not put forward what a solution would be. I don't think all `for` loops need to be translated, though that would be nice. Instead, a guarantee to translate `for` loops that meet certain conditions (a compiler-visible constant in the conditional, no complicated control flow) would be enough to solve these issues.
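To make those conditions concrete, here is a minimal WGSL sketch; the binding layout and the names `values`, `runtime_len`, and `sum` are placeholders of mine, not from the demo:

```wgsl
@group(0) @binding(0) var<storage, read_write> values: array<u32, 64>;
@group(0) @binding(1) var<uniform> runtime_len: u32;

@compute @workgroup_size(1)
fn main() {
    var sum = 0u;

    // Meets the proposed conditions: the trip count is a compiler-visible
    // constant and the body is straight-line code, so translating this as
    // a `for` loop would let the downstream compiler unroll it.
    for (var i = 0u; i < 8u; i = i + 1u) {
        sum = sum + values[i];
    }

    // Would not need the guarantee: the bound is only known at runtime,
    // so unrolling is off the table however the loop is emitted.
    for (var j = 0u; j < runtime_len; j = j + 1u) {
        sum = sum + values[j];
    }

    values[0] = sum;
}
```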
Currently, naga exclusively generates `while` loops when translating code for downstream compilers. This prevents those compilers from recognizing and unrolling loops whose trip counts are compile-time constants.
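As a sketch of what I believe is happening (illustrative only, not naga's verbatim output, and reusing the placeholder names from the snippet above), the lowering expressed back in WGSL looks roughly like this; the MSL backend then emits the second form as a `while (true)` loop with a conditional break:

```wgsl
// Source: a loop whose constant trip count Metal's compiler could see
// and unroll if it survived translation as a plain `for` loop.
for (var i = 0u; i < 8u; i = i + 1u) {
    sum = sum + values[i];
}

// After lowering: the same loop in naga's generic form. The constant
// trip count is now buried behind a conditional break, so the
// downstream compiler no longer treats the loop as unrollable.
var i = 0u;
loop {
    if (i >= 8u) {
        break;
    }
    sum = sum + values[i];
    continuing {
        i = i + 1u;
    }
}
```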
To measure the performance impact this is causing, I implemented my prefix sum demo in wgpu + naga and in dawn + tint, compiling the shaders from the exact same WGSL code. The results on an Apple M1 Pro (8 + 2 version):
I can't be certain that the loop unrolling accounts for 100% of the performance discrepancy, but here's what I can say:
It's not occupancy related. That is to say, it doesn't seem like inefficient translation into Metal is eating up registers. Using a hacky technique, I obtained identical occupancy estimates from all implementations.
It's not an artifact of the data collection. Other metrics gathered during the tests show that threadblocks are indeed waiting longer on wgpu+naga:
```
//wgpu+naga 23.0 Decoupled Fallback
Estimated Occupancy across GPU: 80
Thread Blocks Launched: 8192
Average Total Spins per Pass: 5527.18
Average Fallbacks Initiated per Pass: 1026.484
Average Successful Fallback Insertions per Pass: 0.082

//dawn+tint Decoupled Fallback
Estimated Occupancy across GPU: 80
Thread Blocks Launched: 8192
Average Total Spins per Pass: 3625.62
Average Fallbacks Initiated per Pass: 362.902
Average Successful Fallback Insertions per Pass: 0.116
```