Open b0nes164 opened 3 days ago
The first thing I forgot to mention is that the data gathered here was already collected on unchecked shaders. If this were purely performance degradation caused by #6285, I would expect a performance decrease from 22.0 -> 23.0, not the increase we are seeing here.
I also did not put forward what a solution would be. I don't think all `for` loops need to be translated, though that would be nice. Instead, a guarantee to translate `for` loops that meet certain conditions (a compiler-visible constant in the conditional, no complicated control flow) would be enough to solve these issues.
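To make those conditions concrete, here is a minimal WGSL sketch; the binding layout and the names `values`, `runtime_len`, and `sum` are placeholders of mine, not from the demo:

```wgsl
@group(0) @binding(0) var<storage, read_write> values: array<u32, 64>;
@group(0) @binding(1) var<uniform> runtime_len: u32;

@compute @workgroup_size(1)
fn main() {
    var sum = 0u;

    // Meets the proposed conditions: the trip count is a compiler-visible
    // constant and the body is straight-line code, so translating this as
    // a `for` loop would let the downstream compiler unroll it.
    for (var i = 0u; i < 8u; i = i + 1u) {
        sum = sum + values[i];
    }

    // Would not need the guarantee: the bound is only known at runtime,
    // so unrolling is off the table however the loop is emitted.
    for (var j = 0u; j < runtime_len; j = j + 1u) {
        sum = sum + values[j];
    }

    values[0] = sum;
}
```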
Currently, naga exclusively generates `while` loops when translating code for downstream compilers. This prevents those compilers from recognizing and unrolling loops whose trip counts are compile-time constants.
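As a sketch of what I believe is happening (illustrative only, not naga's verbatim output, and reusing the placeholder names from the snippet above), the lowering expressed back in WGSL looks roughly like this; the MSL backend then emits the second form as a `while (true)` loop with a conditional break:

```wgsl
// Source: a loop whose constant trip count Metal's compiler could see
// and unroll if it survived translation as a plain `for` loop.
for (var i = 0u; i < 8u; i = i + 1u) {
    sum = sum + values[i];
}

// After lowering: the same loop in naga's generic form. The constant
// trip count is now buried behind a conditional break, so the
// downstream compiler no longer treats the loop as unrollable.
var i = 0u;
loop {
    if (i >= 8u) {
        break;
    }
    sum = sum + values[i];
    continuing {
        i = i + 1u;
    }
}
```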
To measure the performance impact this is causing, I implemented my prefix sum demo in wgpu + naga and in dawn + tint, compiling the shaders from the exact same WGSL code. The results on an Apple M1 Pro (8 + 2 version):
I can't be certain that the loop unrolling accounts for 100% of the performance discrepancy, but here's what I can say:
It's not occupancy related. That is to say, it doesn't seem like inefficient translation into Metal is eating up registers. Using a hacky technique, I obtained identical occupancy estimates from all implementations.
It's not an artifact of the data collection. Other metrics gathered during the tests show that threadblocks are indeed waiting longer on wgpu+naga:
```
//wgpu+naga 23.0 Decoupled Fallback
Estimated Occupancy across GPU: 80
Thread Blocks Launched: 8192
Average Total Spins per Pass: 5527.18
Average Fallbacks Initiated per Pass: 1026.484
Average Successful Fallback Insertions per Pass: 0.082

//dawn+tint Decoupled Fallback
Estimated Occupancy across GPU: 80
Thread Blocks Launched: 8192
Average Total Spins per Pass: 3625.62
Average Fallbacks Initiated per Pass: 362.902
Average Successful Fallback Insertions per Pass: 0.116
```