Windows uses AMD's internal source compiler backend; the Linux GPUOpen driver (AMDVLK) uses the open-sourced, LLVM-based AMD lightning backend, so the open-source compiler is confined to what the LLVM intrinsics expose. For example:

- The open-source backend has to use a separate mov.dpp to update the swizzled value, whereas the internal backend can apply the DPP modifier directly to the add/min/max instruction being modified (as sketched below).
- The open-source backend does not control the execution mask directly, so it resorts to thread-ID checks to turn threads on and off, etc.
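To make the mov.dpp point concrete, here is a hand-written sketch in GCN3-style disassembly syntax (not output from either compiler; register choices and DPP controls are illustrative):

```asm
; Open-source LLVM backend (sketch): the DPP swizzle is a separate move,
; so each step of a subgroup reduction costs two instructions.
v_mov_b32_dpp  v1, v0 row_shr:1 row_mask:0xf bank_mask:0xf
v_or_b32       v0, v1, v0

; Internal backend (sketch): the DPP modifier rides on the OR itself,
; so the same step is a single instruction.
v_or_b32_dpp   v0, v0, v0 row_shr:1 row_mask:0xf bank_mask:0xf
```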
I've tried some subgroup code on AMDVLK (RX 470) and I'm seeing very different results compared to Windows. I think the Windows codegen is better: I'm getting a 15% gain on Windows, but just 5% on AMDVLK, from using subgroup ops in my more complex test below. Overall, Windows runs significantly faster after adding subgroup ops; without them the two drivers are within margin of error.
First, a very reduced example: https://github.com/Themaister/Granite/blob/master/tests/assets/shaders/subgroup.comp
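The linked shader is not reproduced here; as a rough reconstruction (my own sketch, not the actual file), a reduced subgroupOr test of this shape would look like:

```glsl
#version 450
#extension GL_KHR_shader_subgroup_arithmetic : require
layout(local_size_x = 64) in;

layout(set = 0, binding = 0) uniform Table { uvec4 table[32]; };
layout(set = 0, binding = 1) buffer IO { uint values[]; };

void main()
{
    // subgroupOr() makes the mask subgroup-uniform, so the loop below can
    // be driven from SGPRs and the table loads can be scalar (s_load).
    uint mask = subgroupOr(values[gl_GlobalInvocationID.x]);
    uvec4 accum = uvec4(0u);
    while (mask != 0u)
    {
        uint bit = uint(findLSB(mask));
        mask &= mask - 1u;
        accum += table[bit];
    }
    values[gl_GlobalInvocationID.x] = accum.x + accum.y + accum.z + accum.w;
}
```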
AMDVLK GCN output:
There's a split DPP and OR, whereas the Windows compiler emits just the OR (with the DPP modifier applied directly). In the scalar load loop, AMDVLK ping-pongs the value through VGPRs and then readfirstlanes it back to an SGPR before doing the scalar load, while the Windows compiler stays in SGPRs throughout. Windows also gets x8/x4 loads where AMDVLK gets three x4 loads.
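In other words, the shape of the difference is roughly this (a hand-written sketch, not the actual listings; registers are made up):

```asm
; AMDVLK (sketch): the uniform value detours through a VGPR before the
; scalar load can consume it.
v_mov_b32            v0, s4
v_readfirstlane_b32  s4, v0
s_load_dwordx4       s[8:11], s[0:1], s4

; Windows (sketch): the value never leaves SGPRs.
s_load_dwordx4       s[8:11], s[0:1], s4
```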
Windows (Driver 18.11.2):
For a more complex example, I tried adding subgroupOr() to my clustered shader: https://github.com/Themaister/Granite/blob/master/assets/shaders/lights/clusterer.h#L105
We have a similar difference: AMDVLK ping-pongs between SGPRs and VGPRs in the loop, while the AMD Windows driver stays in SGPRs.
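The pattern at that line, paraphrased (apply_light and the other names here are hypothetical stand-ins; see clusterer.h for the real code):

```glsl
// Hypothetical sketch of the scalarization pattern; requires
// GL_KHR_shader_subgroup_arithmetic for subgroupOr().
vec3 shade_clustered(uint cluster_mask, vec3 world_pos)
{
    vec3 result = vec3(0.0);
    // OR the per-invocation mask across the subgroup: the loop and the
    // light index become subgroup-uniform.
    uint mask = subgroupOr(cluster_mask);
    while (mask != 0u)
    {
        int index = findLSB(mask);
        mask &= ~(1u << index);
        // 'index' is uniform across the subgroup, so the light data
        // fetches can stay in SGPRs (scalar loads) instead of VGPRs.
        result += apply_light(index, world_pos);
    }
    return result;
}
```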
AMDVLK:
Highlight:
Windows:
Highlight:
SPIR-V: output.asm.txt