Open fancyIX opened 1 year ago
Seems like even sub_group functions are not used for AMD but only for Intel. I mean https://bashbaug.github.io/OpenCL-Docs/html/OpenCL_Ext.html
Is this related to the subgroup shuffling, which is already implemented for NVIDIA and Intel but not used for AMD? See lines 115 till 144 here and lines 20 till 54 here?
> Is this related to the subgroup shuffling, which is already implemented for NVIDIA and Intel but not used for AMD?
Partially, yes. There is also a lot of LDS reading and writing. Based on my experience optimizing miners for AMD GPUs, I guess using dpp instructions could improve performance a lot.
I'm happy to review a pull request for this feature and/or provide some guidance for anyone that wants to develop this. I don't have time myself (nor the hardware to test on), so we'll have to rely on the community.
@CNugteren without modifying the logic much, just replacing LDS reads/writes, I'm not sure that can improve the performance a lot. It seems like "invert" and "transpose" can be improved a lot. Basically any frequent data exchange between threads in a wavefront could potentially be sped up. Any suggestions on this?
Regarding optimizing the loads/stores from memory, I'm not sure there is that much to gain, but it depends on the matrix dimensions of course. In the ideal case GEMM is compute-bound and not memory-bound. But I'm not familiar with AMD's recent GPU architectures and thus I can't say much about the actual benefits of these load instructions you are talking about.
Regarding improving transpose or invert functions, I also don't think that is where the big gains are, because ideally they don't consume much time, it is the matrix-multiplication kernel itself afterwards that matters most. But again this depends on the actual parameters the user supplies to the CLBlast program. And also every small bit can help, so contributions there are also welcome.
I think the main benefit could be by using these cross-lane operations on AMD GPUs in the same way the current 'shuffle' instructions are used: to move data across threads in a cheap way, instead of going through the local SRAM memories or caches. But again I haven't studied recent AMD architectures much so I don't know about the impact these instructions can have on the total picture.
Can "shuffle" be applied to any OpenCL kernel? Is there a candidate kernel to investigate?
Found this article interesting: https://cnugteren.github.io/tutorial/pages/page10.html Not sure where this shfl logic is in the current implementation. Maybe here: https://github.com/CNugteren/CLBlast/blob/bcd294a93ad0dffbace51103215b1346ec3956df/src/kernels/level3/xgemm_part3.opencl#L240
Seems like we can replace it with AMD OpenCL's extension for subgroup shuffling. Not sure how much that could improve the speed. The time saved on LDS reads/writes may not be much. There may be potential for more wavefronts to run if we save some LDS usage.
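The point about LDS usage and the number of resident wavefronts can be made concrete with a small host-side calculation. This is only a sketch: `max_groups_by_lds` is a hypothetical helper, it looks at LDS alone, and real occupancy is also limited by registers and hardware wave slots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: upper bound on the number of workgroups that can be
 * resident on one compute unit when only LDS capacity is considered.
 * Other limits (registers, wave slots) are deliberately ignored here. */
static size_t max_groups_by_lds(size_t lds_per_cu_bytes, size_t lds_per_group_bytes) {
  assert(lds_per_cu_bytes > 0);
  if (lds_per_group_bytes == 0) {
    return SIZE_MAX;  /* a kernel that uses no LDS is not limited by it */
  }
  return lds_per_cu_bytes / lds_per_group_bytes;  /* integer division rounds down */
}
```

With an assumed 64 KiB of LDS per compute unit (an illustrative figure, not taken from the thread), reducing a kernel's LDS usage from 16 KiB to 8 KiB per workgroup raises this bound from 4 to 8 workgroups.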
> Can "shuffle" be applied to any OpenCL kernel? Is there a candidate kernel to investigate?
The main kernel would be the level 3 GEMM kernel (the regular, not 'direct' one). That kernel covers most of the compute heavy computations of CLBlast.
> Found this article interesting: https://cnugteren.github.io/tutorial/pages/page10.html Not sure where this shfl logic is in the current implementation.
Yes that is the same I think, although that tutorial is quite old compared to the current CLBlast kernel implementation, so some things might have changed.
> Seems like we can replace it with AMD OpenCL's extension for subgroup shuffling.
Indeed, see also the links I posted above that point at the Intel and NVIDIA implementations. You can probably add an AMD version there, and then run the CLBlast GEMM tuner and see if you get more performance out.
@CNugteren while working on a PR that uses cross-lane instructions for subgroup shuffling, I have a question: https://github.com/CNugteren/CLBlast/blob/bcd294a93ad0dffbace51103215b1346ec3956df/src/kernels/level3/xgemm_part3.opencl#L47C3-L47C3
Here it seems that the instruction shfl.sync.idx.b32 only works when "realN" is "float". If N is bigger than 1, or real is double, a single b32 instruction does not seem to be enough, assuming one b32 instruction can only process one 32-bit register.
> Here it seems that the instruction shfl.sync.idx.b32 only works when "realN" is "float". If N is bigger than 1, or real is double, a single b32 instruction does not seem to be enough, assuming one b32 instruction can only process one 32-bit register.
You can see the definition of realN here: https://github.com/CNugteren/CLBlast/blob/bcd294a93ad0dffbace51103215b1346ec3956df/src/kernels/level3/xgemm_part1.opencl#L162 And thus, you can use the define VWN to guard your code. So you can do something like:
#if VWN == 1
// your code
#else
// regular fallback code
#endif
Or you could have a specific implementation for VWN == 2 etc. as well?
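As a rough illustration of that guard, the sketch below emulates it on the host in plain C. Everything in it is an assumption made for illustration: `LANES`, `exchange`, and `USES_SHUFFLE` are hypothetical names, single precision is assumed, an array stands in for the registers of the lanes in a subgroup, and a scratch array stands in for local memory. It is not the CLBlast kernel code.

```c
#include <assert.h>

/* Assumption: VWN is supplied at compile time, as in the thread; default to 1. */
#ifndef VWN
#define VWN 1
#endif

#define LANES 4  /* hypothetical subgroup size, kept small for illustration */

#if VWN == 1
typedef float realN;                     /* one 32-bit word per value */
enum { USES_SHUFFLE = 1 };
#else
typedef struct { float s[VWN]; } realN;  /* several 32-bit words per value */
enum { USES_SHUFFLE = 0 };
#endif

/* Returns the value held by src_lane. regs[i] models the register of lane i. */
static realN exchange(const realN regs[LANES], int src_lane) {
  assert(src_lane >= 0 && src_lane < LANES);
#if VWN == 1
  /* Guarded path: a single 32-bit cross-lane read is enough. */
  return regs[src_lane];
#else
  /* Regular fallback: every lane stores to scratch, then the reader loads. */
  realN scratch[LANES];
  for (int lane = 0; lane < LANES; ++lane) {
    scratch[lane] = regs[lane];
  }
  return scratch[src_lane];
#endif
}
```

Compiling with `-DVWN=2` selects the fallback path, so the wider vector type never reaches the single-word path.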
The current AMD PR doesn't work with 64-bit precision, where a double needs two registers. I will change the PR. But I still don't know whether the current NVIDIA implementation works. It only uses one instruction with a 32-bit operand. How is that supposed to work with 64-bit precision or N greater than 2?
@tyler-utah what do you think? https://github.com/CNugteren/CLBlast/blob/bcd294a93ad0dffbace51103215b1346ec3956df/src/kernels/level3/xgemm_part3.opencl#L47C3-L47C3
> It only uses one instruction with a 32-bit operand. How is that supposed to work with 64-bit precision or N greater than 2?
That NVIDIA feature is simply guarded to only activate in single precision: https://github.com/CNugteren/CLBlast/blob/master/src/utilities/compile.cpp#L69
You can do something similar for AMD.
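Guarding on single precision sidesteps the problem. If 64-bit support were wanted instead, one possible approach (an assumption here, not something the thread confirms for either vendor) is to split each double into two 32-bit halves and move each half with its own 32-bit cross-lane operation. The host-side C sketch below only emulates that idea: `shuffle_b32` is a hypothetical stand-in for a single 32-bit shuffle, and arrays model the per-lane registers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SUBGROUP_SIZE 4  /* hypothetical, kept small for illustration */

/* Stand-in for one 32-bit cross-lane read: returns the word held by src_lane. */
static uint32_t shuffle_b32(const uint32_t lane_regs[SUBGROUP_SIZE], int src_lane) {
  assert(src_lane >= 0 && src_lane < SUBGROUP_SIZE);
  return lane_regs[src_lane];
}

/* Fetches the double held by src_lane using two separate 32-bit shuffles. */
static double shuffle_f64(const double lane_vals[SUBGROUP_SIZE], int src_lane) {
  uint32_t lo[SUBGROUP_SIZE];
  uint32_t hi[SUBGROUP_SIZE];
  /* Each lane splits its own 64-bit value into a low and a high word. */
  for (int lane = 0; lane < SUBGROUP_SIZE; ++lane) {
    uint64_t bits;
    memcpy(&bits, &lane_vals[lane], sizeof bits);
    lo[lane] = (uint32_t)(bits & 0xFFFFFFFFu);
    hi[lane] = (uint32_t)(bits >> 32);
  }
  /* Two 32-bit shuffles, then the halves are recombined bit-for-bit. */
  uint64_t out = ((uint64_t)shuffle_b32(hi, src_lane) << 32) |
                 (uint64_t)shuffle_b32(lo, src_lane);
  double result;
  memcpy(&result, &out, sizeof result);
  return result;
}
```

The same splitting would apply to a vector type with N greater than 1: each 32-bit component needs its own shuffle, which is why a single b32 instruction only covers the scalar float case.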
Many CUDA optimization methods can be migrated to AMD OpenCL. Besides smaller LDS, one big barrier is that OpenCL doesn't have a cross-lane function like CUDA's shfl. However, inline assembly is well supported by the ROCm compiler on Navi cards. We can use dpp instructions to exchange registers between threads even faster. Is anyone interested in this work?