Closed Axl-zhang closed 2 months ago
Hello @Axl-zhang,
Thank you for bringing the performance concern regarding our recently supported architecture to our attention. Your feedback is instrumental in ensuring optimal functionality across all platforms.
To address this matter, I'll be promptly directing this issue to our specialized team for thorough investigation and analysis. Sometimes, certain transpose configurations, such as NT, undergo specific tuning efforts to enhance performance on varying architectures.
In the interim, I kindly suggest exploring different transpose configurations and comparing their impact on performance. Trying out alternatives like NT could potentially reveal improvements. Please feel free to report back on any findings or improvements observed with these adjustments. Your assistance in this matter is highly appreciated.
Thank you for your patience and cooperation as we work to resolve this issue.
Best regards, Wasiq
Assigning to Carson for gfx1100 fp32 Tensile tuning.
The headline 61 TFLOPS spec of RDNA3 is kind of a lie: to achieve it, RDNA3 adds a limited fp32 dual-issue capability over RDNA2. To use the dual-issue capability, some stars have to align, though:
- you need to be able to dual issue and do it by hand in asm
- or you need the compiler to optimize it in; LLVM, however, is pretty terrible at optimizing dual issue in, so this rarely just happens
- unlike CDNA2, or fp16 dual issue in Vega, RDNA3 can only dual issue some operations (like adds) but iirc not mults, so again most of the time there is no dual issue, definitely not in gemm
TL;DR: except for very specific circumstances, 30-ish TFLOPS is the max you can expect out of RDNA3
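To make the arithmetic behind the 61 vs. 30-ish TFLOPS figures concrete, here is a rough sketch. The hardware numbers (96 compute units, 64 fp32 lanes per CU, about 2.5 GHz boost clock) are assumptions taken from the public 7900 XTX spec sheet, not from this thread, and `peak_fp32_tflops` is a hypothetical helper name:

```python
def peak_fp32_tflops(compute_units, lanes_per_cu, clock_ghz, dual_issue):
    """Back-of-the-envelope peak fp32 throughput in TFLOPS."""
    flops_per_fma = 2                      # one FMA counts as multiply + add
    issue_width = 2 if dual_issue else 1   # dual issue doubles the per-lane rate
    gflops = compute_units * lanes_per_cu * flops_per_fma * issue_width * clock_ghz
    return gflops / 1000.0

# Assumed 7900 XTX figures (public spec sheet, not from this thread).
CUS, LANES, CLOCK_GHZ = 96, 64, 2.5

headline = peak_fp32_tflops(CUS, LANES, CLOCK_GHZ, dual_issue=True)
single_issue = peak_fp32_tflops(CUS, LANES, CLOCK_GHZ, dual_issue=False)

print(f"with dual issue:    {headline:.2f} TFLOPS")
print(f"without dual issue: {single_issue:.2f} TFLOPS")
```

Under these assumptions the dual-issue figure lands near the advertised 61 TFLOPS, and the single-issue figure is exactly half of it, which is where the 30-ish ceiling comes from.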
Several blogs claim the clpeak test can achieve 80% of theoretical performance if it uses wave64, but Tensile on navi3x currently sets the wavefront size to 32 and only achieves half of theoretical performance.
In wave64 mode the hardware can dual issue halves of the wave. This doesn't help you with operations that cannot be dual issued at all, though, as is the case in gemm, if I understand the ISA documentation correctly.
wave64 mode was disabled in clr due to crossline, so RDNA3 only uses wave32 in HIP. But I found hand-written dual-issue instructions in an early Tensile PR, https://github.com/ROCm/Tensile/pull/1625, which was reverted in https://github.com/ROCm/Tensile/pull/1683
Thanks again for bringing this issue to our attention. We noticed that there hasn't been any activity on this issue for a while. To keep our issue tracker clean and focused on active matters, we will be closing this issue if there is no further activity within the next week.
If you still require assistance or believe this issue needs to remain open, please continue the discussion in this new location.
Thank you for your understanding and cooperation.
Suggestion Description
Benchmarks are shown below. FP32 performance is very poor: the theoretical figure is 61 TFLOPS, but the actual test gives 28 TFLOPS, less than half of theoretical. The RTX 4090 reaches 74 TFLOPS in FP32 (theoretical 81 TFLOPS). Is there any room for further improvement, or are there any suggestions for optimization?
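As a quick sanity check on the numbers quoted above, the following sketch computes the fraction of theoretical peak each card reaches. The measured and theoretical values are the ones stated in this report; the "single-issue ceiling" of half the headline figure is an assumption made here for comparison, and `efficiency` is a hypothetical helper name:

```python
def efficiency(measured_tflops, theoretical_tflops):
    """Fraction of theoretical peak actually achieved."""
    return measured_tflops / theoretical_tflops

# Figures as quoted in this report.
xtx_measured, xtx_theoretical = 28.0, 61.0
rtx_measured, rtx_theoretical = 74.0, 81.0

xtx_eff = efficiency(xtx_measured, xtx_theoretical)
rtx_eff = efficiency(rtx_measured, rtx_theoretical)

# Assumption: without fp32 dual issue the ceiling is half the headline number.
xtx_single_issue_eff = efficiency(xtx_measured, xtx_theoretical / 2)

print(f"7900 XTX vs headline peak:        {xtx_eff:.1%}")
print(f"RTX 4090 vs headline peak:        {rtx_eff:.1%}")
print(f"7900 XTX vs single-issue ceiling: {xtx_single_issue_eff:.1%}")
```

If the halved ceiling is the right baseline, the 28 TFLOPS result sits at roughly the same fraction of its attainable peak as the RTX 4090 result does of its own.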
#####################
test platform:
#########################
FP16 benchmark:
#########################
FP32 benchmark:
rocminfo
Operating System
Ubuntu 22.04.3 LTS
GPU
7900xtx
ROCm Component
rocBLAS