I've tested your smallest config on my system with a recent MIOpen, all caches enabled, and ROCm 4.0, and found that even in the best case the overhead is ~1500 ms, which is ridiculous. Further investigation shows that almost all of this time is spent in the hipMemsetAsync() HIP runtime call, invoked from MIOpen's dummy_memset(). This happens during execution of the GEMM algorithm. With MIOPEN_DEBUG_CONV_GEMM=0, the library's actual overhead is ~8 ms.
Most likely this should be addressed to the GEMM algorithm developers and/or to the HIP runtime team.
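For anyone who wants to reproduce the comparison, here is a rough sketch of how the two cases could be timed from the outside. It only measures the wall-clock time of a whole process, so it includes process startup and is coarser than the per-call numbers quoted above. The helper name `time_command` and the idea of passing the benchmark command on the command line are my own assumptions for illustration; only the `MIOPEN_DEBUG_CONV_GEMM=0` setting comes from the measurement described above.

```python
import os
import subprocess
import sys
import time


def time_command(cmd, extra_env=None, repeats=3):
    """Return the best wall-clock time (ms) over `repeats` runs of `cmd`.

    `extra_env` is merged into a copy of the current environment, so the
    parent process environment is left untouched.
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command fails.
        subprocess.run(
            cmd,
            env=env,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        best = min(best, elapsed_ms)
    return best


if __name__ == "__main__":
    # Hypothetical usage: python compare_gemm.py <your benchmark command...>
    cmd = sys.argv[1:]
    if not cmd:
        sys.exit("usage: compare_gemm.py <command> [args...]")
    default_ms = time_command(cmd)
    no_gemm_ms = time_command(cmd, {"MIOPEN_DEBUG_CONV_GEMM": "0"})
    print(f"default:                  {default_ms:.1f} ms")
    print(f"MIOPEN_DEBUG_CONV_GEMM=0: {no_gemm_ms:.1f} ms")
```

The script takes the best of several runs so that warm caches are compared against warm caches; a large gap between the two printed numbers would be consistent with the GEMM path being the source of the overhead, but a profiler is still needed to attribute it to a specific call.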
Questions:
- hipMemsetAsync() and use SetTensor() or something else for that?
- hipMemsetAsync()?

The performance loss is so huge that I am assigning the bug label.

This issue originated from https://github.com/ROCmSoftwarePlatform/MIOpen/issues/717#issuecomment-769829395: