ROCm / MIOpen

AMD's Machine Intelligence Library
https://rocm.docs.amd.com/projects/MIOpen/en/latest/

[GEMM] dummy_memset() is ridiculously slow #723

Closed atamazov closed 3 years ago

atamazov commented 3 years ago

Questions:

The performance loss is so huge that I am assigning the bug label.

This issue originated from https://github.com/ROCmSoftwarePlatform/MIOpen/issues/717#issuecomment-769829395:

I've tested your smallest config on my system with recent MIOpen, all caches enabled, and ROCm 4.0, and found that even in the best case the overhead is ~1500ms, which is ridiculous. Further investigation shows that almost all of this time is spent in the hipMemsetAsync() HIP runtime call, invoked from MIOpen's dummy_memset(). This happens during execution of the GEMM algorithm. With MIOPEN_DEBUG_CONV_GEMM=0, the library's actual overhead is ~8ms.

Most likely this should be addressed to the GEMM algorithm developers and/or to the HIP runtime team.

Logs (binary cache enabled, MIOPEN_FIND_MODE=normal):

Example MIOpenDriver command:

MIOPEN_FIND_MODE=normal \
MIOPEN_ENABLE_LOGGING_ELAPSED_TIME=1 \
MIOPEN_LOG_LEVEL=6 \
./bin/MIOpenDriver conv -n 1 -c 3 -H 2 -W 2 -k 8 -x 3 -y 3 -p 1 -q 1 -u 2 -v 2 -V 0 -w 2 -t 1 -i 2 -F 1 \
2>&1 | tee ~/mio/overhead-01.txt
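
For comparison, the same configuration can be re-run with the GEMM algorithm disabled through MIOPEN_DEBUG_CONV_GEMM=0, the setting quoted above as reducing the overhead to ~8ms. The following is only a sketch: the driver path is taken from the command above, while the output file name and the existence check are assumptions added for illustration.

```shell
# Sketch: repeat the run above with the GEMM algorithm disabled.
# DRIVER is the path used in the original command; OUT is a hypothetical log file name.
DRIVER=./bin/MIOpenDriver
OUT="$HOME/mio/overhead-gemm-off.txt"

if [ -x "$DRIVER" ]; then
    # Make sure the log directory exists before tee writes to it.
    mkdir -p "$(dirname "$OUT")"
    MIOPEN_DEBUG_CONV_GEMM=0 \
    MIOPEN_FIND_MODE=normal \
    MIOPEN_ENABLE_LOGGING_ELAPSED_TIME=1 \
    MIOPEN_LOG_LEVEL=6 \
    "$DRIVER" conv -n 1 -c 3 -H 2 -W 2 -k 8 -x 3 -y 3 -p 1 -q 1 -u 2 -v 2 -V 0 -w 2 -t 1 -i 2 -F 1 \
    2>&1 | tee "$OUT"
else
    # Nothing to measure without a built driver; report and continue.
    echo "MIOpenDriver not found at $DRIVER; skipping the comparison run" >&2
fi
```

Comparing the elapsed-time entries in this log against overhead-01.txt should indicate how much of the overhead comes from the GEMM path.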

atamazov commented 3 years ago

A large loss of time is expected, hence the value_high label.

atamazov commented 3 years ago

~It seems like~ #554 resolves this. ~@asroy Am I correct?~