Open junliume opened 2 years ago
It’s quite interesting that if you try the command on a fresh docker container, it would actually work. However, run it a few times, and it consistently fails thereafter.
[] ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 5, Solution: 74/ConvAsmImplicitGemmGTCDynamicFwdXdlops
GPU Kernel Time Forward Conv. Elapsed: 0.018987 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 16, 76, 9, 9, 1, 1, 32, 6303744, 403712, 165888, 332, 30, 0.018987
Forward Convolution Verifies OK on CPU reference (3.61839e-08)
[] ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
:0:rocdevice.cpp :2616: 29908722944 us: 167 : [tid:0x7f3cee38c700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_MEMORY_FAULT: Agent attempted to access an inaccessible address. code: 0x2b
Aborted (core dumped)
[] ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
:0:rocdevice.cpp :2616: 29911124821 us: 169 : [tid:0x7fc0e1ac4700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. code: 0x1008
Aborted (core dumped)
Looks like a HIP runtime problem or issue in the invoker of ConvAsmImplicitGemmGTCDynamicFwdXdlops.
The difference between fresh and used container is: binary cache and find-db. Does removing these helps?
Is this scenario (test passed one or more times, then memory fault(s), then consistently out of resources) stable or random?
/cc @DrizztDoUrden
@atamazov this issue is currently assigned to runtime. Yes clearing cache and config does help:
[First Run Fresh]:/opt/rocm/miopen# ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 MIOpen Forward Conv. Algorithm: 5, Solution: 74/ConvAsmImplicitGemmGTCDynamicFwdXdlops GPU Kernel Time Forward Conv. Elapsed: 0.019431 ms (average) stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs stats: fwd-conv1x1u1, 16, 76, 9, 9, 1, 1, 32, 6303744, 403712, 165888, 324, 29, 0.019431 Forward Convolution Verifies OK on CPU reference (3.61839e-08) [Second Run NOT Fresh]:/opt/rocm/miopen# ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 :0:rocdevice.cpp :2614: 472573914516 us: 163 : [tid:0x7f2699c78700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_ILLEGAL_INSTRUCTION: The agent attempted to execute an illegal shader instruction. code: 0x2a Aborted (core dumped) [Clear Cache]:/opt/rocm/miopen# rm -rf ~/.cache/miopen/ [Clear Config]:/opt/rocm/miopen# rm -rf ~/.config/miopen/ [Third Run Fresh]:/opt/rocm/miopen# ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 MIOpen Forward Conv. Algorithm: 5, Solution: 74/ConvAsmImplicitGemmGTCDynamicFwdXdlops GPU Kernel Time Forward Conv. Elapsed: 0.019076 ms (average) stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs stats: fwd-conv1x1u1, 16, 76, 9, 9, 1, 1, 32, 6303744, 403712, 165888, 330, 30, 0.019076 Forward Convolution Verifies OK on CPU reference (3.61839e-08) [Fourth Run NOT Fresh]:/opt/rocm/miopen# ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 :0:rocdevice.cpp :2614: 472627880390 us: 317 : [tid:0x7fe398062700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. code: 0x1008 Aborted (core dumped)
@junliume
Yes clearing cache and config does help...
Then this is most likely MIOpen issue. I am afraid it'll be back.
Can you please find out if clearing find-db or clearing binary cache helps?
@cderb @JehandadKhan this is an easily reproducible issue with the above mentioned driver command, could you or assign someone to take a look?
@atamazov Can you please investigate this, if you have time ?
@JehandadKhan With pleasure, but I do not have gfx908/90a available. Or there is some MI100/200 node available for open-source developers?
Or this is reproducible on MI50 or Navi21?
@atamazov @JehandadKhan This issue still exists in the latest build, so far I have tested on gfx908 and gfx1030 and only gfx908 has this problem: (I have attached detailed logs) First Run Is okay
MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 5, Solution: 74/ConvAsmImplicitGemmGTCDynamicFwdXdlops
GPU Kernel Time Forward Conv. Elapsed: 0.021280 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 16, 76, 9, 9, 1, 1, 32, 6303744, 403712, 165888,
[pass_log.log](https://github.com/ROCmSoftwarePlatform/MIOpen/files/9249012/pass_log.log)
[fail_log.log](https://github.com/ROCmSoftwarePlatform/MIOpen/files/9249014/fail_log.log)
296, 27, 0.021280
Forward Convolution Verifies OK on CPU reference (3.64455e-08)
Second Run Is NOT okay
root@ixt-rack-148:/opt/rocm# ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 5, Solution: 74/ConvAsmImplicitGemmGTCDynamicFwdXdlops
GPU Kernel Time Forward Conv. Elapsed: 0.021333 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 16, 76, 9, 9, 1, 1, 32, 6303744, 403712, 165888, 295, 27, 0.021333
Forward Convolution FAILED: 0.026341 > 1.5e-05
I tried to reproduce this issue, and found, if every time we manually delete the user db (or just run for the first time, since there is no user db yet), then we can have the correct result.
e.g, using docker compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-dkms-no-npi-hipclang:10584_ubuntu20.04_py3.7_pytorch_rocm5.3_internal_testing_9b13302
# inside above docker, run for first time, will have correct anwser
/opt/rocm//bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
# then, repeat for a second time, you will get computation error
/opt/rocm//bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
However, if before run the cmd we manually delete the user db (which should exist in ~/.config/miopen/*.ufdb.txt)
# inside above docker, run for first time, will have correct anwser
/opt/rocm//bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
# then, manually delete the user db before run the same cmd, you will get the correct result
rm -rf ~/.config/miopen/*.ufdb.txt
/opt/rocm//bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
@JehandadKhan can you please take a look at this behavior?
Wait, https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1619 should disabled above solver.
# first time, manually tune the kernel, will have correct result, and actually that solver reports not applicable
export MIOPEN_FIND_MODE=4
export MIOPEN_FIND_ENFORCE=1
/opt/rocm//bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
# second time, launch normally, will have correct result
/opt/rocm//bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
I guess just retune the db should be fine? @JehandadKhan
I guess just retune the db should be fine?
That may work as a temporary hack for manually picked cases (AFAIK we have one rn, but it is possible to gather them from running tests two times in a row and logging failures, but continuing), but we don't know if there are other cases it would fail at that are not covered by our tests. And, obviously, it is impossible to test every case in sane amount of time.
@junliume
This issue still exists in the latest build...
AFAICS from the logs, you've used latest amd-master (Mainline) which is bfe71031fc21ad1d21d6c6f33b17bbd3243bfb29 and 21 days old. #1619 is not there yet; it is promoted into Staging for now. We shall either promote Staging into Master or wait until release branch is cut and then cherry-pick #1619 directly there.
(I am assuming that the reason of this issue is ConvAsmImplicitGemmGTCDynamicFwdXdlops).
@carlushuang @atamazov Thanks for the detective work! It seems that #1619 is critical. However, recent staging has found that #1619 has caused some performance regressions. Instead of disabling it I think we need to fix it afterall.
https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1675 is for narrow down the non-applicable range
@junliume Is this issue fixed with latest ROCm 6.0.2 (HIP 6.0.32831)? Thanks!
./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
On gfx908 :
On gfx90a: