ROCm / MIOpen

AMD's Machine Intelligence Library
https://rocm.docs.amd.com/projects/MIOpen/en/latest/
Other
1.07k stars 225 forks source link

[bug][gfx908] 1x1 convolution failure passing from MIOpen to rocBLAS #1460

Open junliume opened 2 years ago

junliume commented 2 years ago

./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1

On gfx908 :

MIOpen(HIP): Info [ConvolutionForward] algo = 5, workspace = 0 MIOpen(HIP): Info2 [GetInvoker] Returning an invoker for problem 76x9x9x1x1x32x9x9x16xNCHWxFP32x0x0x1x1x1x1x1xF and algorithm miopenConvolutionFwdAlgoImplicitGEMM :0:rocdevice.cpp :2616: 103657926263 us: 37351: [tid:0x7fe289112700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. code: 0x1008 Aborted (core dumped)

On gfx90a:

MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 0 MIOpen(HIP): Info2 [GetInvoker] Returning an invoker for problem 76x9x9x1x1x32x9x9x16xNCHWxFP32x0x0x1x1x1x1x1xF and algorithm miopenConvolutionFwdAlgoGEMM MIOpen(HIP): auto miopen::solver::GemmFwd1x1_0_1::GetSolution(const miopen::ExecutionContext &, const conv::ProblemDescription &)::(anonymous class)::operator()(const std::vector &)::(anonymous class)::operator()(const miopen::Handle &, const miopen::AnyInvokeParams &) const{ MIOpen(HIP): "convolution, 1x1" = convolution, 1x1 MIOpen(HIP): } MIOpen(HIP): Info2 [CallGemmStridedBatched] gemm_desc: {isColMajor 0, transA 0, transB 0, m 32, n 81, k 76, lda 76, ldb 81, ldc 81, batch_count 16, strideA 0, strideB 6156, strideC 2592, alpha 1, beta 0, dataType 1} MIOpen(HIP): miopenStatus_t miopen::CallGemmStridedBatched(const miopen::Handle &, miopen::GemmDescriptor, ConstData_t, int, ConstData_t, int, Data_t, int, miopen::FindDbKCacheKey *, miopen::GemmBackend_t, bool){ MIOpen(HIP): "rocBLAS" = rocBLAS MIOpen(HIP): } MIOpen Forward Conv. Algorithm: 0, Solution: 88/GemmFwd1x1_0_1

junliume commented 2 years ago

It’s quite interesting that if you try the command on a fresh docker container, it would actually work. However, run it a few times, and it consistently fails thereafter.


[] ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 5, Solution: 74/ConvAsmImplicitGemmGTCDynamicFwdXdlops
GPU Kernel Time Forward Conv. Elapsed: 0.018987 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 16, 76, 9, 9, 1, 1, 32,  6303744, 403712, 165888, 332, 30, 0.018987
Forward Convolution Verifies OK on CPU reference (3.61839e-08)
[] ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
:0:rocdevice.cpp            :2616: 29908722944 us: 167  : [tid:0x7f3cee38c700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_MEMORY_FAULT: Agent attempted to access an inaccessible address. code: 0x2b
Aborted (core dumped)
[] ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
:0:rocdevice.cpp            :2616: 29911124821 us: 169  : [tid:0x7fc0e1ac4700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. code: 0x1008
Aborted (core dumped)
atamazov commented 2 years ago

Looks like a HIP runtime problem or issue in the invoker of ConvAsmImplicitGemmGTCDynamicFwdXdlops.

The difference between fresh and used container is: binary cache and find-db. Does removing these helps?

Is this scenario (test passed one or more times, then memory fault(s), then consistently out of resources) stable or random?

/cc @DrizztDoUrden

junliume commented 2 years ago

@atamazov this issue is currently assigned to runtime. Yes clearing cache and config does help:

[First Run Fresh]:/opt/rocm/miopen# ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 MIOpen Forward Conv. Algorithm: 5, Solution: 74/ConvAsmImplicitGemmGTCDynamicFwdXdlops GPU Kernel Time Forward Conv. Elapsed: 0.019431 ms (average) stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs stats: fwd-conv1x1u1, 16, 76, 9, 9, 1, 1, 32, 6303744, 403712, 165888, 324, 29, 0.019431 Forward Convolution Verifies OK on CPU reference (3.61839e-08) [Second Run NOT Fresh]:/opt/rocm/miopen# ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 :0:rocdevice.cpp :2614: 472573914516 us: 163 : [tid:0x7f2699c78700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_ILLEGAL_INSTRUCTION: The agent attempted to execute an illegal shader instruction. code: 0x2a Aborted (core dumped) [Clear Cache]:/opt/rocm/miopen# rm -rf ~/.cache/miopen/ [Clear Config]:/opt/rocm/miopen# rm -rf ~/.config/miopen/ [Third Run Fresh]:/opt/rocm/miopen# ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 MIOpen Forward Conv. Algorithm: 5, Solution: 74/ConvAsmImplicitGemmGTCDynamicFwdXdlops GPU Kernel Time Forward Conv. Elapsed: 0.019076 ms (average) stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs stats: fwd-conv1x1u1, 16, 76, 9, 9, 1, 1, 32, 6303744, 403712, 165888, 330, 30, 0.019076 Forward Convolution Verifies OK on CPU reference (3.61839e-08) [Fourth Run NOT Fresh]:/opt/rocm/miopen# ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 :0:rocdevice.cpp :2614: 472627880390 us: 317 : [tid:0x7fe398062700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. code: 0x1008 Aborted (core dumped)

atamazov commented 2 years ago

@junliume

Yes clearing cache and config does help...

Then this is most likely MIOpen issue. I am afraid it'll be back. image

Can you please find out if clearing find-db or clearing binary cache helps?

junliume commented 2 years ago

@cderb @JehandadKhan this is an easily reproducible issue with the above mentioned driver command, could you or assign someone to take a look?

JehandadKhan commented 2 years ago

@atamazov Can you please investigate this, if you have time ?

atamazov commented 2 years ago

@JehandadKhan With pleasure, but I do not have gfx908/90a available. Or there is some MI100/200 node available for open-source developers?

atamazov commented 2 years ago

Or this is reproducible on MI50 or Navi21?

junliume commented 2 years ago

@atamazov @JehandadKhan This issue still exists in the latest build, so far I have tested on gfx908 and gfx1030 and only gfx908 has this problem: (I have attached detailed logs) First Run Is okay

MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 5, Solution: 74/ConvAsmImplicitGemmGTCDynamicFwdXdlops
GPU Kernel Time Forward Conv. Elapsed: 0.021280 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 16, 76, 9, 9, 1, 1, 32,  6303744, 403712, 165888, 
[pass_log.log](https://github.com/ROCmSoftwarePlatform/MIOpen/files/9249012/pass_log.log)
[fail_log.log](https://github.com/ROCmSoftwarePlatform/MIOpen/files/9249014/fail_log.log)
296, 27, 0.021280
Forward Convolution Verifies OK on CPU reference (3.64455e-08)

pass_log.log fail_log.log

Second Run Is NOT okay


root@ixt-rack-148:/opt/rocm# ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 5, Solution: 74/ConvAsmImplicitGemmGTCDynamicFwdXdlops
GPU Kernel Time Forward Conv. Elapsed: 0.021333 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 16, 76, 9, 9, 1, 1, 32,  6303744, 403712, 165888, 295, 27, 0.021333
Forward Convolution FAILED: 0.026341 > 1.5e-05
carlushuang commented 2 years ago

I tried to reproduce this issue, and found, if every time we manually delete the user db (or just run for the first time, since there is no user db yet), then we can have the correct result.

e.g, using docker compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-dkms-no-npi-hipclang:10584_ubuntu20.04_py3.7_pytorch_rocm5.3_internal_testing_9b13302

# inside above docker, run for first time, will have correct anwser
/opt/rocm//bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 
# then, repeat for a second time, you will get computation error
/opt/rocm//bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 

However, if before run the cmd we manually delete the user db (which should exist in ~/.config/miopen/*.ufdb.txt)

# inside above docker, run for first time, will have correct anwser
/opt/rocm//bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 
# then, manually delete the user db before run the same cmd, you will get the correct result
rm -rf ~/.config/miopen/*.ufdb.txt
/opt/rocm//bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 

@JehandadKhan can you please take a look at this behavior?

carlushuang commented 2 years ago

Wait, https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1619 should disabled above solver.

# first time, manually tune the kernel, will have correct result, and actually that solver reports not applicable
export MIOPEN_FIND_MODE=4
export MIOPEN_FIND_ENFORCE=1
/opt/rocm//bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 

# second time, launch normally, will have correct result
/opt/rocm//bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 

I guess just retune the db should be fine? @JehandadKhan

DrizztDoUrden commented 2 years ago

I guess just retune the db should be fine?

That may work as a temporary hack for manually picked cases (AFAIK we have one rn, but it is possible to gather them from running tests two times in a row and logging failures, but continuing), but we don't know if there are other cases it would fail at that are not covered by our tests. And, obviously, it is impossible to test every case in sane amount of time.

atamazov commented 2 years ago

@junliume

This issue still exists in the latest build...

AFAICS from the logs, you've used latest amd-master (Mainline) which is bfe71031fc21ad1d21d6c6f33b17bbd3243bfb29 and 21 days old. #1619 is not there yet; it is promoted into Staging for now. We shall either promote Staging into Master or wait until release branch is cut and then cherry-pick #1619 directly there.

(I am assuming that the reason of this issue is ConvAsmImplicitGemmGTCDynamicFwdXdlops).

junliume commented 2 years ago

@carlushuang @atamazov Thanks for the detective work! It seems that #1619 is critical. However, recent staging has found that #1619 has caused some performance regressions. Instead of disabling it I think we need to fix it afterall.

carlushuang commented 2 years ago

https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1675 is for narrow down the non-applicable range

ppanchad-amd commented 6 months ago

@junliume Is this issue fixed with latest ROCm 6.0.2 (HIP 6.0.32831)? Thanks!