ROCm / MIOpen

AMD's Machine Intelligence Library
https://rocm.docs.amd.com/projects/MIOpen/en/latest/
Other
1.06k stars 221 forks source link

Memory access fault for Gemm solvers after #2969 #3076

Closed junliume closed 2 months ago

junliume commented 3 months ago

@atamazov this issue does not appear if we revert #2969

[Observation]: If we run vgg16 model on 8 GPU concurrently:

python3 micro_benchmarking_pytorch.py --network vgg16 --batch-size 1024 --iterations 100 --distributed_dataparallel --device_ids=0 --rank 0 --world-size 8 --dist-backend nccl --dist-url tcp://127.0.0.1:4332 2>&1 | tee log_rank0 &
python3 micro_benchmarking_pytorch.py --network vgg16 --batch-size 1024 --iterations 100 --distributed_dataparallel --device_ids=1 --rank 1 --world-size 8 --dist-backend nccl --dist-url tcp://127.0.0.1:4332 2>&1 | tee log_rank1 &
python3 micro_benchmarking_pytorch.py --network vgg16 --batch-size 1024 --iterations 100 --distributed_dataparallel --device_ids=2 --rank 2 --world-size 8 --dist-backend nccl --dist-url tcp://127.0.0.1:4332 2>&1 | tee log_rank2 &
python3 micro_benchmarking_pytorch.py --network vgg16 --batch-size 1024 --iterations 100 --distributed_dataparallel --device_ids=3 --rank 3 --world-size 8 --dist-backend nccl --dist-url tcp://127.0.0.1:4332 2>&1 | tee log_rank3 &
python3 micro_benchmarking_pytorch.py --network vgg16 --batch-size 1024 --iterations 100 --distributed_dataparallel --device_ids=4 --rank 4 --world-size 8 --dist-backend nccl --dist-url tcp://127.0.0.1:4332 2>&1 | tee log_rank4 &
python3 micro_benchmarking_pytorch.py --network vgg16 --batch-size 1024 --iterations 100 --distributed_dataparallel --device_ids=5 --rank 5 --world-size 8 --dist-backend nccl --dist-url tcp://127.0.0.1:4332 2>&1 | tee log_rank5 &
python3 micro_benchmarking_pytorch.py --network vgg16 --batch-size 1024 --iterations 100 --distributed_dataparallel --device_ids=6 --rank 6 --world-size 8 --dist-backend nccl --dist-url tcp://127.0.0.1:4332 2>&1 | tee log_rank6 &
python3 micro_benchmarking_pytorch.py --network vgg16 --batch-size 1024 --iterations 100 --distributed_dataparallel --device_ids=7 --rank 7 --world-size 8 --dist-backend nccl --dist-url tcp://127.0.0.1:4332 2>&1 | tee log_rank7 &

we will oserve

MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, std::size_t, ConstData_t, std::size_t, Data_t, std::size_t, GemmBackend_t){
MIOpen(HIP):    "rocBLAS" = rocBLAS
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, std::size_t, ConstData_t, std::size_t, Data_t, std::size_t, GemmBackend_t){
MIOpen(HIP):    "rocBLAS" = rocBLAS
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, std::size_t, ConstData_t, std::size_t, Data_t, std::size_t, GemmBackend_t){
MIOpen(HIP):    "rocBLAS" = rocBLAS
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, std::size_t, ConstData_t, std::size_t, Data_t, std::size_t, GemmBackend_t){
MIOpen(HIP):    "rocBLAS" = rocBLAS
MIOpen(HIP): }s fault by GPU node-2 (Agent handle: 0x8fd5cb0) on address 0x7f90df780000. Reason: Unknown.
E0617 20:33:08.607907 139846246954176 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 4 (pid: 1215) of binary: sb

export MIOPEN_DEBUG_CONV_GEMM=0 can avoid the crash too.

So the running theory:

junliume commented 3 months ago

Hi @atamazov if we cannot identify the real cause in #2969 soon, we may need to revert it asap. Thanks!

atamazov commented 3 months ago

Hello @junliume. Thanks for triaging the issue. Most likely you are right.

So the real cause is that gemm solvers have issues. #2969 uses real gemm solutions for benchmarking instead of "simulations" and now the issues manifest themselves more often. I suspect that if we revert it, then the issue will be just better camouflaged ;/

Most likely the proper fix is narrowing the applicability of the faulty gemm solvers. Please share the list of vgg16 configs with me if you have it on hand.

Meanwhile I recommend trying export MIOPEN_WORKAROUND_ISSUE_2789=0. It was introduced in #2811 which extends applicability of gemm solvers. If it helps, then we can begin with changing the default of the variable to false.

junliume commented 3 months ago

@atamazov here is the list of unique configs:

   ./bin/MIOpenDriver conv -n 1024 -c 3 -H 224 -W 224 -k 64 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 -V 0
   ./bin/MIOpenDriver conv -n 1024 -c 64 -H 224 -W 224 -k 64 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 -V 0
   ./bin/MIOpenDriver conv -n 1024 -c 64 -H 112 -W 112 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 -V 0
   ./bin/MIOpenDriver conv -n 1024 -c 128 -H 112 -W 112 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 -V 0
   ./bin/MIOpenDriver conv -n 1024 -c 128 -H 56 -W 56 -k 256 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 -V 0
   ./bin/MIOpenDriver conv -n 1024 -c 256 -H 56 -W 56 -k 256 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 -V 0
   ./bin/MIOpenDriver conv -n 1024 -c 256 -H 28 -W 28 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 -V 0
   ./bin/MIOpenDriver conv -n 1024 -c 512 -H 28 -W 28 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 -V 0
   ./bin/MIOpenDriver conv -n 1024 -c 512 -H 14 -W 14 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 -V 0
   ./bin/MIOpenDriver conv -n 1024 -c 512 -H 14 -W 14 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1 -V 0
   ./bin/MIOpenDriver conv -n 1024 -c 512 -H 14 -W 14 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1 -V 0
   ./bin/MIOpenDriver conv -n 1024 -c 512 -H 28 -W 28 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1 -V 0
   ./bin/MIOpenDriver conv -n 1024 -c 512 -H 28 -W 28 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1 -V 0
   ./bin/MIOpenDriver conv -n 1024 -c 256 -H 28 -W 28 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1 -V 0
   ./bin/MIOpenDriver conv -n 1024 -c 256 -H 28 -W 28 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1 -V 0
   ./bin/MIOpenDriver conv -n 1024 -c 256 -H 56 -W 56 -k 256 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1 -V 0
   ./bin/MIOpenDriver conv -n 1024 -c 256 -H 56 -W 56 -k 256 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1 -V 0
   ./bin/MIOpenDriver conv -n 1024 -c 128 -H 56 -W 56 -k 256 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1 -V 0
   ./bin/MIOpenDriver conv -n 1024 -c 128 -H 56 -W 56 -k 256 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1 -V 0
   ./bin/MIOpenDriver conv -n 1024 -c 128 -H 112 -W 112 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1 -V 0
   ./bin/MIOpenDriver conv -n 1024 -c 128 -H 112 -W 112 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1 -V 0
   ./bin/MIOpenDriver conv -n 1024 -c 64 -H 112 -W 112 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1 -V 0
   ./bin/MIOpenDriver conv -n 1024 -c 64 -H 112 -W 112 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1 -V 0
   ./bin/MIOpenDriver conv -n 1024 -c 64 -H 224 -W 224 -k 64 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1 -V 0
atamazov commented 3 months ago

@junliume Thanks! Can you let me know if export MIOPEN_WORKAROUND_ISSUE_2789=0 helps, if possible to try it.

bghimireamd commented 3 months ago

@junliume Thanks! Can you let me know if export MIOPEN_WORKAROUND_ISSUE_2789=0 helps, if possible to try it.

When I do

MIOPEN_WORKAROUND_ISSUE_2789=0 MIOPEN_DEBUG_DISABLE_FIND_DB=1 MIOPEN_DEBUG_FIND_ONLY_SOLVER=GemmBwdRest ./bin/MIOpenDriver conv -n 2048 -c 64 -H 224 -W 224 -k 64 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1 -V 0

I can reliably hit Memory access fault by GPU node-2 (Agent handle: 0x2c9cb70) on address 0x7ec59a607000. Reason: Unknown. In this case work space size is 115605504 (which is less then 7287183769) so export MIOPEN_WORKAROUND_ISSUE_2789=0 would not have any effect no?

junliume commented 3 months ago

@junliume Thanks! Can you let me know if export MIOPEN_WORKAROUND_ISSUE_2789=0 helps, if possible to try it.

I tried the env var but it is not effective, we still see the mem fault errors :(

atamazov commented 3 months ago

@bghimireamd @junliume Thanks!

junliume commented 3 months ago

MIOPEN_WORKAROUND_ISSUE_2789=0 MIOPEN_DEBUG_DISABLE_FIND_DB=1 MIOPEN_DEBUG_FIND_ONLY_SOLVER=GemmBwdRest ./bin/MIOpenDriver conv -n 2048 -c 64 -H 224 -W 224 -k 64 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1 -V 0

It's likely caused by here: https://github.com/ROCm/MIOpen/blob/2de28ab6870fede26536fd9fa0fde51d2d5542e5/src/kernels/MIOpenCol2Im2d.cl#L69

bghimireamd commented 3 months ago

MIOPEN_WORKAROUND_ISSUE_2789=0 MIOPEN_DEBUG_DISABLE_FIND_DB=1 MIOPEN_DEBUG_FIND_ONLY_SOLVER=GemmBwdRest ./bin/MIOpenDriver conv -n 2048 -c 64 -H 224 -W 224 -k 64 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1 -V 0

It's likely caused by here:

https://github.com/ROCm/MIOpen/blob/2de28ab6870fede26536fd9fa0fde51d2d5542e5/src/kernels/MIOpenCol2Im2d.cl#L69

when I comment all the code except for https://github.com/ROCm/MIOpen/blob/2de28ab6870fede26536fd9fa0fde51d2d5542e5/src/kernels/MIOpenCol2Im2d.cl#L89 I still see the error. May be im_offset is off lol

atamazov commented 3 months ago

@junliume @bghimireamd Let follow this rule of thumb: https://github.com/ROCm/MIOpen/blob/7cf8180f80174be34ad68c5b988d5158f2602045/src/solver/pooling/forwardNaive.cpp#L125-L131

atamazov commented 3 months ago

Unfortunately this bug is not reproducible with my 16GiB card.

junliume commented 3 months ago

export MIOPEN_WORKAROUND_ISSUE_2789=0

This case?

MIOPEN_DEBUG_DISABLE_FIND_DB=1 MIOPEN_DEBUG_FIND_ONLY_SOLVER=GemmBwdRest ./bin/MIOpenDriver conv -n 2048 -c 64 -H 224 -W 224 -k 64 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1 -V 0

If I change to

long ch_offset = im_ch col_w col_h wei_w wei_h;

it almost feels stuck and may run for a very long time. I think "simulated" GEMM solution might be more useful sometimes than the real ones :)

junliume commented 3 months ago

@atamazov could you compose a PR to make #2969 optional (enable by env var but it;s off by default)? That way we do not need to revert it, and we can resolve this urgent issue for now. Thanks!

atamazov commented 3 months ago

@junliume This seems complicated because the "simulation" code was removed, but possible as a last resort. Note that reverting (or env control) is just another way of camouflaging the real problem. We can simply add missing find-db records (for vgg) and obtain very similar effect.

Let me try to make a full fix. The solver should not be applicable when col2im's limit is hit (i.e. offset overflow happens).

atamazov commented 3 months ago

@junliume I am afraid this would not prevent overflow:

long ch_offset = im_ch col_w col_h wei_w wei_h;

It still performs all 4 muls in 32 bit domain, then cvt to i64.

please try this:

long ch_offset = (long)(im_ch * col_w) * (col_h * wei_w * wei_h);

2+1 muls in 32 bit domain, plus two conversions and 1x 64 bit mul.

Maybe it will show perf better than "stuck". If not, then I would rather simply narrow the applicability.

junliume commented 3 months ago

@junliume I am afraid this would not prevent overflow:

long ch_offset = im_ch col_w col_h wei_w wei_h;

It still performs all 4 muls in 32 bit domain, then cvt to i64.

please try this:

long ch_offset = (long)(im_ch * col_w) * (col_h * wei_w * wei_h);

2+1 muls in 32 bit domain, plus two conversions and 1x 64 bit mul.

Many it will show better perf. If not, then it's better to narrow the applicability.

Sure let me try something similar to this PR https://github.com/ROCm/MIOpen/issues/1956

atamazov commented 3 months ago

@junliume #1956 is about 3D, where we do not have enough fast solvers. For 2D I recommend making the solver inapplicable when 64 bit indices are required.

Also it would be nice to switch from i32 to u32.

bghimireamd commented 3 months ago

I was able to PASS the test after I changed from int (i32) to unsigned int(u32). I had to do the same in https://github.com/ROCm/MIOpen/blob/2de28ab6870fede26536fd9fa0fde51d2d5542e5/src/ocl/utilocl.cpp#L420

atamazov commented 3 months ago

@bghimireamd

I was able to PASS the test after I changed from int (i32) to unsigned int(u32). I had to do the same in

Good! Twice better is something that often helps and should not be underestimated, right? 🤝

atamazov commented 3 months ago

@bghimireamd @junliume

I was able to PASS the test after I changed from int (i32) to unsigned int(u32).

I recommend providing a partial fix based on the above. I expect that a full blown fix is more complex and thus may take too long to implement.