ROCm / MIOpen

AMD's Machine Intelligence Library
https://rocm.docs.amd.com/projects/MIOpen/en/latest/

Failing configs in MIOpen #494

Open JehandadKhan opened 4 years ago

JehandadKhan commented 4 years ago

arch: gfx908

Example Configs:

MIOpenDriver conv -V 0 -i 1 --forw 0 --pad_h 0 --out_channels 256 --fil_w 1 --dilation_w 1 --fil_h 1 --in_h 256 --conv_stride_w 1 --group_count 1 --in_channels 256 --in_w 256 --dilation_h 1 --conv_stride_h 1 --pad_w 0 --batchsize 128 --pad_mode default --mode conv --fil_d 1 --in_d 1 --spatial_dim 2 --conv_stride_d 1 --dilation_d 1 --pad_d 0 --trans_output_pad_d 0
terminate called after throwing an instance of 'std::bad_alloc'
what():  std::bad_alloc

Another example

MIOpenDriver conv -V 0 -i 1 --forw 0 --pad_h 0 --out_channels 12 --fil_w 1 --dilation_w 1 --fil_h 1 --in_h 256 --conv_stride_w 1 --group_count 1 --in_channels 512 --in_w 256 --dilation_h 1 --conv_stride_h 1 --pad_w 0 --batchsize 128 --pad_mode default --mode conv --fil_d 1 --in_d 1 --spatial_dim 2 --conv_stride_d 1 --dilation_d 1 --pad_d 0 --trans_output_pad_d 0
MIOpen Error: /root/dMIOpen/src/ocl/convolutionocl.cpp:739: Buffers cannot be NULL
RunForwardGPU() failed, rc = 0x3
MIOpen Error: /root/dMIOpen/src/ocl/convolutionocl.cpp:2092: Buffers cannot be NULL
MIOpen Error: /root/dMIOpen/src/ocl/convolutionocl.cpp:3338: Buffers cannot be NULL
RunBackwardGPU() failed, rc = 0x30003

A complete list is attached: gfx908.log, gfx900_56.log, gfx900_64.log, gfx906_60.log, gfx906_64.log

atamazov commented 4 years ago

MIOpenDriver needs more memory for the tensors than is available on gfx908. The first config needs ~32.5 GiB, the second ~32.8 GiB. It is possible to use -F 1, -F 2, or -F 4 to run each direction separately.
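
For reference, a rough back-of-the-envelope estimate reproduces that figure. The sketch below assumes (this is an assumption, not read from the driver code) that with -F 0 the driver holds in, out, din and dout on the GPU at the same time:

    #include <cstdio>

    int main()
    {
        // First config: fp32 NCHW, input 128 x 256 x 256 x 256, 1x1 filter,
        // stride 1, no padding, so the output has the same shape as the input.
        const double GiB  = 1024.0 * 1024.0 * 1024.0;
        const double in   = 128.0 * 256 * 256 * 256 * sizeof(float) / GiB; // 8 GiB
        const double out  = in;  // same shape as the input
        const double din  = in;  // gradient w.r.t. the input (backward data)
        const double dout = in;  // gradient w.r.t. the output
        const double wei  = 256.0 * 256 * sizeof(float) / GiB;             // 1x1 filters, negligible
        std::printf("~%.1f GiB of device memory\n", in + out + din + dout + 2 * wei);
        return 0;
    }

This prints ~32.0 GiB; the remaining ~0.5 GiB is presumably workspace and other auxiliary buffers.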

I've tried the first config on a Radeon VII (which has half the memory) with half-sized tensors and it works as expected (-F 0 fails, but -F 1/2/4 work fine).

However, there is less luck with the second config. Clearly there are some problems in the driver; for example, I've just found some places where int is used instead of size_t, which may lead to undefined behavior.
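
To illustrate the int-vs-size_t hazard (this snippet is for illustration only, not taken from the driver): the input tensor of the first config has 128 * 256 * 256 * 256 = 2147483648 elements, i.e. INT_MAX + 1, so computing such a count in int arithmetic overflows:

    #include <cstdio>
    #include <cstddef>

    int main()
    {
        int n = 128, c = 256, h = 256, w = 256;
        long long   as_int    = n * c * h * w;                            // overflows int before widening: UB
        std::size_t as_size_t = static_cast<std::size_t>(n) * c * h * w;  // 2147483648
        std::printf("int arithmetic: %lld, size_t arithmetic: %zu\n", as_int, as_size_t);
        return 0;
    }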

atamazov commented 4 years ago

Let me summarize the possible reasons for issues with very large tensors:

jane-zxy commented 4 years ago

I also see that this is a complicated problem:

  1. Do we have a limitation on the tensor size? tensor(miopen::deref(inputTensor).GetLengths()) may run out of memory if the lengths are really huge.
  2. We don't do any HIP API runtime status checks when calling its functions, e.g. GPUMem(uint32_t ctx, size_t psz, size_t pdata_sz) : _ctx(ctx), sz(psz), data_sz(pdata_sz) { hipMalloc(static_cast<void**>(&buf), data_sz * sz); } -- see the sketch after this list.
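
For example, a status check on the quoted constructor could look like the sketch below. This is a simplified stand-in for the driver's GPUMem, not the actual code:

    #include <hip/hip_runtime.h>
    #include <cstdint>
    #include <stdexcept>
    #include <string>

    struct GPUMem
    {
        GPUMem(uint32_t ctx, size_t psz, size_t pdata_sz)
            : _ctx(ctx), sz(psz), data_sz(pdata_sz)
        {
            // Check the HIP status instead of silently ignoring it.
            const hipError_t rc = hipMalloc(&buf, data_sz * sz);
            if(rc != hipSuccess)
                throw std::runtime_error("hipMalloc of " + std::to_string(data_sz * sz) +
                                         " bytes failed: " + hipGetErrorString(rc));
        }
        ~GPUMem() { hipFree(buf); }

        uint32_t _ctx;
        size_t sz;
        size_t data_sz;
        void* buf = nullptr;
    };
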
JehandadKhan commented 4 years ago

I agree, there are issues. But the driver should not crash by throwing a std::bad_alloc. The termination should be graceful and a correct error message should be reported.

If allocating memory for all the directions makes it fail, then perhaps we should only allocate memory for one direction at a time, run it and then either reuse or re-allocate memory for the other directions.
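
A minimal sketch of that graceful failure on the host side (the helper name and signature are illustrative, not the driver's actual API): convert the allocation failure into a message plus a non-zero return code that the tuning infrastructure can detect.

    #include <cstddef>
    #include <cstdio>
    #include <new>
    #include <vector>

    // Hypothetical helper: allocate a host-side tensor buffer and report
    // an out-of-memory condition instead of letting std::bad_alloc
    // terminate the process.
    int AllocateHostBuffer(std::vector<float>& host, std::size_t elem_count)
    {
        try
        {
            host.resize(elem_count);
        }
        catch(const std::bad_alloc&)
        {
            std::fprintf(stderr,
                         "Error: could not allocate %zu host elements (out of memory)\n",
                         elem_count);
            return 1; // the caller propagates this error code instead of crashing
        }
        return 0;
    }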

JehandadKhan commented 3 years ago

@aserio These errors show up every time we tune, polluting the results with failed configs. Can we set a priority on this and allocate some resources?

aserio commented 3 years ago

I will bring this up in my 1:1 with @daniellowell tomorrow.

atamazov commented 3 years ago

@JehandadKhan

These errors show up every time we tune, polluting the results with failed configs.

Now I understand why this is high priority.

If allocating memory for all the directions makes it fail, then perhaps we should only allocate memory for one direction at a time, run it and then either reuse or re-allocate memory for the other directions.

Can you use this strategy during tuning? Run -F 1, then -F 2, and then -F 4?

atamazov commented 3 years ago

And please let us know how you would like the driver to fail in case of OOGM. Is it enough for the tuning infrastructure if it returns some error code (and, of course, an error message for humans)?

atamazov commented 3 years ago

OOGM := Out Of GPU memory ;)

aserio commented 3 years ago

@daniellowell will follow up

junliume commented 2 years ago

@JehandadKhan @atamazov Still failing with the latest develop:


Error copying data to GPU, status = 1
AllocateBuffersAndCopy() FAILED, rc = 1