Open JehandadKhan opened 4 years ago
MIOpenDriver needs more memory for tensors than is available on gfx908. The first one needs ~32.5 GiB, the second ~32.8 GiB. It is possible to use -F 1, -F 2, -F 4 to run each direction separately.
I've tried the first config on Radeon VII (which has half as much memory) with half-sized tensors and it works as expected (-F 0 fails, but -F 1/2/4 work fine).
However, there is less luck with the second config. Obviously, there are some problems in the driver; for example, I've just found some places where int is used instead of size_t, which may lead to UB.
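To illustrate the int vs. size_t concern, here is a minimal sketch, not taken from the driver. `ElementCount` and `ExceedsInt` are hypothetical helpers, the dimensions in the usage note are made up, and a 64-bit size_t is assumed. The point is only that the element count of a very big tensor can exceed what a 32-bit int can hold, and signed overflow is undefined behavior.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not part of MIOpenDriver): number of elements in a
// tensor, with every intermediate product computed in size_t.
std::size_t ElementCount(const std::vector<int>& dims)
{
    std::size_t count = 1;
    for(int d : dims)
        count *= static_cast<std::size_t>(d); // widen before multiplying
    return count;
}

// Returns true when the element count does not fit in an int, i.e. when
// int-based size arithmetic would overflow (signed overflow is UB).
bool ExceedsInt(const std::vector<int>& dims)
{
    return ElementCount(dims) >
           static_cast<std::size_t>(std::numeric_limits<int>::max());
}
```

For example, made-up dimensions {64, 512, 512, 512} give 2^33 elements, which `ExceedsInt` flags, while {1, 3, 224, 224} stays well within int range.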
Let me summarize possible reasons for issues with very big tensors:
IsApplicable() of individual solvers
I also see this is a complicated problem:
I agree, there are issues. But the driver should not crash by throwing a std::bad_alloc. The termination should be graceful and a correct error message should be reported.
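A rough sketch of what graceful failure could look like on the host side, assuming the failure comes from a std::vector allocation. `AllocateHostBuffer` and the exit-code constants are hypothetical names, not the driver's actual code, and the GPU-side allocation path is not covered here.

```cpp
#include <cstddef>
#include <cstdio>
#include <new>
#include <stdexcept>
#include <vector>

// Hypothetical exit codes; the real driver may use different values.
constexpr int kOk              = 0;
constexpr int kOutOfHostMemory = 2;

// Tries to size a host buffer to `elements` floats. Instead of letting the
// exception escape and terminate the process, it prints a message and
// returns a non-zero code that the caller can pass on as the exit status.
int AllocateHostBuffer(std::size_t elements, std::vector<float>& out)
{
    try
    {
        out.resize(elements);
    }
    catch(const std::bad_alloc&)
    {
        std::fprintf(stderr, "Error: out of host memory (%zu floats requested)\n", elements);
        return kOutOfHostMemory;
    }
    catch(const std::length_error&)
    {
        // Thrown when the request exceeds std::vector::max_size().
        std::fprintf(stderr, "Error: buffer too large (%zu floats requested)\n", elements);
        return kOutOfHostMemory;
    }
    return kOk;
}
```

A caller would check the return value and exit with it, so a tuning script sees an error code and a readable message rather than an abnormal termination.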
If allocating memory for all the directions makes it fail, then perhaps we should only allocate memory for one direction at a time, run it and then either reuse or re-allocate memory for the other directions.
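The effect of this proposal on peak memory can be sketched as follows. This is an illustration under a simplifying assumption: each direction's buffers are fully released before the next direction starts. In practice the saving depends on how many buffers are shared between directions. `DirectionJob` and both functions are hypothetical, and the sizes in the usage note are invented.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical description of one direction's run. `flag` mirrors the
// -F values from this thread (1, 2, 4); `bytes_needed` is the memory
// that direction needs for its buffers.
struct DirectionJob
{
    int flag;
    std::size_t bytes_needed;
};

// Peak memory if buffers for all directions are allocated up front.
std::size_t PeakAllAtOnce(const std::vector<DirectionJob>& jobs)
{
    std::size_t total = 0;
    for(const auto& job : jobs)
        total += job.bytes_needed;
    return total;
}

// Peak memory if each direction allocates, runs, and frees in turn.
std::size_t PeakOneAtATime(const std::vector<DirectionJob>& jobs)
{
    std::size_t peak = 0;
    for(const auto& job : jobs)
        peak = std::max(peak, job.bytes_needed);
    return peak;
}
```

With invented sizes of 10, 12, and 11 units for the three directions and a budget of 32 units, allocating everything at once needs 33 units and does not fit, while running one direction at a time peaks at 12 units.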
@aserio These errors show up every time we tune, polluting the results with failed configs. Can we set a priority on this and allocate some resources?
I will bring this up in my 1:1 with @daniellowell tomorrow.
@JehandadKhan
These errors show up every time we tune, polluting the results with failed configs.
Now I understand the reason why this is high priority.
If allocating memory for all the directions makes it fail, then perhaps we should only allocate memory for one direction at a time, run it and then either reuse or re-allocate memory for the other directions.
Can you use this strategy during tuning? Run -F 1, then -F 2, and then -F 4?
And please let us know how you would like the driver to fail in case of OOGM. Is it enough for the tuning infrastructure if it returns some error code (and an error message, of course, for humans)?
OOGM := Out Of GPU Memory ;)
@daniellowell will follow up
@JehandadKhan @atamazov
Still failing with the latest develop:
Error copying data to GPU, status = 1
AllocateBuffersAndCopy() FAILED, rc = 1
arch: gfx908
Example Configs:
Another example
A complete list is attached: gfx908.log gfx900_56.log gfx900_64.log gfx906_60.log gfx906_64.log