Closed pxl-th closed 1 year ago
Hello, I think we need to know more about your environment or configuration. You tried reducing the threads: was that with the `install.sh` argument `--jobs`? What number did you try, and did it still fail? It shows "launching 128 threads", but it should show a reduced number. Or are you using other environment-variable compiler controls for `hipcc`?
If you call `cmake` directly rather than `install.sh`, you can pass `-DTensile_CPU_THREADS=4` or similar.
Otherwise, please provide more environment info; see the new bug issue template for suggestions. What does `ulimit -a` report? LLVM could be constrained in some other way. Do you have >= 64 GB of RAM? Is this a custom LLVM build you are using? You could also try `export HIPCC_COMPILE_FLAGS_APPEND="-parallel-jobs=1"`.
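As a concrete sketch of those checks (the `-parallel-jobs=1` value is just an example, chosen to minimize per-compile memory):

```shell
# Show resource limits (max user processes, virtual memory, open files)
# that could be starving the parallel LLVM jobs.
ulimit -a

# Ask hipcc to run only one parallel compile job per invocation.
export HIPCC_COMPILE_FLAGS_APPEND="-parallel-jobs=1"
echo "$HIPCC_COMPILE_FLAGS_APPEND"
```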
Sorry for the lack of info.
Here's the full recipe used by BinaryBuilder: link.
I tried the build on two machines.
When building with `glibc`, it builds fine on both machines, using all available threads, without OOM.
When building with `musl`, it OOMs.
I've tried reducing the number of threads using various parameters, including `-DTensile_CPU_THREADS`, but that didn't help. The lowest I've tried was 4 CPU threads. I can still try with 1 thread, but a full build takes a long time, so I haven't gotten to it yet.
This is the case only when building all targets. If I specify a concrete target, it builds fine. So maybe, for `musl`, we really need > 64 GB of RAM for some reason...
On the note of RAM consumption: when rocBLAS is built with all amdgpu targets and the first call to `rocblas_sgemm` happens (for example), RAM usage rapidly increases by ~9-10 GB. I was wondering if this is expected behaviour? If I compile rocBLAS for a single target, like `gfx1030`, the increase in RAM consumption is much smaller (~1 GB).
Here's a concrete example from the Julia language. `AMDGPU.rocBLAS.gemm!` is a very thin wrapper around `ccall`, so the allocations do not come from Julia.
```julia
$ HSA_OVERRIDE_GFX_VERSION=10.3.0 julia --project=.
julia> using AMDGPU
julia> to_gb(x) = x / (1024^3)
julia> get_used_memory() = to_gb(Sys.total_physical_memory() - Sys.free_physical_memory())
julia> x = AMDGPU.rand(Float32, 16, 16);
julia> y = AMDGPU.rand(Float32, 16, 16);
julia> b = AMDGPU.rand(Float32, 16, 16);
julia> get_used_memory()
6.750175476074219
julia> AMDGPU.rocBLAS.gemm!('N', 'N', 1f0, x, b, 0f0, y);
julia> get_used_memory()
15.572715759277344
```
Thanks for the feedback and recipe. Can you instead build with `-DTensile_LIBRARY_FORMAT=msgpack`, which has been the default for a while?
It could be that parallel processing of YAML, which is no longer the default format, is spiking memory use during compilation.
Also, can you cap the parallel build: instead of `${nproc}`, use at most 16. I am not certain where the OOM happened; building with a verbose compile flag would make that clearer. But still use e.g. `-DTensile_CPU_THREADS=8`. Also, if you can, report the memory use of a single clang instance (via `top`) just before the OOM, so I can see whether it is around 2 GB per instance or more; you can then divide your RAM by that figure to find a process/thread count to use.
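Putting those suggestions together, a rough sketch of the configure step (the source path is a placeholder, not the exact BinaryBuilder invocation, and the flag values are the examples suggested above):

```shell
# Cap the parallel build at 16 jobs rather than using all of ${nproc}.
jobs=$(nproc)
if [ "$jobs" -gt 16 ]; then jobs=16; fi

# Hypothetical out-of-tree configure; the source path is a placeholder.
cmake -DTensile_LIBRARY_FORMAT=msgpack \
      -DTensile_CPU_THREADS=8 \
      /path/to/rocBLAS
make -j"$jobs"
```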
As for the runtime memory use, can you paste that into a new issue, please? I would hope a lot of that memory use is just memory-mapped file use that can be reclaimed by the OS if required. Do you know whether Julia reports MemFree or MemAvailable from `/proc/meminfo`? There should be a reduction in this memory allocation (virtual and real) in the next rocBLAS release, which is why I would like to keep that topic going in a new issue.
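For background on that question: on Linux, `MemFree` excludes reclaimable page cache, while `MemAvailable` estimates how much memory could be obtained without swapping, so memory-mapped files inflate the former far more than the latter. A quick way to compare the two fields (assumes a Linux `/proc/meminfo`):

```shell
# Print MemFree and MemAvailable in GiB (meminfo values are in kiB).
awk '/^MemFree:|^MemAvailable:/ { printf "%s %.2f GiB\n", $1, $2 / 1048576 }' /proc/meminfo
```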
Hey @TorreZuk. Just FYI, GitHub introduced LaTeX math a few months ago, so you may need to use backticks when you write `$`. Any pair of dollar signs might otherwise be interpreted as a math block.
Thanks for the reminder, @cgmb. I'll try to remember to preview whenever pasting.
@pxl-th I hope you have managed to proceed. I'll close this issue as it has been a month but feel free to reopen if you have further questions on this topic. As mentioned future releases will reduce the RAM allocations required at runtime, but this can be made into a new issue if you desire. Thanks.
Yes, it is working fine. I've also found the time to update the recipe for BinaryBuilder and make rocBLAS use msgpack instead of YAML. It didn't solve the OOM error on musl during the build, but it did significantly improve memory consumption (that initial spike) during actual usage. Thanks for the help!
With msgpack:
```julia
julia> using AMDGPU
julia> to_gb(x) = x / (1024^3)
julia> get_used_memory() = to_gb(Sys.total_physical_memory() - Sys.free_physical_memory())
julia> x = AMDGPU.rand(Float32, 16, 16);
julia> y = AMDGPU.rand(Float32, 16, 16);
julia> b = AMDGPU.rand(Float32, 16, 16);
julia> get_used_memory()
4.810050964355469
julia> AMDGPU.rocBLAS.gemm!('N', 'N', 1f0, x, b, 0f0, y);
julia> get_used_memory()
7.433422088623047
```
With yaml (taken from the comment above):
```julia
julia> get_used_memory()
6.750175476074219
julia> AMDGPU.rocBLAS.gemm!('N', 'N', 1f0, x, b, 0f0, y);
julia> get_used_memory()
15.572715759277344
```
@pxl-th thanks for the update. I expect the memory spike will reduce further with later releases.
When building rocBLAS (ROCm 5.2.3) with `AMDGPU_TARGETS="all"` on `musl`, it fails with the error below. I tried reducing the number of threads, but that didn't help (although I haven't tried with only 1 thread). I tried building against only one target and it succeeded on `musl`. Building on `glibc`, however, succeeds using all available threads when building against all targets.

Error: