ROCm / rocBLAS

Next generation BLAS implementation for ROCm platform
https://rocm.docs.amd.com/projects/rocBLAS/en/latest/

LLVM OOM when building "all" targets #1266

Closed pxl-th closed 1 year ago

pxl-th commented 2 years ago

When building rocBLAS (ROCm 5.2.3) with AMDGPU_TARGETS="all" on musl, it fails with the error shown below. I tried reducing the number of threads, but that didn't help (although I haven't tried with only one thread).

Building against only a single target succeeds on musl. Building on glibc succeeds with all available threads even when building against all targets.
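
For reference, a rough sketch of how such a build might be configured outside of BinaryBuilder (everything except AMDGPU_TARGETS is a placeholder here, not the actual recipe; hipcc is assumed to be on PATH):

# illustrative only; the real build is driven by the BinaryBuilder recipe
$ cmake -S rocBLAS-rocm-5.2.3 -B build \
      -DCMAKE_CXX_COMPILER=hipcc \
      -DAMDGPU_TARGETS="all" \
      -DCMAKE_BUILD_TYPE=Release
$ cmake --build build --parallel "$(nproc)"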

Error:

[11:55:53] Compiling source kernels: Launching 128 threads...
[11:55:53] hipcc-cmd: /workspace/destdir/llvm/bin/clang++  -std=c++11 -isystem "/workspace/x86_64-linux-musl-libgfortran4-cxx11/destdir/llvm/lib/clang/14.0.0/include/.." -Xclang -fallow-half-arguments-and-returns -D__HIP_HCC_COMPAT_MODE__=1 -isystem /workspace/destdir/include -isystem "/workspace/destdir/hip/include" --offload-arch=gfx900 --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx90a:xnack- -O3 -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false  -isystem /opt/x86_64-linux-musl/lib/gcc/x86_64-linux-musl/*/include -isystem /opt/x86_64-linux-musl/x86_64-linux-musl/include/c++/* -isystem /opt/x86_64-linux-musl/x86_64-linux-musl/include/c++/*/x86_64-linux-musl --sysroot=/opt/x86_64-linux-musl/x86_64-linux-musl/sys-root  --cuda-device-only -D__HIP_HCC_COMPAT_MODE__=1 -I /workspace/srcdir/rocBLAS-rocm-5.2.3/build/Tensile -x hip /workspace/srcdir/rocBLAS-rocm-5.2.3/build/Tensile/Kernels.cpp -c -o "/workspace/srcdir/rocBLAS-rocm-5.2.3/build/library/src/build_tmp/TENSILE/code_object_tmp/Kernels.o"
[11:58:07] LLVM ERROR: out of memory
[11:58:07] Allocation failed
[11:58:07] PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
[11:58:07] Stack dump:
[11:58:07] 0.   Program arguments: /workspace/x86_64-linux-musl-libgfortran4-cxx11/artifacts/1ee90c071f8f73a82531334384e428b453d566b1/llvm/bin/clang-14 -cc1 -triple amdgcn-amd-amdhsa -aux-triple x86_64-unknown-linux-musl -emit-obj --mrelax-relocations -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name Kernels.cpp -mrelocation-model pic -pic-level 1 -fhalf-no-semantic-interposition -mframe-pointer=none -fno-rounding-math -mconstructor-aliases -aux-target-cpu x86-64 -fcuda-is-device -mllvm -amdgpu-internalize-symbols -fcuda-allow-variadic-functions -fvisibility hidden -fapply-global-visibility-to-externs -mlink-builtin-bitcode /workspace/destdir/amdgcn/bitcode/hip.bc -mlink-builtin-bitcode /workspace/destdir/amdgcn/bitcode/ocml.bc -mlink-builtin-bitcode /workspace/destdir/amdgcn/bitcode/ockl.bc -mlink-builtin-bitcode /workspace/destdir/amdgcn/bitcode/oclc_daz_opt_off.bc -mlink-builtin-bitcode /workspace/destdir/amdgcn/bitcode/oclc_unsafe_math_off.bc -mlink-builtin-bitcode /workspace/destdir/amdgcn/bitcode/oclc_finite_only_off.bc -mlink-builtin-bitcode /workspace/destdir/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc -mlink-builtin-bitcode /workspace/destdir/amdgcn/bitcode/oclc_wavefrontsize64_on.bc -mlink-builtin-bitcode /workspace/destdir/amdgcn/bitcode/oclc_isa_version_900.bc -mlink-builtin-bitcode /workspace/destdir/amdgcn/bitcode/oclc_abi_version_400.bc -target-cpu gfx900 -mllvm -treat-scalable-fixed-error-as-warning -debugger-tuning=gdb -resource-dir /workspace/x86_64-linux-musl-libgfortran4-cxx11/artifacts/1ee90c071f8f73a82531334384e428b453d566b1/llvm/lib/clang/14.0.0 -internal-isystem /workspace/x86_64-linux-musl-libgfortran4-cxx11/artifacts/1ee90c071f8f73a82531334384e428b453d566b1/llvm/lib/clang/14.0.0 -internal-isystem /workspace/destdir/include -isystem /workspace/x86_64-linux-musl-libgfortran4-cxx11/destdir/llvm/lib/clang/14.0.0/include/.. 
-isystem /workspace/destdir/include -isystem /workspace/destdir/hip/include -isystem /opt/x86_64-linux-musl/lib/gcc/x86_64-linux-musl/7.1.0/include -isystem /opt/x86_64-linux-musl/x86_64-linux-musl/include/c++/7.1.0 -isystem /opt/x86_64-linux-musl/x86_64-linux-musl/include/c++/7.1.0/x86_64-linux-musl -D __HIP_HCC_COMPAT_MODE__=1 -D __HIP_HCC_COMPAT_MODE__=1 -I /workspace/srcdir/rocBLAS-rocm-5.2.3/build/Tensile -isysroot /opt/x86_64-linux-musl/x86_64-linux-musl/sys-root -internal-isystem /opt/x86_64-linux-musl/x86_64-linux-musl/sys-root/usr/local/include -internal-externc-isystem /opt/x86_64-linux-musl/x86_64-linux-musl/sys-root/include -internal-externc-isystem /opt/x86_64-linux-musl/x86_64-linux-musl/sys-root/usr/include -internal-isystem /workspace/x86_64-linux-musl-libgfortran4-cxx11/artifacts/1ee90c071f8f73a82531334384e428b453d566b1/llvm/lib/clang/14.0.0/include -internal-isystem /opt/x86_64-linux-musl/x86_64-linux-musl/sys-root/usr/local/include -internal-externc-isystem /opt/x86_64-linux-musl/x86_64-linux-musl/sys-root/include -internal-externc-isystem /opt/x86_64-linux-musl/x86_64-linux-musl/sys-root/usr/include -internal-isystem /workspace/x86_64-linux-musl-libgfortran4-cxx11/artifacts/1ee90c071f8f73a82531334384e428b453d566b1/llvm/lib/clang/14.0.0/include -O3 -std=c++11 -fdeprecated-macro -fno-autolink -fdebug-compilation-dir=/workspace/srcdir/rocBLAS-rocm-5.2.3/build/library/src -ferror-limit 19 -fhip-new-launch-api -fgnuc-version=4.2.1 -fcxx-exceptions -fexceptions -vectorize-loops -vectorize-slp -fallow-half-arguments-and-returns -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false -cuid=75a338dd8edffa3e -fcuda-allow-variadic-functions -faddrsig -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o /tmp/Kernels-0d9c5a/Kernels-gfx900.o -x hip /workspace/srcdir/rocBLAS-rocm-5.2.3/build/Tensile/Kernels.cpp
[11:58:07] 1.   <eof> parser at end of file
[11:58:07] 2.   Optimizer
[11:58:07] clang-14: error: unable to execute command: Aborted
[11:58:07] clang-14: error: clang frontend command failed due to signal (use -v to see invocation)
[11:58:07] clang version 14.0.0 (/depot/downloads/clones/llvm-project-3923fe0457d4c7fbe3a6ac9017be1736c62d6e6ea552841faead20f6bff134d6 d6c88e5a78066d5d7a1e8db6c5e3e9884c6ad10e)
[11:58:07] Target: x86_64-unknown-linux-musl
[11:58:07] Thread model: posix
[11:58:07] InstalledDir: /workspace/destdir/llvm/bin
[11:58:16] clang-14: note: diagnostic msg: 
[11:58:16] ********************
[11:58:16] 
[11:58:16] PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
[11:58:16] Preprocessed source(s) and associated run script(s) are located at:
[11:58:16] clang-14: note: diagnostic msg: /tmp/Kernels-d4025d/Kernels-gfx900.cu
[11:58:16] clang-14: note: diagnostic msg: /tmp/Kernels-5f05b3/Kernels-gfx906:xnack-.cu
[11:58:16] clang-14: note: diagnostic msg: /tmp/Kernels-72413f/Kernels-gfx908:xnack-.cu
[11:58:16] clang-14: note: diagnostic msg: /tmp/Kernels-5365e9/Kernels-gfx90a:xnack+.cu
[11:58:16] clang-14: note: diagnostic msg: /tmp/Kernels-c9f4fb/Kernels-gfx90a:xnack-.cu
[11:58:16] clang-14: note: diagnostic msg: /tmp/Kernels-d4025d/Kernels-gfx900.sh
TorreZuk commented 2 years ago

Hello, I think we need to know more about your environment and configuration. You said you tried reducing the threads; was this with the install.sh argument --jobs? What number did you try before it still failed? The log shows "Launching 128 threads", but it should show a reduced number. Or are you using other environment-variable compiler controls for hipcc? If you call cmake directly rather than install.sh, you can pass -DTensile_CPU_THREADS=4 or similar.
Otherwise, please provide some more environment info; see the new bug issue template for suggestions. What does "ulimit -a" report? LLVM could be constrained in some other way. Do you have >= 64 GB of RAM? Is this a custom LLVM build you are using? You could also try: export HIPCC_COMPILE_FLAGS_APPEND="-parallel-jobs=1"
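
For concreteness, the knobs above could be combined roughly like this (a sketch only; the values are examples, not recommendations):

# limit hipcc's own internal parallelism
$ export HIPCC_COMPILE_FLAGS_APPEND="-parallel-jobs=1"

# either cap the build jobs through install.sh ...
$ ./install.sh --jobs 4

# ... or pass the thread cap when calling cmake directly
$ cmake .. -DTensile_CPU_THREADS=4

# and check whether the build environment itself imposes limits
$ ulimit -a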

pxl-th commented 2 years ago

Sorry for the lack of info.

Here's the full recipe used by BinaryBuilder: link.

I tried the build on two machines.

When building with glibc, it builds fine on both machines using all available threads, without OOM. When building with musl, it OOMs. I've tried reducing the number of threads using various parameters, including -DTensile_CPU_THREADS, but that didn't help. The lowest I've tried is 4 CPU threads; I can still try with 1 thread, but the full build takes a long time, so I haven't gotten to it yet.

This happens only when building all targets. If I specify a concrete target, it builds fine. So maybe for musl we really do need > 64 GB of RAM for some reason...

pxl-th commented 2 years ago

On the note of RAM consumption: when rocBLAS is built with all amdgpu targets and the first call to rocblas_sgemm (for example) happens, the amount of RAM in use rapidly increases by ~9-10 GB.

I was wondering whether this is expected behaviour. If I compile rocBLAS for a single target, like gfx1030, the increase in RAM consumption is much smaller (~1 GB).

Here's a concrete example from the Julia language. AMDGPU.rocBLAS.gemm! is a very thin wrapper around ccall, so the allocations do not come from Julia.

$ HSA_OVERRIDE_GFX_VERSION=10.3.0 julia --project=.

julia> using AMDGPU
julia> to_gb(x) = x / (1024^3)
julia> get_used_memory() = to_gb(Sys.total_physical_memory() - Sys.free_physical_memory())

julia> x = AMDGPU.rand(Float32, 16, 16);
julia> y = AMDGPU.rand(Float32, 16, 16);
julia> b = AMDGPU.rand(Float32, 16, 16);

julia> get_used_memory()
6.750175476074219

julia> AMDGPU.rocBLAS.gemm!('N', 'N', 1f0, x, b, 0f0, y);

julia> get_used_memory()
15.572715759277344
TorreZuk commented 2 years ago

Thanks for the feedback and the recipe. Can you instead build with -DTensile_LIBRARY_FORMAT=msgpack, which has been the default for a while? It could be that parallel processing of yaml, which is not the default format, is spiking memory use during compilation. Also, can you cap the parallel build: instead of $(nproc), use at most 16. I am not certain where the OOM happened, so you could also build with a verbose compile flag to make it clearer. But still use e.g. -DTensile_CPU_THREADS=8. Also, if you can, report the memory use of a single clang process (top) just before the OOM, so I can see whether it is around 2 GB per instance or more; you would then divide your RAM by that figure to find a process/thread count to use.
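
A sketch of the suggested configure change, plus one way to check per-clang memory while the source kernels compile (assumes a procps-style ps on the build host; RSS is reported in KB):

# switch the Tensile library format and cap the Tensile threads
$ cmake .. -DTensile_LIBRARY_FORMAT=msgpack -DTensile_CPU_THREADS=8

# in another shell, shortly before the OOM, list resident memory per clang instance
$ ps -C clang-14 -o pid,rss,args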

As for the runtime memory use, can you paste that into a new issue please? I would hope a lot of that memory use is just memory-mapped file use that can be reclaimed by the OS if required. Do you know whether Julia reports MemFree or MemAvailable from /proc/meminfo? There should be a reduction in this memory allocation (virtual and real) in the next rocBLAS release, which is why I would like to keep that topic going in a new issue.
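
As a quick cross-check of what the Julia numbers measure, the kernel's own counters can be read directly (Linux only). MemAvailable also counts reclaimable page cache, so "used" memory derived from MemFree looks much larger when big files are memory-mapped:

# MemFree = truly free pages; MemAvailable = free plus reclaimable (e.g. mapped library files)
$ grep -E '^(MemTotal|MemFree|MemAvailable):' /proc/meminfo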

cgmb commented 2 years ago

Hey @TorreZuk. Just FYI, GitHub introduced LaTeX math a few months ago, so you may need to use backticks when you write $. Any pairs of dollar signs might otherwise be interpreted as math blocks.

TorreZuk commented 2 years ago

> Hey @TorreZuk. Just FYI, GitHub introduced LaTeX math a few months ago, so you may need to use backticks when you write $. Any pairs of dollar signs might otherwise be interpreted as math blocks.

Thanks for the reminder @cgmb I'll try to remember to preview whenever pasting.

TorreZuk commented 1 year ago

@pxl-th I hope you have managed to proceed. I'll close this issue as it has been a month but feel free to reopen if you have further questions on this topic. As mentioned future releases will reduce the RAM allocations required at runtime, but this can be made into a new issue if you desire. Thanks.

pxl-th commented 1 year ago

> @pxl-th I hope you have managed to proceed. I'll close this issue as it has been a month but feel free to reopen if you have further questions on this topic. As mentioned future releases will reduce the RAM allocations required at runtime, but this can be made into a new issue if you desire. Thanks.

Yes, it is working fine. I've also found the time to update the recipe for BinaryBuilder and make rocBLAS use msgpack instead of yaml. It didn't solve the OOM error on musl during the build, but it did significantly reduce memory consumption (that initial spike) during actual usage. Thanks for the help!

With msgpack:

julia> using AMDGPU
julia> to_gb(x) = x / (1024^3)
julia> get_used_memory() = to_gb(Sys.total_physical_memory() - Sys.free_physical_memory())

julia> x = AMDGPU.rand(Float32, 16, 16);
julia> y = AMDGPU.rand(Float32, 16, 16);
julia> b = AMDGPU.rand(Float32, 16, 16);

julia> get_used_memory()
4.810050964355469

julia> AMDGPU.rocBLAS.gemm!('N', 'N', 1f0, x, b, 0f0, y);

julia> get_used_memory()
7.433422088623047

With yaml (taken from the comment above):

julia> get_used_memory()
6.750175476074219

julia> AMDGPU.rocBLAS.gemm!('N', 'N', 1f0, x, b, 0f0, y);

julia> get_used_memory()
15.572715759277344
TorreZuk commented 1 year ago

@pxl-th thanks for the update. I expect the memory spike will reduce further with later releases.