AMReX-Codes / amrex

AMReX: Software Framework for Block Structured AMR
https://amrex-codes.github.io/amrex

[HIP] GPU ASAN reports buffer overflow in amrex::InitRandom #3623

Closed by BenWibking 11 months ago

BenWibking commented 11 months ago

AMDGPU ASAN (included in ROCm 5.7+) reports a buffer overflow triggered within amrex::InitRandom at this GPU kernel: https://github.com/AMReX-Codes/amrex/blob/d36463103daed09a40cdea235041a6ab79ff280c/Src/Base/AMReX_Random.cpp#L54

bwibking@moth:~/Microphysics/unit_test/burn_cell> ./main3d.hip.HIP.ex inputs_vode_example
Initializing AMReX (23.11-5-gd36463103dae)...
Initializing HIP...
HIP initialized with 1 device.
=================================================================
==1548068==ERROR: AddressSanitizer: global-buffer-overflow on address 0x0000020f0b48 at pc 0x7f5b9dea5ea7 bp 0x7ffe0c5b6de0 sp 0x7ffe0c5b65a0
READ of size 32 at 0x0000020f0b48 thread T0
    #0 0x7f5b9dea5ea6 in __interceptor_memcpy (/opt/rocm-5.7.0/llvm/lib/clang/17.0.0/lib/linux/libclang_rt.asan-x86_64.so+0xa5ea6) (BuildId: e2f6676d7d0ade0de2c4ac32fa5856892b18b70a)
    #1 0x7f5b997440a9  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x3440a9) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #2 0x7f5b997462f6  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x3462f6) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #3 0x7f5b997465a6  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x3465a6) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #4 0x7f5b99712434  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x312434) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #5 0x7f5b996dcc53  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x2dcc53) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #6 0x7f5b995835e9  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x1835e9) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #7 0x7f5b99489c0e  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x89c0e) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #8 0x7f5b995e650e  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x1e650e) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #9 0x7f5b99610bd9  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x210bd9) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #10 0x7f5b995e6f91  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x1e6f91) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #11 0x7f5b995f13e7 in hipLaunchKernel (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x1f13e7) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #12 0xa49041 in std::enable_if<MaybeDeviceRunnable<(anonymous namespace)::ResizeRandomSeed(unsigned long)::'lambda'(int)>::value, void>::type amrex::ParallelFor<256, int, (anonymous namespace)::ResizeRandomSeed(unsigned long)::'lambda'(int), void>(amrex::Gpu::KernelInfo const&, int, (anonymous namespace)::ResizeRandomSeed(unsigned long)::'lambda'(int)&&) /home/bwibking/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H:878:5
    #13 0xa49041 in void amrex::ParallelFor<int, (anonymous namespace)::ResizeRandomSeed(unsigned long)::'lambda'(int), void>(int, (anonymous namespace)::ResizeRandomSeed(unsigned long)::'lambda'(int)&&) /home/bwibking/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H:1457:5
    #14 0xa49041 in (anonymous namespace)::ResizeRandomSeed(unsigned long) /home/bwibking/amrex/Src/Base/AMReX_Random.cpp:54:5
    #15 0xa49041 in amrex::InitRandom(unsigned long, int, unsigned long) /home/bwibking/amrex/Src/Base/AMReX_Random.cpp:104:5
    #16 0x987586 in amrex::Initialize(int&, char**&, bool, int, std::function<void ()> const&, std::ostream&, std::ostream&, void (*)(char const*)) /home/bwibking/amrex/Src/Base/AMReX.cpp:625:5
    #17 0x908243 in main /home/bwibking/Microphysics/unit_test/burn_cell/main.cpp:19:3
    #18 0x7f5b98c3feaf in __libc_start_call_main (/lib64/libc.so.6+0x3feaf) (BuildId: b39d468aead6d9ede227751ffe093da287488648)
    #19 0x7f5b98c3ff5f in __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x3ff5f) (BuildId: b39d468aead6d9ede227751ffe093da287488648)
    #20 0x8dc8c4 in _start (/home/bwibking/Microphysics/unit_test/burn_cell/main3d.hip.HIP.ex+0x8dc8c4)

0x0000020f0b48 is located 56 bytes before global variable 'helmholtz::itmax' defined in '../../EOS/helmholtz/actual_eos_data.cpp' (0x20f0b80) of size 8
0x0000020f0b48 is located 24 bytes before global variable 'helmholtz::input_is_constant' defined in '../../EOS/helmholtz/actual_eos_data.cpp' (0x20f0b60) of size 8
0x0000020f0b48 is located 0 bytes after global variable 'helmholtz::do_coulomb' defined in '../../EOS/helmholtz/actual_eos_data.cpp' (0x20f0b40) of size 8
SUMMARY: AddressSanitizer: global-buffer-overflow (/opt/rocm-5.7.0/llvm/lib/clang/17.0.0/lib/linux/libclang_rt.asan-x86_64.so+0xa5ea6) (BuildId: e2f6676d7d0ade0de2c4ac32fa5856892b18b70a) in __interceptor_memcpy
Shadow bytes around the buggy address:
  0x0000020f0880: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9
  0x0000020f0900: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9
  0x0000020f0980: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9
  0x0000020f0a00: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9
  0x0000020f0a80: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9
=>0x0000020f0b00: f9 f9 f9 f9 f9 f9 f9 f9 00[f9]f9 f9 00 f9 f9 f9
  0x0000020f0b80: 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9
  0x0000020f0c00: 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9
  0x0000020f0c80: 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9
  0x0000020f0d00: 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9
  0x0000020f0d80: 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==1548068==ABORTING
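
The offsets ASAN prints for the neighbouring globals can be checked by hand from the addresses in the report. The sketch below is only that arithmetic, not part of AMReX or the sanitizer. The helper names `describe` and `shadow_index` are hypothetical, and it assumes that the row labels in the shadow dump are application addresses (which is consistent with rows of 16 shadow bytes being 0x80 apart at 8 application bytes per shadow byte).

```python
# Addresses and sizes copied from the ASAN report above.
BAD_ADDR = 0x20F0B48
GLOBALS = {
    "helmholtz::do_coulomb": (0x20F0B40, 8),
    "helmholtz::input_is_constant": (0x20F0B60, 8),
    "helmholtz::itmax": (0x20F0B80, 8),
}

# From the shadow byte legend: one shadow byte covers 8 application bytes.
SHADOW_GRANULE = 8

# The row marked "=>" in the dump, with the brackets around the bad byte removed.
# Assumption: the row label 0x20f0b00 is an application address.
ROW_BASE = 0x20F0B00
ROW = "f9 f9 f9 f9 f9 f9 f9 f9 00 f9 f9 f9 00 f9 f9 f9".split()


def describe(addr, start, size):
    """Describe where addr lies relative to the object [start, start + size)."""
    end = start + size
    if addr < start:
        return f"{start - addr} bytes before"
    if addr >= end:
        return f"{addr - end} bytes after"
    return f"{addr - start} bytes inside"


def shadow_index(addr, row_base=ROW_BASE):
    """Index of the shadow byte covering addr within the row starting at row_base."""
    return (addr - row_base) // SHADOW_GRANULE


if __name__ == "__main__":
    for name, (start, size) in GLOBALS.items():
        print(f"{BAD_ADDR:#x} is {describe(BAD_ADDR, start, size)} {name}")
    idx = shadow_index(BAD_ADDR)
    print(f"shadow byte {idx} in the marked row is {ROW[idx]}")
```

Under these assumptions the faulting address falls on the global redzone (`f9`) immediately after `helmholtz::do_coulomb`, and the two `00` bytes in the marked row line up with `do_coulomb` and `input_is_constant`, matching the three "is located" lines in the report.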

The above can be reproduced with:

git clone https://github.com/AMReX-Astro/Microphysics.git
cd Microphysics/unit_test/burn_cell
export AMREX_HOME=/path/to/amrex
export AMREX_AMD_ARCH=gfx90a:xnack+
export HSA_XNACK=1
export LD_LIBRARY_PATH=/opt/rocm-5.7.0/llvm/lib/clang/17.0.0/lib/linux:$LD_LIBRARY_PATH
make USE_HIP=TRUE CXXFLAGS="-std=c++17 -m64 -fgpu-rdc --offload-arch=gfx90a:xnack+ -pthread -g -O3 -munsafe-fp-atomics -fsanitize=address -shared-libsan" LDFLAGS="-fsanitize=address -shared-libsan" -j16
./main3d.hip.HIP.ex inputs_vode_example

This should compile and link in under 5 minutes.

I can reproduce this with both the Microphysics and Quokka unit tests but, very strangely, not with the AMReX tests. Both codes have also crashed with unexplained memory errors under ROCm 5.7, which might be related to this.

@WeiqunZhang @zingale

BenWibking commented 11 months ago

For reference, both Castro and Quokka crash with memory errors like this in production sims with ROCm 5.7.0:

Memory access fault by GPU node-8 (Agent handle: 0x2975b60) on address 0x800033773000. Reason: Unknown.

BenWibking commented 11 months ago

Following @zingale's suggestion, I commented out the line that calls amrex::InitRandom, and the tests run without any errors from ASAN (other than unrelated memory leaks).

diff --git a/Src/Base/AMReX.cpp b/Src/Base/AMReX.cpp
index 4449dab19..0d6fe8138 100644
--- a/Src/Base/AMReX.cpp
+++ b/Src/Base/AMReX.cpp
@@ -622,7 +622,7 @@ amrex::Initialize (int& argc, char**& argv, bool build_parm_parse,
     //
     // Initialize random seed after we're running in parallel.
     //
-    amrex::InitRandom(ParallelDescriptor::MyProc()+1, ParallelDescriptor::NProcs());
+    //amrex::InitRandom(ParallelDescriptor::MyProc()+1, ParallelDescriptor::NProcs());

     // For thread safety, we should do these initializations here.
     BaseFab_Initialize();

WeiqunZhang commented 11 months ago

I could not see anything wrong with the code in amrex::InitRandom. This might be a false positive or a bug in xnack.

BenWibking commented 11 months ago

This is very puzzling. Is there a way to manually examine HIP device symbols in the binary, like nm for host code?

BenWibking commented 11 months ago

I'll close this and move it to Microphysics.