Closed BenWibking closed 8 months ago
I seem to recall seeing different behaviors when compiled in debug mode, which made me suspect it is a compiler issue.
Ah, that's an interesting clue. @psharda you were going to try this, right? Did it ever finish building?
Here's a simple test that generates a memory issue with ROCm 5.7.0:
module load cpe/23.09
module load rocm/5.7.0
module load PrgEnv-gnu craype-accel-amd-gfx90a cray-mpich
cd Microphysics/unit_test/test_react
make NETWORK_DIR=subch_simple USE_HIP=TRUE COMP=gnu -j 4
Then run on a single GPU, using the inputs_aprox13 inputs file.
The output is:
Initializing AMReX (23.12-11-g064db4eaa599)...
Initializing HIP...
HIP initialized with 1 device.
AMReX (23.12-11-g064db4eaa599) initialized
reading extern runtime parameters ...
reading in network electron-capture / beta-decay tables...
Memory access fault by GPU node-4 (Agent handle: 0x1f677b0) on address 0x7fffd6ce5000. Reason: Unknown.
SIGABRT
See Backtrace.0 file for details
srun: error: frontier05193: task 0: Exited with exit code 1
srun: Terminating StepId=1533275.0
I hesitate to ask... does this compile in less than a Hubble time in debug mode?
with DEBUG=TRUE, I get:
:0:rocdevice.cpp :2692: 719740276446 us: [pid:85303 tid:0x7fffde461700] Callback: Queue 0x7ffeaba00000 aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29
and this runs fine with ROCm 5.3.0
test_react appears to work fine with ROCm 5.4.0
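For comparing runs across ROCm versions, a small helper that scans a saved log for the two failure signatures quoted above might be handy. This is only a sketch: `classify_log` is a hypothetical helper (not part of AMReX or Microphysics), and it only knows about the two messages seen so far in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a captured run log by the failure
# signatures quoted in this thread. Not part of AMReX or Microphysics.
classify_log() {
    local log="$1"
    if grep -qF "Memory access fault by GPU" "$log"; then
        # Signature from the optimized build under ROCm 5.7.0
        echo "memory-access-fault"
    elif grep -qF "HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION" "$log"; then
        # Signature from the DEBUG=TRUE build
        echo "aperture-violation"
    else
        # Neither message found; this does not prove the run was correct
        echo "no-known-signature"
    fi
}
```

Usage would be something like `classify_log run.log`, where `run.log` (an assumed name) holds the captured stdout and stderr of one run. A result of `no-known-signature` only means neither message appeared, not that the answer is right.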
with rocgdb, I get:
guration: Returned hipSuccess :
:3:hip_module.cpp :678 : 298664095624 us: [pid:16441 tid:0x7fffed9cda80] hipLaunchKernel ( 0x221e30, {4,1,1}, {256,1,1}, 0x7fffffff3a10, 0, stream:0x87d36a0 )
:3:rocvirtual.cpp :783 : 298664095630 us: [pid:16441 tid:0x7fffed9cda80] Arg0: = val:140648402845968
:3:rocvirtual.cpp :2897: 298664095632 us: [pid:16441 tid:0x7fffed9cda80] ShaderName : _ZN5amrex13launch_globalILi256EZNS_6launchILi256EZNS_9ReduceOpsIJNS_11ReduceOpMaxEEE4evalINS_10ReduceDataIJNS_10ValLocPairIi6burn_tEEEEEZNS4_4evalINS_8FabArrayINS_9FArrayBoxEEESA_Z9main_mainvEUliiiiE_EENSt9enable_ifIXaasr10IsFabArrayIT_EE5valuesr10IsCallableIT1_iiiiEE5valueEvE4typeERKSH_RKNS_7IntVectERT0_OSI_EUliiiE_EEvRKNS_3BoxERSH_OSQ_EUlvE_EEvimP12ihipStream_tSY_EUlvE_EEvSQ_.intern.14460905eb7cb0a1
:3:hip_module.cpp :679 : 298664095639 us: [pid:16441 tid:0x7fffed9cda80] hipLaunchKernel: Returned hipSuccess :
:3:hip_error.cpp :27 : 298664095641 us: [pid:16441 tid:0x7fffed9cda80] hipGetLastError ( )
:3:hip_error.cpp :27 : 298664095644 us: [pid:16441 tid:0x7fffed9cda80] hipGetLastError ( )
:3:hip_stream.cpp :451 : 298664095648 us: [pid:16441 tid:0x7fffed9cda80] hipStreamSynchronize ( stream:0x87d36a0 )
:3:rocdevice.cpp :2651: 298664095650 us: [pid:16441 tid:0x7fffed9cda80] No HW event
:3:rocvirtual.hpp :67 : 298664095653 us: [pid:16441 tid:0x7fffed9cda80] Host active wait for Signal = (0x7fffcbee4000) for -1 ns
Memory access fault by GPU node-4 (Agent handle: 0x42bfbc0) on address 0x7ff7e03d5000. Reason: Unknown.
Thread 2 "main3d.hip.x86-" hit Breakpoint 1, 0x00007fffe80f81de in abort () from /lib64/libc.so.6
(gdb) interrupt
(gdb)
Thread 1 "main3d.hip.x86-" stopped.
0x00007fffdf6769f9 in ?? () from /opt/rocm-5.7.0/lib/libhsa-runtime64.so.1
bt
#0 0x00007fffdf6769f9 in ?? () from /opt/rocm-5.7.0/lib/libhsa-runtime64.so.1
#1 0x00007fffdf67684a in ?? () from /opt/rocm-5.7.0/lib/libhsa-runtime64.so.1
#2 0x00007fffdf669fa9 in ?? () from /opt/rocm-5.7.0/lib/libhsa-runtime64.so.1
#3 0x00007fffe9305793 in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#4 0x00007fffe92fc318 in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#5 0x00007fffe92ffcbf in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#6 0x00007fffe9301a03 in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#7 0x00007fffe92ff225 in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#8 0x00007fffe92d330b in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#9 0x00007fffe92d3920 in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#10 0x00007fffe92d39cc in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#11 0x00007fffe92d6b28 in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#12 0x00007fffe9239503 in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#13 0x00007fffe923992c in hipStreamSynchronize () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#14 0x0000000002f265c6 in amrex::Gpu::Device::streamSynchronize ()
at /ccs/home/zingale/amrex/Src/Base/AMReX_GpuDevice.cpp:613
#15 0x0000000002fa45ec in amrex::Gpu::streamSynchronize ()
at /ccs/home/zingale/amrex/Src/Base/AMReX_GpuDevice.H:241
#16 amrex::MFIter::Finalize (this=0x7fffffff3a70)
at /ccs/home/zingale/amrex/Src/Base/AMReX_MFIter.cpp:242
#17 0x0000000002fa456c in amrex::MFIter::~MFIter (this=0x4292690)
at /ccs/home/zingale/amrex/Src/Base/AMReX_MFIter.cpp:212
#18 0x0000000002ea2205 in amrex::ReduceOps<amrex::ReduceOpMax>::eval<amrex::FabArray<amrex::FArrayBox>, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >, main_main()::{lambda(int, int, int, int)#1}>(amrex::FabArray<amrex::FArrayBox> const&, amrex::IntVect const&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&, main_main()::{lambda(int, int, int, int)#1}&&) (this=<optimized out>, mf=...,
nghost=..., reduce_data=..., f=...) at /ccs/home/zingale/amrex/Src/Base/AMReX_Reduce.H:453
#19 amrex::ParReduce<amrex::ReduceOpMax, amrex::ValLocPair<int, burn_t>, amrex::FArrayBox, main_main()::{lambda(int, int, int, int)#1}, void>(amrex::TypeList<amrex::ReduceOpMax>, amrex::TypeList<amrex::ValLocPair<int, burn_t> >, amrex::FabArray<amrex::FArrayBox> const&, amrex::IntVect const&, main_main()::{lambda(int, int, int, int)#1}&&) (fa=..., nghost=..., operation_list=..., type_list=...,
f=...) at /ccs/home/zingale/amrex/Src/Base/AMReX_ParReduce.H:103
#20 amrex::ParReduce<amrex::ReduceOpMax, amrex::ValLocPair<int, burn_t>, amrex::FArrayBox, main_main()::{lambda(int, int, int, int)#1}, void>(amrex::TypeList<amrex::ReduceOpMax>, amrex::TypeList<amrex::ValLocPair<int, burn_t> >, amrex::FabArray<amrex::FArrayBox> const&, main_main()::{lambda(int, int, int, int)#1}&&) (fa=..., operation_list=..., type_list=..., f=...)
at /ccs/home/zingale/amrex/Src/Base/AMReX_ParReduce.H:288
#21 main_main () at main.cpp:203
#22 0x0000000002ea0d41 in main (argc=<optimized out>, argv=<optimized out>) at main.cpp:26
Here's a backtrace from inside a thread:
#0 0x00007ff7b0c54630 in dgesl<23> (a1=..., pivot1=..., b1=...) at ../../util/linpack.H:24
#1 dvnlsd<amrex::Array1D<short, 1, 23>, burn_t, dvode_t<23> > (pivot=..., NFLAG=<optimized out>, state=..., vstate=...) at ../../integration/VODE/vode_dvnlsd.H:117
#2 dvstep<burn_t, dvode_t<23> > (state=..., vstate=...) at ../../integration/VODE/vode_dvstep.H:177
#3 dvode<burn_t, dvode_t<23> > (state=..., vstate=...) at ../../integration/VODE/vode_dvode.H:186
#4 actual_integrator<burn_t> (state=..., dt=<optimized out>) at ../../integration/VODE/actual_integrator.H:88
#5 integrator<burn_t> (state=..., dt=<optimized out>) at ../../integration/integrator.H:14
#6 burner<burn_t> (state=..., dt=<optimized out>) at ../../interfaces/burner.H:92
#7 do_react (i=<optimized out>, j=<optimized out>, k=<optimized out>, state=..., burn_state=..., n_rhs=..., p=...) at ./react_zones.H:49
#8 main_main()::{lambda(int, int, int, int)#1}::operator()(int, int, int, int) const (this=<optimized out>, box_no=<optimized out>, i=<optimized out>, j=<optimized out>, k=<optimized out>) at main.cpp:211
#9 amrex::ReduceOps<amrex::ReduceOpMax>::eval<amrex::FabArray<amrex::FArrayBox>, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >, main_main()::{lambda(int, int, int, int)#1}>(amrex::FabArray<amrex::FArrayBox> const&, amrex::IntVect const&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&, main_main()::{lambda(int, int, int, int)#1}&&)::{lambda(int, int, int)#1}::operator()(int, int, int) const (this=<optimized out>, i=<optimized out>, j=<optimized out>, k=<optimized out>) at /ccs/home/etjohnson/dev/amrex/Src/Base/AMReX_Reduce.H:459
#10 amrex::Reduce::detail::call_f<amrex::ReduceOps<amrex::ReduceOpMax>::eval<amrex::FabArray<amrex::FArrayBox>, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >, main_main()::{lambda(int, int, int, int)#1}>(amrex::FabArray<amrex::FArrayBox> const&, amrex::IntVect const&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&, main_main()::{lambda(int, int, int, int)#1}&&)::{lambda(int, int, int)#1}>(amrex::ReduceOps<amrex::ReduceOpMax>::eval<amrex::FabArray<amrex::FArrayBox>, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >, main_main()::{lambda(int, int, int, int)#1}>(amrex::FabArray<amrex::FArrayBox> const&, amrex::IntVect const&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&, main_main()::{lambda(int, int, int, int)#1}&&)::{lambda(int, int, int)#1} const&, int, int, int, amrex::IndexType) (f=..., i=<optimized out>, j=<optimized out>, k=<optimized out>) at /ccs/home/etjohnson/dev/amrex/Src/Base/AMReX_Reduce.H:324
#11 amrex::ReduceOps<amrex::ReduceOpMax>::eval<amrex::ReduceData<amrex::ValLocPair<int, burn_t> >, amrex::ReduceOps<amrex::ReduceOpMax>::eval<amrex::FabArray<amrex::FArrayBox>, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >, main_main()::{lambda(int, int, int, int)#1}>(amrex::FabArray<amrex::FArrayBox> const&, amrex::IntVect const&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&, main_main()::{lambda(int, int, int, int)#1}&&)::{lambda(int, int, int)#1}>(amrex::Box const&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&&)::{lambda()#1}::operator()() const (this=<optimized out>) at /ccs/home/etjohnson/dev/amrex/Src/Base/AMReX_Reduce.H:545
#12 amrex::launch<256, amrex::ReduceOps<amrex::ReduceOpMax>::eval<amrex::ReduceData<amrex::ValLocPair<int, burn_t> >, amrex::ReduceOps<amrex::ReduceOpMax>::eval<amrex::FabArray<amrex::FArrayBox>, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >, main_main()::{lambda(int, int, int, int)#1}>(amrex::FabArray<amrex::FArrayBox> const&, amrex::IntVect const&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&, main_main()::{lambda(int, int, int, int)#1}&&)::{lambda(int, int, int)#1}>(amrex::Box const&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&&)::{lambda()#1}>(int, unsigned long, ihipStream_t*, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&&)::{lambda()#1}::operator()() const (this=<optimized out>) at /ccs/home/etjohnson/dev/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H:779
#13 _ZN5amrex13launch_globalILi256EZNS_6launchILi256EZNS_9ReduceOpsIJNS_11ReduceOpMaxEEE4evalINS_10ReduceDataIJNS_10ValLocPairIi6burn_tEEEEEZNS4_4evalINS_8FabArrayINS_9FArrayBoxEEESA_Z9main_mainvEUliiiiE_EENSt9enable_ifIXaasr10IsFabArrayIT_EE5valuesr10IsCallableIT1_iiiiEE5valueEvE4typeERKSH_RKNS_7IntVectERT0_OSI_EUliiiE_EEvRKNS_3BoxERSH_OSQ_EUlvE_EEvimP12ihipStream_tSY_EUlvE_EEvSQ_.intern.3d5caca8830a6260 () at /ccs/home/etjohnson/dev/amrex/Src/Base/AMReX_GpuLaunchGlobal.H:16
Just want to confirm: is it the case that https://github.com/AMReX-Astro/Microphysics/pull/1422 and additional PRs will be needed to fully fix this?
That's the thinking. We won't know until we do it though. Of course, ROCm could also just fix their issues...
I really want ROCm 6.0 to be available for us to test with.
@BenWibking @zingale could I meanwhile try our Quokka simulation with #1422 as the Microphysics submodule (since we have ROCm 6.0 available)? I guess we would also need to make changes in Quokka and/or Microphysics CMakeLists?
We are still seeing the same memory error and crash that we were seeing before with ROCm 6.0, so something still appears to be wrong on their end.
The test_react problem with subch_simple works now with ROCm 5.7.1 with the latest version of Microphysics. So we need to find another test problem.
unit_test/burn_cell still reports the same false positive ASAN error with ROCm 6.0. It runs fine without ASAN, though.
The debug build is still linking...
I don't think we have any more instances of pure Microphysics tests failing with ROCm > 5.3.0. For Castro, we worked around an issue and Castro now runs with ROCm 6.0: https://github.com/AMReX-Astro/Castro/pull/2749
Previously tracked as https://github.com/AMReX-Codes/amrex/issues/3623.
Reproducer:
Error message:
According to Weiqun, the ASAN error is a false positive.
The {Castro, Quokka} production simulations crash and/or produce error messages like this:
(See: https://github.com/AMReX-Astro/Castro/issues/2569 and https://github.com/quokka-astro/quokka/issues/447.)
In all cases, the errors are not seen on host-only builds or NVIDIA GPUs.