BenWibking closed this 11 months ago
For reference, both Castro and Quokka crash with memory errors like this in production sims with ROCm 5.7.0:
Memory access fault by GPU node-8 (Agent handle: 0x2975b60) on address 0x800033773000. Reason: Unknown.
Following @zingale's suggestion, I commented out the line that calls amrex::InitRandom, and the tests then ran without any errors from ASAN (other than unrelated memory leaks).
diff --git a/Src/Base/AMReX.cpp b/Src/Base/AMReX.cpp
index 4449dab19..0d6fe8138 100644
--- a/Src/Base/AMReX.cpp
+++ b/Src/Base/AMReX.cpp
@@ -622,7 +622,7 @@ amrex::Initialize (int& argc, char**& argv, bool build_parm_parse,
//
// Initialize random seed after we're running in parallel.
//
- amrex::InitRandom(ParallelDescriptor::MyProc()+1, ParallelDescriptor::NProcs());
+ //amrex::InitRandom(ParallelDescriptor::MyProc()+1, ParallelDescriptor::NProcs());
// For thread safety, we should do these initializations here.
BaseFab_Initialize();
I could not see anything wrong with the code in amrex::InitRandom. This might be a false positive or a bug in xnack.
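For readers unfamiliar with the call in the diff above: its arguments are the rank plus one and the total number of processes, so each MPI rank presumably ends up with its own seed. The sketch below only illustrates that per-rank seeding idea on the host, using the C++ standard library. The names make_rank_engine and first_draws are hypothetical, and nothing here reflects how AMReX actually implements InitRandom or its GPU kernel.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical stand-in for a per-rank seeding routine (NOT AMReX's code).
// `seed` mirrors the first argument in the diff (rank + 1),
// `nprocs` mirrors the second (total number of processes).
inline std::mt19937_64 make_rank_engine (std::uint64_t seed, int nprocs)
{
    // Combine the seed and the process count into one seed sequence,
    // so different ranks start from different generator states.
    std::seed_seq seq{seed, static_cast<std::uint64_t>(nprocs)};
    return std::mt19937_64(seq);
}

// Return the first value drawn by each rank's engine, in rank order.
inline std::vector<std::uint64_t> first_draws (int nprocs)
{
    std::vector<std::uint64_t> draws;
    for (int rank = 0; rank < nprocs; ++rank) {
        // rank + 1 keeps the seed nonzero, as in the original call.
        auto engine = make_rank_engine(static_cast<std::uint64_t>(rank) + 1, nprocs);
        draws.push_back(engine());
    }
    return draws;
}
```

Calling first_draws(4) returns one value per rank, and repeated calls give the same values, which is the reproducibility property a fixed per-rank seed is meant to provide.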
This is very puzzling. Is there a way to manually examine HIP device symbols in the binary, like nm for host code?
I'll close this and move it to Microphysics.
AMDGPU ASAN (included in ROCm 5.7+) reports a buffer overflow triggered within amrex::InitRandom at this GPU kernel: https://github.com/AMReX-Codes/amrex/blob/d36463103daed09a40cdea235041a6ab79ff280c/Src/Base/AMReX_Random.cpp#L54
The above can be reproduced with:
This should compile and link in < 5 minutes.
I can reproduce this with both the Microphysics and Quokka unit tests but, very strangely, not with the AMReX tests. Both codes have also crashed with unexplained memory errors under ROCm 5.7, which might be related to this.
@WeiqunZhang @zingale