AMReX-Codes / amrex

AMReX: Software Framework for Block Structured AMR
https://amrex-codes.github.io/amrex
Other
506 stars 337 forks source link

changes since 24.04 break HIP on Frontier #3901

Closed zingale closed 2 months ago

zingale commented 2 months ago

When running Castro on Frontier with HIP with the latest AMReX, I get:

Initializing AMReX (24.04-22-g64d2360b209c)...
MPI initialized with 256 MPI processes
MPI initialized with thread support level 3
Initializing HIP...
HIP initialized with 256 devices.
Memory access fault by GPU node-10 (Agent handle: 0x68af1d0) on address 0x1334000. Reason: Unknown.
SIGABRT
Memory access fault by GPU node-9 (Agent handle: 0x68af1d0) on address 0x1334000. Reason: Unknown.
SIGABRT
Memory access fault by GPU node-11 (Agent handle: 0x68af1d0) on address 0x1334000. Reason: Unknown.

I've I drop back to 24.04, then things work fine.

Looking at the recent changes, I suspect #3881 to be the bug. I'll try bisecting

zingale commented 2 months ago

I can confirm that I run on Frontier if I don't include #3881

WeiqunZhang commented 2 months ago

Does it run if you comment out this line? https://github.com/AMReX-Codes/amrex/blob/64d2360b209c1e625bfc9a6feeb7501a618f7ed5/Src/Base/AMReX_GpuDevice.cpp#L326

WeiqunZhang commented 2 months ago

If that is the case, it looks like a compiler bug. We don't actually need that call. It was there so that the compiler does not warn about amrex_check_wavefront_size being an unused function. We could probably fix it by

diff --git a/Src/Base/AMReX_GpuDevice.cpp b/Src/Base/AMReX_GpuDevice.cpp
index 6f972040e1..e129068d4c 100644
--- a/Src/Base/AMReX_GpuDevice.cpp
+++ b/Src/Base/AMReX_GpuDevice.cpp
@@ -323,7 +323,9 @@ Device::Initialize ()
 #endif

 #if defined(AMREX_USE_HIP)
-    amrex::single_task(amrex_check_wavefront_size);
+    if (num_devices_used < 0) {
+        amrex::single_task(amrex_check_wavefront_size);
+    }
 #endif

     Device::profilerStart();

@zingale If this works, could you submit a PR so that we can get it fixed soon?

WeiqunZhang commented 2 months ago

We will also need #3897 merged first to pass CI.

zingale commented 2 months ago

okay, indeed, commenting out that line fixes the issue

zingale commented 2 months ago

okay, it works with your suggested fix. Thanks @WeiqunZhang . PR issued