LLNL / Umpire

An application-focused API for memory management on NUMA & GPU architectures
MIT License
315 stars 51 forks source link

Runtime error when copying to constant device memory with HIP #912

Open gzagaris opened 4 days ago

gzagaris commented 4 days ago

Describe the bug

Copying data from the host to constant device memory using HIP throws a runtime error. The code works fine with CUDA , but, Umpire throws the following runtime error when compiled with HIP:

C++ exception with description "! Umpire runtime_error [.../Umpire/src/umpire/op/HipCopyOperation.cpp:31]: hipMemcpy( dest_ptr = 0xe3e480, src_ptr = 0x52fb90, length = 4096) failed with error: invalid argument

To Reproduce

Here is a code snippet that reproduces this behavior:

auto& rm = umpire::ResourceManager::getInstance();
auto const_allocator = rm.getAllocator("DEVICE_CONST");

static constexpr int N = 4;
static constexpr int BYTESIZE = N * sizeof(int);
static constexpr int TEST_VAL = 42;

auto host_allocator = rm.getAllocator("HOST");
int* HOST_DATA = static_cast< int* >(host_allocator.allocate(BYTESIZE));
for ( int i = 0; i < N; ++i ) {
    HOST_DATA[ i ] = TEST_VAL;
}

int* A_d = static_cast< int* >(const_allocator.allocate(BYTESIZE));
EXPECT_TRUE(A_d != nullptr);
rm.copy(A_d, HOST_DATA, BYTESIZE); // <------------- RUNTIME ERROR THROWN HERE!

I am compiling Umpire with -DENABLE_HIP=On -DUMPIRE_ENABLE_DEVICE_CONST=On.

Am I missing anything?

Expected behavior

I would have expected this to work and not throw a runtime error.

Compilers & Libraries (please complete the following information):

gzagaris commented 4 days ago

One more thing to add to this. I did look into the implementation a bit and I did not see any calls to hipGetSymbolAddress() in the HipConstantMemoryResource.cpp, which I would have expected and it may be related to the issue that I am seeing.

gzagaris commented 4 days ago

Following a similar approach as in CudaConstantMemoryResource.cu, I made the following changes to HipConstantMemoryResource.cpp:

diff --git a/src/umpire/resource/HipConstantMemoryResource.cpp b/src/umpire/resource/HipConstantMemoryResource.cpp
index 65c5f72c..d00103a2 100644
--- a/src/umpire/resource/HipConstantMemoryResource.cpp
+++ b/src/umpire/resource/HipConstantMemoryResource.cpp
@@ -25,7 +25,8 @@ HipConstantMemoryResource::HipConstantMemoryResource(const std::string& name, in
       m_highwatermark{0},
       m_platform{Platform::hip},
       m_offset{0},
-      m_ptr{s_umpire_internal_device_constant_memory}
+      m_ptr{nullptr},
+      m_initialized{false}
 {
 }

@@ -33,6 +34,16 @@ void* HipConstantMemoryResource::allocate(std::size_t bytes)
 {
   std::lock_guard<std::mutex> lock{m_mutex};

+  if (!m_initialized) {
+    hipError_t error = ::hipGetSymbolAddress((void**)&m_ptr, s_umpire_internal_device_constant_memory);
+
+    if (error != hipSuccess) {
+      UMPIRE_ERROR(runtime_error, umpire::fmt::format("hipGetSymbolAddress failed with error: {}", ::hipGetErrorString(error)));
+    }
+
+    m_initialized = true;
+  }
+
   char* ptr{static_cast<char*>(m_ptr) + m_offset};
   m_offset += bytes;

diff --git a/src/umpire/resource/HipConstantMemoryResource.hpp b/src/umpire/resource/HipConstantMemoryResource.hpp
index d7afac23..5e32f45f 100644
--- a/src/umpire/resource/HipConstantMemoryResource.hpp
+++ b/src/umpire/resource/HipConstantMemoryResource.hpp
@@ -39,7 +39,8 @@ class HipConstantMemoryResource : public MemoryResource {

   std::size_t m_offset;
   void* m_ptr;
-
+  bool m_initialized;
+
   std::mutex m_mutex;
 };

I didn't have any luck with that though, hipGetSymbolAddress, now throws the following runtime error:

terminate called after throwing an instance of 'umpire::runtime_error'
45:   what():  ! Umpire runtime_error [.../Umpire/src/umpire/resource/HipConstantMemoryResource.cpp:41]: hipGetSymbolAddress failed with error: invalid device symbol