fmihpc / vlasiator

Vlasiator - ten letters you can count on
https://www.helsinki.fi/en/researchgroups/vlasiator

Add Umpire memory manager for GPU pool memory allocation #943

Open hokkanen opened 5 months ago

hokkanen commented 5 months ago

This PR adds the Umpire memory manager for GPU pool memory allocation. However, the implementation currently crashes due to a silent error in the base version; see the Zulip discussion attached below. I am marking this as a draft, since it probably makes sense to fix the base-version error first.
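For readers unfamiliar with Umpire, the general pattern for pooled unified-memory allocation looks roughly like the sketch below. This is a minimal example using Umpire's QuickPool strategy; the "UM_pool" allocator name and the helper function are illustrative assumptions, not the exact setup in this PR:

#include <cstddef>
#include "umpire/ResourceManager.hpp"
#include "umpire/Allocator.hpp"
#include "umpire/strategy/QuickPool.hpp"

// Allocate from a pool built on top of Umpire's unified-memory ("UM") resource.
void* allocate_from_um_pool(std::size_t bytes) {
   auto& rm = umpire::ResourceManager::getInstance();
   // makeAllocator may only be called once per name, hence the static.
   static umpire::Allocator pool =
      rm.makeAllocator<umpire::strategy::QuickPool>("UM_pool", rm.getAllocator("UM"));
   return pool.allocate(bytes);
}

Deallocation then has to go back through the same allocator object (pool.deallocate(ptr)), which is relevant to the discussion below about matching allocation and deallocation paths.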

Zulip:

"Ok, I tried to figure out what is wrong, and it looks like the problem is not the Umpire implementation, but an already existing issue in the vlasiator_gpu branch, at least since 


commit f3bc0e44fcb0e763716784d3dcdfdc92f2ec20c7 (HEAD)
Author: Markus Battarbee <markus.battarbee@gmail.com>
Date:   Thu Mar 7 08:46:25 2024 +0200

    Comment out old prefetch

The reason the bug only shows up with the Umpire implementation is that the Managed class in gpu_base.hpp has no error handling (i.e., gpuFree() just fails silently and execution continues):


// Unified memory class for inheritance
class Managed {
public:
   void *operator new(size_t len) {
      void *ptr;
      gpuMallocManaged(&ptr, len);
      gpuDeviceSynchronize();
      return ptr;
   }

   void operator delete(void *ptr) {
      gpuDeviceSynchronize();
      gpuFree(ptr);
   }

   void* operator new[] (size_t len) {
      void *ptr;
      gpuMallocManaged(&ptr, len);
      gpuDeviceSynchronize();
      return ptr;
   }

   void operator delete[] (void* ptr) {
      gpuDeviceSynchronize();
      gpuFree(ptr);
   }
};

If I add error handling, then the program fails exactly at the same location where Umpire implementation fails:


class Managed {
public:
   void *operator new(size_t len) {
      void *ptr;
      CHK_ERR(gpuMallocManaged(&ptr, len));
      CHK_ERR(gpuDeviceSynchronize());
      return ptr;
   }

   void operator delete(void *ptr) {
      CHK_ERR(gpuDeviceSynchronize());
      CHK_ERR(gpuFree(ptr));
   }

   void* operator new[] (size_t len) {
      void *ptr;
      CHK_ERR(gpuMallocManaged(&ptr, len));
      CHK_ERR(gpuDeviceSynchronize());
      return ptr;
   }

   void operator delete[] (void* ptr) {
      CHK_ERR(gpuDeviceSynchronize());
      CHK_ERR(gpuFree(ptr));
   }
};

with the following output (on Mahti):


(Grid) rank 0 is noderank 0 of 1
Done setting all 62 instances of device mesh wrapper handler!
(MAIN): Completed grid initialization.
(MAIN): Starting main simulation loop.
(MAIN): Completed requested simulation. Exiting.
driver shutting down in arch/gpu_base.hpp at line 90
srun: error: g1101: task 0: Exited with exit code 1

"

markusbattarbee commented 5 months ago

I also now built Umpire on Mahti so I can trial this - is this sufficient for building or do you think we need additional flags?


cmake .. -DENABLE_CUDA=On -DCMAKE_INSTALL_PREFIX=/projappl/project_2004522/libraries/gcc-10.4.0/openmpi-4.1.5-cuda/cuda-12.1.1/umpire -DCMAKE_CUDA_ARCHITECTURES=80

markusbattarbee commented 5 months ago

Ah, ok, I think I see at least one reason why this might be causing errors. In regular CUDA/HIP code, one can use the same gpuFree macro for both UM and device memory, but here we need a specific call for freeing UM memory. In Vlasiator_gpu, those haven't yet been distinguished.

Also, I guess Hashinator will need to be updated to support Umpire to really benefit from it.
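A rough illustration of the distinction, assuming Umpire's standard "UM" and "DEVICE" resources (the function itself is just an example, not code from this PR):

#include <cstddef>
#include "umpire/ResourceManager.hpp"
#include "umpire/Allocator.hpp"

void um_vs_device_example(std::size_t bytes) {
   auto& rm = umpire::ResourceManager::getInstance();
   umpire::Allocator um_alloc  = rm.getAllocator("UM");      // unified (managed) memory
   umpire::Allocator dev_alloc = rm.getAllocator("DEVICE");  // device-only memory

   void* managed = um_alloc.allocate(bytes);
   void* device  = dev_alloc.allocate(bytes);

   // With plain CUDA/HIP a single gpuFree() handles both pointers; with Umpire,
   // each pointer has to be returned to the allocator that produced it.
   um_alloc.deallocate(managed);
   dev_alloc.deallocate(device);
}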

kstppd commented 5 months ago

Ah, ok, I think I see at least one reason why this might be causing errors. In regular CUDA/HIP code, one can use the same gpuFree macro for both UM and device memory, but here we need a specific call for freeing UM memory. In Vlasiator_gpu, those haven't yet been distinguished.

Also, I guess Hashinator will need to be updated to support Umpire to really benefit from it.

For Hashinator we would "just" need to add a new split allocator that uses Umpire.
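Such a split allocator could plausibly be a thin std::allocator-style adaptor that forwards to an Umpire allocator. A hypothetical sketch (the interface and the "UM_pool" allocator name are assumptions, not Hashinator's actual API):

#include <cstddef>
#include "umpire/ResourceManager.hpp"
#include "umpire/Allocator.hpp"

// Hypothetical allocator adaptor: all allocations and deallocations go through
// a named Umpire allocator instead of gpuMallocManaged/gpuFree directly.
template <typename T>
struct UmpireManagedAllocator {
   using value_type = T;

   T* allocate(std::size_t n) {
      auto alloc = umpire::ResourceManager::getInstance().getAllocator("UM_pool");
      return static_cast<T*>(alloc.allocate(n * sizeof(T)));
   }

   void deallocate(T* p, std::size_t) noexcept {
      auto alloc = umpire::ResourceManager::getInstance().getAllocator("UM_pool");
      alloc.deallocate(p);
   }
};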

hokkanen commented 5 months ago

I also now built Umpire on Mahti so I can trial this - is this sufficient for building or do you think we need additional flags?

cmake .. -DENABLE_CUDA=On -DCMAKE_INSTALL_PREFIX=/projappl/project_2004522/libraries/gcc-10.4.0/openmpi-4.1.5-cuda/cuda-12.1.1/umpire -DCMAKE_CUDA_ARCHITECTURES=80

I think that should probably be ok. I didn't specify the CUDA architecture, but if it works, then you shouldn't need any other flags.

markusbattarbee commented 4 months ago

Myep, even after fixing those two calls it still complains on exit:

(Grid) rank 0 is noderank 0 of 1
Done setting all 64 instances of device mesh wrapper handler!
(MAIN): Completed grid initialization.
(MAIN): Starting main simulation loop.
(MAIN): Completed requested simulation. Exiting.
terminate called after throwing an instance of 'umpire::runtime_error'
  what():  ! Umpire runtime_error [/projappl/project_2004522/libraries/gcc-10.4.0/openmpi-4.1.5-cuda/cuda-12.1.1/Umpire/src/umpire/util/AllocationMap.cpp:255]: Cannot remove 0x7ff453000000
    Backtrace: 13 frames
    0 0x617a92 No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x617a92]
    1 0x61931b No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x61931b]
    2 0x619948 No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x619948]
    3 0x77c3be No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x77c3be]
    4 0x70d6ea No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x70d6ea]
    5 0x76050d No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x76050d]
    6 0x4b2a73 No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x4b2a73]
    7 0x629373 No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x629373]
    8 0x6294ea No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x6294ea]
    9 0x6178b8 No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x6178b8]
    10 0x440d8b No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x440d8b]
    11 0x7fffbe4c8cf3 No dladdr: /lib64/libc.so.6(__libc_start_main+0xf3) [0x7fffbe4c8cf3]
    12 0x44d7ce No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x44d7ce]

[g1101:2996122] *** Process received signal ***
[g1101:2996122] Signal: Aborted (6)
[g1101:2996122] Signal code:  (-6)
[g1101:2996122] [ 0] /lib64/libc.so.6(+0x4eb20)[0x7fffbe4dcb20]
[g1101:2996122] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7fffbe4dca9f]
[g1101:2996122] [ 2] /lib64/libc.so.6(abort+0x127)[0x7fffbe4afe05]
[g1101:2996122] [ 3] /appl/spack/v020/install-tree/gcc-8.5.0/gcc-10.4.0-2oazqj/lib64/libstdc++.so.6(+0xa27bc)[0x7fffbec787bc]
[g1101:2996122] [ 4] /appl/spack/v020/install-tree/gcc-8.5.0/gcc-10.4.0-2oazqj/lib64/libstdc++.so.6(+0xad766)[0x7fffbec83766]
[g1101:2996122] [ 5] /appl/spack/v020/install-tree/gcc-8.5.0/gcc-10.4.0-2oazqj/lib64/libstdc++.so.6(+0xad7d1)[0x7fffbec837d1]
[g1101:2996122] [ 6] /appl/spack/v020/install-tree/gcc-8.5.0/gcc-10.4.0-2oazqj/lib64/libstdc++.so.6(+0xada65)[0x7fffbec83a65]
[g1101:2996122] [ 7] /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp[0x77c537]
[g1101:2996122] [ 8] /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp[0x70d6ea]
[g1101:2996122] [ 9] /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp[0x76050d]
[g1101:2996122] [10] /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp[0x4b2a73]
[g1101:2996122] [11] /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp[0x629373]
[g1101:2996122] [12] /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp[0x6294ea]

The address 0x7ff453000000 looks like a GPU memory-space address to me.
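One pattern that can produce this particular Umpire error, for what it's worth, is a pointer reaching an Umpire allocator's deallocate() without ever having been allocated through that allocator, so its AllocationMap has no record to remove. A hedged illustration of that mismatch, not a confirmed diagnosis of this case:

#include <cstddef>
#include "umpire/ResourceManager.hpp"
#include "umpire/Allocator.hpp"

void mismatch_example(std::size_t bytes) {
   auto& rm = umpire::ResourceManager::getInstance();
   umpire::Allocator um_alloc = rm.getAllocator("UM");

   void* raw = nullptr;
   gpuMallocManaged(&raw, bytes);   // allocated outside Umpire's bookkeeping
   um_alloc.deallocate(raw);        // no AllocationMap entry for this pointer
                                    // -> "Cannot remove <address>" runtime_error
}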

Interestingly, as I was unable to debug this on Mahti, I then switched to my own desktop computer with a GTX1060. Built Umpire, compiled, ran, and... no error. :)

markusbattarbee commented 3 months ago

I notice now that the allocators constructed here do not use the syntax for Umpire's thread-safe allocators (https://umpire.readthedocs.io/en/develop/sphinx/cookbook/thread_safe.html). Thus, we should either switch to a thread-safe allocator (which might be slow if it has to take a lock on every allocation) or implement a scheme which creates max_omp_n_threads allocators, with each CPU thread using its assigned allocator. The latter will probably be less efficient at re-coalescing allocations, but might still be the better option.
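Following the linked cookbook, the two options could look roughly like this; the pool names and the per-thread scheme are illustrative assumptions, not settled design:

#include <omp.h>
#include <string>
#include <vector>
#include "umpire/ResourceManager.hpp"
#include "umpire/Allocator.hpp"
#include "umpire/strategy/QuickPool.hpp"
#include "umpire/strategy/ThreadSafeAllocator.hpp"

// Option 1: one shared pool wrapped in Umpire's ThreadSafeAllocator,
// which takes a lock on every allocation/deallocation.
umpire::Allocator make_threadsafe_pool() {
   auto& rm = umpire::ResourceManager::getInstance();
   auto pool = rm.makeAllocator<umpire::strategy::QuickPool>(
      "UM_pool", rm.getAllocator("UM"));
   return rm.makeAllocator<umpire::strategy::ThreadSafeAllocator>(
      "UM_pool_ts", pool);
}

// Option 2: one pool per OpenMP thread; each thread only touches its own pool,
// so no locking is needed, at the cost of splitting the pooled memory.
std::vector<umpire::Allocator> make_per_thread_pools() {
   auto& rm = umpire::ResourceManager::getInstance();
   std::vector<umpire::Allocator> pools;
   for (int t = 0; t < omp_get_max_threads(); ++t) {
      pools.push_back(rm.makeAllocator<umpire::strategy::QuickPool>(
         "UM_pool_" + std::to_string(t), rm.getAllocator("UM")));
   }
   // In a parallel region: pools[omp_get_thread_num()].allocate(bytes);
   return pools;
}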