Open hokkanen opened 5 months ago
I also now built Umpire on Mahti so I can trial this - is this sufficient for building or do you think we need additional flags?
cmake .. -DENABLE_CUDA=On -DCMAKE_INSTALL_PREFIX=/projappl/project_2004522/libraries/gcc-10.4.0/openmpi-4.1.5-cuda/cuda-12.1.1/umpire -DCMAKE_CUDA_ARCHITECTURES=80
Ah, ok, I think I see at least one reason why this might be causing errors. In regular CUDA/HIP code, one can use the same gpuFree macro for both UM and device memory, but here we need a specific call for freeing UM memory. In Vlasiator_gpu, those haven't yet been distinguished.
Also, I guess Hashinator will need to be updated to support Umpire to really benefit from it.
For Hashinator we would "just" need to add a new split allocator that uses Umpire.
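A rough sketch of what such an allocator could look like. It assumes the split allocator follows the usual STL allocator shape, which has not been checked against Hashinator here. `MockPool`, `defaultPool` and `PoolBackedAllocator` are hypothetical names, and the pool is a CPU-only stand-in for an Umpire pool so that the example runs anywhere:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical stand-in for a pool backend (e.g. an Umpire pool).
// It forwards to malloc/free and counts calls so the sketch runs on a CPU.
struct MockPool {
    std::size_t live = 0;   // allocations currently outstanding
    std::size_t total = 0;  // allocations ever made
    void* allocate(std::size_t bytes) {
        void* p = std::malloc(bytes);
        if (p) { ++live; ++total; }
        return p;
    }
    void deallocate(void* p) {
        if (p) { std::free(p); --live; }
    }
};

// One process-wide pool for the sketch (assumption: real code would
// obtain the pool from the memory manager instead).
inline MockPool& defaultPool() {
    static MockPool pool;
    return pool;
}

// STL-style allocator that routes every request through the pool.
template <typename T>
struct PoolBackedAllocator {
    using value_type = T;
    PoolBackedAllocator() = default;
    template <typename U>
    PoolBackedAllocator(const PoolBackedAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        void* p = defaultPool().allocate(n * sizeof(T));
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) noexcept {
        defaultPool().deallocate(p);
    }
};

// All instances share one pool, so any two allocators compare equal.
template <typename T, typename U>
bool operator==(const PoolBackedAllocator<T>&, const PoolBackedAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const PoolBackedAllocator<T>&, const PoolBackedAllocator<U>&) { return false; }
```

A container would then be declared as `std::vector<int, PoolBackedAllocator<int>>`, and every allocation and release of its storage goes through the pool rather than a direct runtime call.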
I also now built Umpire on Mahti so I can trial this - is this sufficient for building or do you think we need additional flags?
cmake .. -DENABLE_CUDA=On -DCMAKE_INSTALL_PREFIX=/projappl/project_2004522/libraries/gcc-10.4.0/openmpi-4.1.5-cuda/cuda-12.1.1/umpire -DCMAKE_CUDA_ARCHITECTURES=80
I think that should probably be ok. I didn't specify the CUDA architecture, but if it works, then you shouldn't need other stuff.
Myep, even after fixing those two calls it still complains on exit:
(Grid) rank 0 is noderank 0 of 1
Done setting all 64 instances of device mesh wrapper handler!
(MAIN): Completed grid initialization.
(MAIN): Starting main simulation loop.
(MAIN): Completed requested simulation. Exiting.
terminate called after throwing an instance of 'umpire::runtime_error'
what(): ! Umpire runtime_error [/projappl/project_2004522/libraries/gcc-10.4.0/openmpi-4.1.5-cuda/cuda-12.1.1/Umpire/src/umpire/util/AllocationMap.cpp:255]: Cannot remove 0x7ff453000000
Backtrace: 13 frames
0 0x617a92 No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x617a92]
1 0x61931b No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x61931b]
2 0x619948 No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x619948]
3 0x77c3be No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x77c3be]
4 0x70d6ea No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x70d6ea]
5 0x76050d No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x76050d]
6 0x4b2a73 No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x4b2a73]
7 0x629373 No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x629373]
8 0x6294ea No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x6294ea]
9 0x6178b8 No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x6178b8]
10 0x440d8b No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x440d8b]
11 0x7fffbe4c8cf3 No dladdr: /lib64/libc.so.6(__libc_start_main+0xf3) [0x7fffbe4c8cf3]
12 0x44d7ce No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x44d7ce]
[g1101:2996122] *** Process received signal ***
[g1101:2996122] Signal: Aborted (6)
[g1101:2996122] Signal code: (-6)
[g1101:2996122] [ 0] /lib64/libc.so.6(+0x4eb20)[0x7fffbe4dcb20]
[g1101:2996122] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7fffbe4dca9f]
[g1101:2996122] [ 2] /lib64/libc.so.6(abort+0x127)[0x7fffbe4afe05]
[g1101:2996122] [ 3] /appl/spack/v020/install-tree/gcc-8.5.0/gcc-10.4.0-2oazqj/lib64/libstdc++.so.6(+0xa27bc)[0x7fffbec787bc]
[g1101:2996122] [ 4] /appl/spack/v020/install-tree/gcc-8.5.0/gcc-10.4.0-2oazqj/lib64/libstdc++.so.6(+0xad766)[0x7fffbec83766]
[g1101:2996122] [ 5] /appl/spack/v020/install-tree/gcc-8.5.0/gcc-10.4.0-2oazqj/lib64/libstdc++.so.6(+0xad7d1)[0x7fffbec837d1]
[g1101:2996122] [ 6] /appl/spack/v020/install-tree/gcc-8.5.0/gcc-10.4.0-2oazqj/lib64/libstdc++.so.6(+0xada65)[0x7fffbec83a65]
[g1101:2996122] [ 7] /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp[0x77c537]
[g1101:2996122] [ 8] /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp[0x70d6ea]
[g1101:2996122] [ 9] /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp[0x76050d]
[g1101:2996122] [10] /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp[0x4b2a73]
[g1101:2996122] [11] /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp[0x629373]
[g1101:2996122] [12] /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp[0x6294ea]
The address 0x7ff453000000 looks like a GPU-memoryspace address to me.
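The message originates in AllocationMap.cpp and says the pointer could not be removed. That is what one would expect when a pointer is released through an allocator that never recorded it, for example UM memory freed via the device path. A minimal sketch of that failure mode, assuming a pool that tracks its own pointers. `TrackedPool` and `freeIsRejected` are hypothetical names, and this is not Umpire's actual implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <stdexcept>
#include <unordered_set>

// Hypothetical pool that records every pointer it hands out.
class TrackedPool {
public:
    void* allocate(std::size_t bytes) {
        void* p = std::malloc(bytes);
        if (!p) throw std::bad_alloc();
        owned_.insert(p);
        return p;
    }
    // Refuses pointers that this pool did not hand out.
    void deallocate(void* p) {
        if (owned_.erase(p) == 0) {
            throw std::runtime_error("Cannot remove pointer: not owned by this pool");
        }
        std::free(p);
    }
    bool owns(void* p) const { return owned_.count(p) != 0; }
private:
    std::unordered_set<void*> owned_;
};

// Returns true if freeing `p` through `pool` was rejected.
inline bool freeIsRejected(TrackedPool& pool, void* p) {
    try {
        pool.deallocate(p);
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}
```

With two pools, one for device memory and one for UM, a pointer allocated from the UM pool is rejected by the device pool and accepted by its own pool. A plain runtime free would not make this distinction, which would fit the observation that one macro sufficed before.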
Interestingly, as I was unable to debug this on Mahti, I then switched to my own desktop computer with a GTX 1060. Built Umpire, compiled, ran, and... no error. :)
I notice now that the allocators constructed here do not use the syntax for umpire threadsafe allocators:
https://umpire.readthedocs.io/en/develop/sphinx/cookbook/thread_safe.html
Thus, we should either switch to a threadsafe allocator (which might be slow if it has to take a lock on every allocation) or implement a method which creates max_omp_n_threads allocators, with each CPU thread using its assigned allocator. That'll probably be less efficient in re-coalescing allocations, but might still be the better option.
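The second option could look roughly like the following. This is a sketch under assumptions: `LocalPool` is a hypothetical stand-in for a non-threadsafe pool allocator and `PerThreadPools` is an invented helper. Only `omp_get_max_threads()` and `omp_get_thread_num()` are real OpenMP calls. Without OpenMP the code falls back to a single pool and runs serially:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Index of the calling thread within the current OpenMP team (0 if no OpenMP).
inline int threadIndex() {
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

// Upper bound on the team size of a parallel region (1 if no OpenMP).
inline int maxThreads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Hypothetical single-threaded pool: must not be shared between threads.
struct LocalPool {
    std::size_t allocations = 0;
    void* allocate(std::size_t bytes) { ++allocations; return std::malloc(bytes); }
    void deallocate(void* p) { std::free(p); }
};

// One pool per thread; each thread only ever touches its own entry.
class PerThreadPools {
public:
    PerThreadPools() : pools_(static_cast<std::size_t>(maxThreads())) {}
    LocalPool& local() { return pools_[static_cast<std::size_t>(threadIndex())]; }
    std::size_t size() const { return pools_.size(); }
    std::size_t totalAllocations() const {
        std::size_t sum = 0;
        for (const LocalPool& p : pools_) sum += p.allocations;
        return sum;
    }
private:
    std::vector<LocalPool> pools_;
};

// Every iteration allocates and frees through the calling thread's pool,
// so no lock is needed. Returns the total number of allocations made.
inline std::size_t allocateInParallel(PerThreadPools& pools, int nAllocs) {
    #pragma omp parallel for
    for (int k = 0; k < nAllocs; ++k) {
        LocalPool& mine = pools.local();
        void* p = mine.allocate(256);
        mine.deallocate(p);
    }
    return pools.totalAllocations();
}
```

Two caveats with this design: the pools are sized once at construction, so the thread count must not grow afterwards, and memory has to be returned to the pool it came from, which matters if one thread frees what another allocated.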
This PR adds the Umpire memory manager for GPU pool memory allocation. However, the implementation crashes due to a silent error in the base version; see the Zulip discussion attached below. I am marking this as a draft, as it probably makes sense to fix the base version error first.
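To illustrate what a silent error means in this context: a free wrapper that discards the runtime's status code lets execution continue after a failed free, whereas a checked wrapper stops at the failing call. The sketch below uses a mock status type and a mock free function in place of the GPU runtime, so every name in it is hypothetical:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Mock status code standing in for the GPU runtime's error type.
enum class MockStatus { Success, InvalidValue };

// The single address that the mock runtime considers validly allocated.
inline void* mockKnownBlock() {
    static int block = 0;
    return &block;
}

// Mock free: succeeds only for the known block, fails for anything else.
inline MockStatus mockFree(void* p) {
    return (p == mockKnownBlock()) ? MockStatus::Success : MockStatus::InvalidValue;
}

// Unchecked wrapper: the status is discarded, so a failed free goes unnoticed.
inline void freeUnchecked(void* p) {
    (void)mockFree(p);
}

// Throws with the failing expression and its call site if the status is an error.
inline void checkStatus(MockStatus s, const char* expr, const char* file, int line) {
    if (s != MockStatus::Success) {
        throw std::runtime_error(std::string(expr) + " failed at " + file + ":" +
                                 std::to_string(line));
    }
}
#define CHECK_STATUS(call) checkStatus((call), #call, __FILE__, __LINE__)

// Checked wrapper: a failed free is reported where it happens.
inline void freeChecked(void* p) {
    CHECK_STATUS(mockFree(p));
}
```

Freeing an unknown pointer through `freeUnchecked` returns normally, while `freeChecked` throws for the same pointer. A memory manager that validates every deallocation behaves like the checked variant, which would explain why the problem only becomes visible once it is in place.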
Zulip:
"Ok, I tried to figure out what is wrong, and it looks like the problem is not the Umpire implementation, but an already existing issue in the vlasiator_gpu branch, at least since
The reason the bug only shows up with the Umpire implementation is that the Managed class in gpu_base.hpp does not have error handling (i.e., the gpuFree() just errors silently and execution continues):
If I add error handling, then the program fails at exactly the same location where the Umpire implementation fails:
with the following output (on Mahti):
"