This is a first pass to offload redundancy encoding to the GPU.
MPI applications running on systems with GPUs often run a single rank per GPU, which means only a small number of ranks are available for encoding on each node. To improve performance, this executes the encode logic on the GPU. The current implementation requires a CUDA-aware MPI, because intermediate buffers are sent and received directly from GPU memory rather than being staged through the host.
This adds a new -DENABLE_CUDA=ON CMake option to compile with CUDA support. It requires nvcc to be detectable by CMake.
The performance improvement is notable for both XOR and RS. With 4 procs/node, each writing a 1 GB checkpoint file, encode time is reduced by about 20x on NVIDIA V100s compared to the CPU implementation.
The changes support both encode and (scalable) decode for XOR and RS.
The RS decode implementation could likely be improved by moving the full Gaussian solve into a single kernel to reduce the number of kernel launches, but the current version is functional.
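For reference, the arithmetic of that solve is ordinary Gauss-Jordan elimination with GF(2^8) operations, where addition is XOR. A minimal host-side sketch follows; the function names, the brute-force inverse, and the 0x11d Reed-Solomon field polynomial are illustrative assumptions, not the actual code in this PR:

```c
#include <stdint.h>

/* Shift-and-add multiply in GF(2^8) with polynomial 0x11d (assumed here;
 * the field parameters used by the library may differ). */
static uint8_t gmul(uint8_t a, uint8_t b) {
    uint16_t acc = 0, aa = a;
    while (b) {
        if (b & 1) acc ^= aa;
        aa <<= 1;
        if (aa & 0x100) aa ^= 0x11d;  /* reduce modulo the field polynomial */
        b >>= 1;
    }
    return (uint8_t)acc;
}

/* Brute-force inverse; a real implementation would use log/exp tables. */
static uint8_t ginv(uint8_t a) {
    for (int i = 1; i < 256; i++)
        if (gmul(a, (uint8_t)i) == 1) return (uint8_t)i;
    return 0;
}

#define MAXN 8

/* Solve a*x = b in place over GF(2^8); b holds x on return.
 * Returns 0 if the matrix is singular. */
static int gf_solve(int n, uint8_t a[][MAXN], uint8_t *b) {
    for (int k = 0; k < n; k++) {
        int piv = -1;
        for (int r = k; r < n; r++)
            if (a[r][k]) { piv = r; break; }
        if (piv < 0) return 0;                /* no usable pivot: singular */
        if (piv != k) {                       /* swap pivot row into place */
            for (int j = 0; j < n; j++) {
                uint8_t t = a[k][j]; a[k][j] = a[piv][j]; a[piv][j] = t;
            }
            uint8_t t = b[k]; b[k] = b[piv]; b[piv] = t;
        }
        uint8_t inv = ginv(a[k][k]);          /* normalize pivot row to 1 */
        for (int j = 0; j < n; j++) a[k][j] = gmul(inv, a[k][j]);
        b[k] = gmul(inv, b[k]);
        for (int i = 0; i < n; i++) {         /* eliminate column k elsewhere */
            uint8_t f = a[i][k];
            if (i == k || f == 0) continue;
            for (int j = 0; j < n; j++) a[i][j] ^= gmul(f, a[k][j]);
            b[i] ^= gmul(f, b[k]);
        }
    }
    return 1;
}
```

Moving this whole loop nest into one kernel launch, rather than launching per pivot step, is the improvement suggested above.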
For a multiply, the kernel does two lookups to get the logs of the operands and a third lookup to exponentiate the sum of the logs, i.e., 3 memory loads per multiply. For a 1024-thread block, that is 3*1024 memory loads.
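A minimal host-side sketch of those three lookups (the table names and the 0x11d/generator-2 field parameters are assumptions; the library's tables may differ). The exp table is doubled so the summed logs never need a modulo:

```c
#include <stdint.h>

static uint8_t gf_log[256];
static uint8_t gf_exp[512];  /* doubled so gf_exp[log a + log b] needs no mod 255 */

/* Build log/exp tables for GF(2^8) with polynomial 0x11d and generator 2. */
static void gf_init(void) {
    uint16_t x = 1;
    for (int i = 0; i < 255; i++) {
        gf_exp[i] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x <<= 1;                    /* multiply by the generator (2) */
        if (x & 0x100) x ^= 0x11d;  /* reduce modulo the field polynomial */
    }
    for (int i = 255; i < 512; i++) /* mirror: exponents are periodic mod 255 */
        gf_exp[i] = gf_exp[i - 255];
}

/* The three loads described above: gf_log[a], gf_log[b], gf_exp[sum]. */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    if (a == 0 || b == 0) return 0;
    return gf_exp[gf_log[a] + gf_log[b]];
}
```

In a kernel these tables would live in constant or shared memory; the cost is the three dependent loads per multiply either way.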
To scale a set of values by a constant in GF(2^8), this could be improved by precomputing the full 256-entry multiplication table for that constant, so that each multiply takes a single memory load. The table could be staged in CUDA shared memory at a one-time cost of 256 loads.
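A host-side sketch of that idea (names hypothetical, polynomial 0x11d assumed; the shared-memory staging is only described in a comment since this sketch runs on the CPU):

```c
#include <stddef.h>
#include <stdint.h>

/* Shift-and-add GF(2^8) multiply, used once per table entry;
 * it never appears on the per-byte fast path. */
static uint8_t gf_mul_slow(uint8_t a, uint8_t b) {
    uint16_t acc = 0, aa = a;
    while (b) {
        if (b & 1) acc ^= aa;
        aa <<= 1;
        if (aa & 0x100) aa ^= 0x11d;
        b >>= 1;
    }
    return (uint8_t)acc;
}

/* Fill the 256-entry table for one constant c. In a kernel, the threads of a
 * block would cooperatively stage this table into __shared__ memory:
 * 256 loads once, then one load per multiply. */
static void gf_fill_mul_table(uint8_t c, uint8_t table[256]) {
    for (int i = 0; i < 256; i++)
        table[i] = gf_mul_slow(c, (uint8_t)i);
}

/* Scale a buffer by c, one table load per byte, as a row-scaling
 * step in RS encoding would. */
static void gf_scale(uint8_t *buf, size_t n, const uint8_t table[256]) {
    for (size_t i = 0; i < n; i++)
        buf[i] = table[buf[i]];
}
```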
Adding nvcc to github actions seems a bit complicated
https://github.com/ptheywood/cuda-cmake-github-actions
Got to be an easier way, right? Why doesn't Nvidia maintain something if it's this hard? Good question to pose to Nvidia.

IBM MPI -pthread workaround for nvcc
The IBM MPI compiler wrappers add -pthread, which leads to a fatal error with nvcc. As a workaround, this flag can be dropped with a search/replace, as done in:
https://github.com/LLNL/blt/blob/aea5fbf046e122bd72888dad0a7f97a07b9ff08d/cmake/thirdparty/SetupMPI.cmake#L111-L119
A cleaner workaround might be to replace -pthread with -Xcompiler -pthread when building with CUDA: https://stackoverflow.com/questions/43911802/does-nvcc-support-pthread-option-internally

Alternative implementation
An alternative to GPU offloading would be to spawn threads in each MPI process to use more CPU cores. That would require either MPI_THREAD_MULTIPLE or at least thread synchronization around MPI calls. The benefit of this approach is that it would not require GPU memory, and it could be used on systems where not all cores are in use (for some reason) but no GPUs are available.