This is a first pass to offload redundancy encoding to the GPU.
MPI applications running on systems with GPUs often run a single rank per GPU, which means only a small number of ranks are available for encoding on each node. To improve performance, this executes the encode logic on the GPU. The current implementation requires a CUDA-aware MPI, because intermediate buffers are sent and received directly from GPU memory rather than being staged through the host.
This adds a new -DENABLE_CUDA=ON CMake option to compile with CUDA support. It requires nvcc to be detectable by CMake.
The performance improvement is notable for both XOR and RS. With 4 procs/node, each writing a 1 GB checkpoint file, encode time is reduced by about 20x on NVIDIA V100s compared to the CPU implementation.
The changes support both encode and (scalable) decode for XOR and RS.
The RS decode implementation could likely be improved by moving the full Gaussian solve into a single kernel to reduce the number of kernel launches, but the current version is functional.
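For reference, the arithmetic of that solve is ordinary Gauss-Jordan elimination with GF(2^8) operations, where addition is XOR. A minimal host-side sketch follows; the function names, the brute-force inverse, and the 0x11d Reed-Solomon field polynomial are illustrative assumptions, not the actual code in this PR:

```c
#include <stdint.h>

/* Shift-and-add multiply in GF(2^8) with polynomial 0x11d (assumed here;
 * the field parameters used by the library may differ). */
static uint8_t gmul(uint8_t a, uint8_t b) {
    uint16_t acc = 0, aa = a;
    while (b) {
        if (b & 1) acc ^= aa;
        aa <<= 1;
        if (aa & 0x100) aa ^= 0x11d;  /* reduce modulo the field polynomial */
        b >>= 1;
    }
    return (uint8_t)acc;
}

/* Brute-force inverse; a real implementation would use log/exp tables. */
static uint8_t ginv(uint8_t a) {
    for (int i = 1; i < 256; i++)
        if (gmul(a, (uint8_t)i) == 1) return (uint8_t)i;
    return 0;
}

#define MAXN 8

/* Solve a*x = b in place over GF(2^8); b holds x on return.
 * Returns 0 if the matrix is singular. */
static int gf_solve(int n, uint8_t a[][MAXN], uint8_t *b) {
    for (int k = 0; k < n; k++) {
        int piv = -1;
        for (int r = k; r < n; r++)
            if (a[r][k]) { piv = r; break; }
        if (piv < 0) return 0;                /* no usable pivot: singular */
        if (piv != k) {                       /* swap pivot row into place */
            for (int j = 0; j < n; j++) {
                uint8_t t = a[k][j]; a[k][j] = a[piv][j]; a[piv][j] = t;
            }
            uint8_t t = b[k]; b[k] = b[piv]; b[piv] = t;
        }
        uint8_t inv = ginv(a[k][k]);          /* normalize pivot row to 1 */
        for (int j = 0; j < n; j++) a[k][j] = gmul(inv, a[k][j]);
        b[k] = gmul(inv, b[k]);
        for (int i = 0; i < n; i++) {         /* eliminate column k elsewhere */
            uint8_t f = a[i][k];
            if (i == k || f == 0) continue;
            for (int j = 0; j < n; j++) a[i][j] ^= gmul(f, a[k][j]);
            b[i] ^= gmul(f, b[k]);
        }
    }
    return 1;
}
```

Moving this whole loop nest into one kernel launch, rather than launching per pivot step, is the improvement suggested above.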
For a multiply, the kernel does two lookups to get the logs of the operands and a third lookup to exponentiate the sum of the logs, i.e., 3 memory loads per multiply. For a 1024-thread block, that is 3*1024 memory loads.
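A minimal host-side sketch of those three lookups (the table names and the 0x11d/generator-2 field parameters are assumptions; the library's tables may differ). The exp table is doubled so the summed logs never need a modulo:

```c
#include <stdint.h>

static uint8_t gf_log[256];
static uint8_t gf_exp[512];  /* doubled so gf_exp[log a + log b] needs no mod 255 */

/* Build log/exp tables for GF(2^8) with polynomial 0x11d and generator 2. */
static void gf_init(void) {
    uint16_t x = 1;
    for (int i = 0; i < 255; i++) {
        gf_exp[i] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x <<= 1;                    /* multiply by the generator (2) */
        if (x & 0x100) x ^= 0x11d;  /* reduce modulo the field polynomial */
    }
    for (int i = 255; i < 512; i++) /* mirror: exponents are periodic mod 255 */
        gf_exp[i] = gf_exp[i - 255];
}

/* The three loads described above: gf_log[a], gf_log[b], gf_exp[sum]. */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    if (a == 0 || b == 0) return 0;
    return gf_exp[gf_log[a] + gf_log[b]];
}
```

In a kernel these tables would live in constant or shared memory; the cost is the three dependent loads per multiply either way.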
To scale a set of values by a constant in GF(2^8), this could be improved by precomputing the full 256-entry multiplication table for that constant, so that each multiply takes a single memory load. The table could be staged in CUDA shared memory at a one-time cost of 256 loads.
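A host-side sketch of that idea (names hypothetical, polynomial 0x11d assumed; the shared-memory staging is only described in a comment since this sketch runs on the CPU):

```c
#include <stddef.h>
#include <stdint.h>

/* Shift-and-add GF(2^8) multiply, used once per table entry;
 * it never appears on the per-byte fast path. */
static uint8_t gf_mul_slow(uint8_t a, uint8_t b) {
    uint16_t acc = 0, aa = a;
    while (b) {
        if (b & 1) acc ^= aa;
        aa <<= 1;
        if (aa & 0x100) aa ^= 0x11d;
        b >>= 1;
    }
    return (uint8_t)acc;
}

/* Fill the 256-entry table for one constant c. In a kernel, the threads of a
 * block would cooperatively stage this table into __shared__ memory:
 * 256 loads once, then one load per multiply. */
static void gf_fill_mul_table(uint8_t c, uint8_t table[256]) {
    for (int i = 0; i < 256; i++)
        table[i] = gf_mul_slow(c, (uint8_t)i);
}

/* Scale a buffer by c, one table load per byte, as a row-scaling
 * step in RS encoding would. */
static void gf_scale(uint8_t *buf, size_t n, const uint8_t table[256]) {
    for (size_t i = 0; i < n; i++)
        buf[i] = table[buf[i]];
}
```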
Adding nvcc to github actions seems a bit complicated
https://github.com/ptheywood/cuda-cmake-github-actions
Got to be an easier way, right? Why doesn't Nvidia maintain something if it's this hard? Good question to pose to Nvidia.

IBM MPI -pthread workaround for nvcc
The IBM MPI compiler wrappers add -pthread, which leads to a fatal error with nvcc. As a workaround, this flag can be dropped with a search/replace, as done in:
https://github.com/LLNL/blt/blob/aea5fbf046e122bd72888dad0a7f97a07b9ff08d/cmake/thirdparty/SetupMPI.cmake#L111-L119
A cleaner workaround might be to replace -pthread with -Xcompiler -pthread when building with CUDA: https://stackoverflow.com/questions/43911802/does-nvcc-support-pthread-option-internally

Alternative implementation
An alternative to GPU offloading would be to spawn threads in each MPI process to use more CPU cores. That would require either MPI_THREAD_MULTIPLE or at least thread synchronization around MPI calls. The benefit of this approach is that it would not require GPU memory, and it could be used on systems where not all cores are in use (for some reason) but no GPUs are available.