This PR makes some modifications to allow for multiple GPUs and multiple MPI ranks per GPU. There are now essentially 5 different code execution paths intertwined (new paths in bold; see the sketch after the list):
- No MPI
- No MPI w/ GPU
- MPI
- **MPI w/ GPU (nranks/ngpu = 1)**
- **MPI w/ GPU (nranks/ngpu > 1)**
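For a rough sense of how ranks end up sharing devices in the two GPU paths, here is a minimal sketch of a round-robin rank-to-GPU mapping. This is an illustration only (the `assign_gpu` helper is hypothetical, and it assumes cupy and mpi4py), not the exact logic in this PR:

```python
# Hypothetical sketch of rank-to-GPU assignment, not the exact PR code.
# Assumes cupy and mpi4py are available.
import cupy as cp
from mpi4py import MPI

def assign_gpu(comm):
    """Map MPI ranks to GPUs round-robin, so several ranks can
    share one device when comm.size exceeds the GPU count."""
    ngpu = cp.cuda.runtime.getDeviceCount()
    device_id = comm.rank % ngpu
    cp.cuda.Device(device_id).use()
    return device_id

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    gpu = assign_gpu(comm)
    print(f"rank {comm.rank} -> GPU {gpu}")
```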
The code works with up to 4 ranks per GPU, although the performance benefit beyond 2 ranks per GPU is negligible.
To work around memory errors that occurred during MPI communication, I implemented `gpu_specter.util.gather_ndarray` to gather multidimensional numpy arrays directly (without pickle serialization) using a vector-variant gather operation (Gatherv).
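For illustration, a minimal sketch of that approach with mpi4py's `Gatherv` might look like the following. This assumes contiguous arrays that differ only in their first dimension; the real `gpu_specter.util.gather_ndarray` may have a different signature and handle more cases:

```python
# Minimal sketch of a Gatherv-based gather for multidimensional numpy
# arrays; the actual gpu_specter.util.gather_ndarray may differ.
import numpy as np
from mpi4py import MPI

def gather_ndarray(sendbuf, comm, root=0):
    """Gather N-D numpy arrays (varying only in their first dimension)
    onto root without pickling, via MPI's vector-variant Gatherv."""
    sendbuf = np.ascontiguousarray(sendbuf)
    # Each rank may contribute a different number of elements
    counts = comm.gather(sendbuf.size, root=root)
    recvbuf = None
    if comm.rank == root:
        recvbuf = np.empty(sum(counts), dtype=sendbuf.dtype)
    comm.Gatherv(sendbuf, (recvbuf, counts), root=root)
    if comm.rank == root:
        # Restore trailing dimensions; first dimension is concatenated
        recvbuf = recvbuf.reshape((-1,) + sendbuf.shape[1:])
    return recvbuf
```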
The most recent commit fixes the performance issue when running with 1 GPU + MPI. It also adds a few comments to the code that determines the MPI/GPU division of labor and the communication strategy.