This PR makes some modifications to allow for multiple GPUs and multiple MPI ranks per GPU. There are now essentially 5 different code execution paths intertwined (new paths in bold; see the sketch after the list):
- No MPI
- No MPI w/ GPU
- MPI
- **MPI w/ GPU (nranks/ngpu = 1)**
- **MPI w/ GPU (nranks/ngpu > 1)**
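For a rough sense of how ranks end up sharing devices in the two GPU paths, here is a minimal sketch of a round-robin rank-to-GPU mapping. This is an illustration only (the `assign_gpu` helper is hypothetical, and it assumes cupy and mpi4py), not the exact logic in this PR:

```python
# Hypothetical sketch of rank-to-GPU assignment, not the exact PR code.
# Assumes cupy and mpi4py are available.
import cupy as cp
from mpi4py import MPI

def assign_gpu(comm):
    """Map MPI ranks to GPUs round-robin, so several ranks can
    share one device when comm.size exceeds the GPU count."""
    ngpu = cp.cuda.runtime.getDeviceCount()
    device_id = comm.rank % ngpu
    cp.cuda.Device(device_id).use()
    return device_id

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    gpu = assign_gpu(comm)
    print(f"rank {comm.rank} -> GPU {gpu}")
```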
The code works with up to 4 ranks per GPU, although the performance benefit beyond 2 ranks per GPU is negligible.
To work around memory errors that occurred during MPI communication, I implemented `gpu_specter.util.gather_ndarray` to gather multidimensional numpy arrays directly (without pickle serialization) using a vector-variant gather operation (Gatherv).
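For illustration, a minimal sketch of that approach with mpi4py's `Gatherv` might look like the following. This assumes contiguous arrays that differ only in their first dimension; the real `gpu_specter.util.gather_ndarray` may have a different signature and handle more cases:

```python
# Minimal sketch of a Gatherv-based gather for multidimensional numpy
# arrays; the actual gpu_specter.util.gather_ndarray may differ.
import numpy as np
from mpi4py import MPI

def gather_ndarray(sendbuf, comm, root=0):
    """Gather N-D numpy arrays (varying only in their first dimension)
    onto root without pickling, via MPI's vector-variant Gatherv."""
    sendbuf = np.ascontiguousarray(sendbuf)
    # Each rank may contribute a different number of elements
    counts = comm.gather(sendbuf.size, root=root)
    recvbuf = None
    if comm.rank == root:
        recvbuf = np.empty(sum(counts), dtype=sendbuf.dtype)
    comm.Gatherv(sendbuf, (recvbuf, counts), root=root)
    if comm.rank == root:
        # Restore trailing dimensions; first dimension is concatenated
        recvbuf = recvbuf.reshape((-1,) + sendbuf.shape[1:])
    return recvbuf
```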
The most recent commit fixes the performance issue when running with 1 GPU + MPI. It also adds a few comments to the code that determines the MPI/GPU division of labor and the communication strategy.