Handling CUDA_VISIBLE_DEVICES

On some platforms (e.g. OLCF Summit), each MPI rank's view of the GPUs is typically restricted with CUDA_VISIBLE_DEVICES. We currently require every rank to be able to see all GPUs so that we can detect the distance between GPUs, for example:

https://github.com/cwpearson/stencil/blob/6770d3cca578d79d67bf0ea38605c27936292199/include/stencil/partition.hpp#L710-L713

If each rank can see only one GPU, every rank reports that GPU as ID 0. Our GPU topology code then treats all of those GPUs as the same device, because from any particular rank's point of view GPU0 is GPU0.

It may be possible to have the ranks report a UUID for each GPU instead of their CUDA id, and use that throughout to distinguish GPUs.
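As a rough sketch of that idea (not the library's implementation): the CUDA runtime exposes a 16-byte UUID in the `uuid` field of `cudaDeviceProp`, filled in by `cudaGetDeviceProperties`. Assuming each rank collects those bytes for its visible devices and the ranks exchange them (for example with an allgather), the UUIDs could be mapped to node-wide GPU indices as below. The type `GpuUuid` and the helper `assign_global_gpu_ids` are hypothetical names used only for illustration, and the CUDA and MPI calls are deliberately left out so the snippet compiles on its own.

```cpp
#include <array>
#include <cstddef>
#include <map>
#include <vector>

// 16 raw bytes, the same size as the `uuid.bytes` field of cudaDeviceProp.
// (Hypothetical alias; the real code might store cudaUUID_t directly.)
using GpuUuid = std::array<unsigned char, 16>;

// Hypothetical helper: given one UUID per (rank, local device) entry, as
// might be gathered from all ranks, assign each distinct physical GPU an
// index in order of first appearance. Entries with equal UUIDs receive the
// same index, even if every rank called its device "0" locally.
std::vector<int> assign_global_gpu_ids(const std::vector<GpuUuid> &uuids) {
  std::map<GpuUuid, int> seen; // UUID -> assigned index
  std::vector<int> ids;
  ids.reserve(uuids.size());
  for (const GpuUuid &u : uuids) {
    auto it = seen.find(u);
    if (it == seen.end()) {
      // First time this physical GPU appears: give it the next free index.
      it = seen.emplace(u, static_cast<int>(seen.size())).first;
    }
    ids.push_back(it->second);
  }
  return ids;
}
```

With indices like these, two ranks that each see a different physical GPU as local device 0 would end up with different indices, while two ranks sharing one GPU would get the same one.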

Once we can support this, we can allow users to tie CPU execution to CPUs with affinity for a particular GPU, which could improve performance.
