The original CUDA support was limited to 1 device per rank.
418 relaxed this constraint: at most 1 device can be assigned to each rank, but more than 1 device can be visible to each rank. This allows coexistence with components that require such support, e.g. the emerging device support in https://github.com/TESSEorg/TTG
For completeness, we need to be able to drive multiple devices from a single rank. For performance reasons, many algorithms may still benefit from a 1 device/rank mapping, which improves data locality and reuse.
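
To make the two mappings concrete, here is a minimal sketch of how a rank might choose which of its visible devices to drive. It is illustrative only and does not reflect the existing implementation: the function `devices_for_rank`, its parameters, and the strided assignment policy are all assumptions, and the code uses plain integers rather than real CUDA or MPI calls.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of any existing library): returns the IDs of
// the visible devices that one rank should drive.
//   local_rank          index of this rank among the ranks on its node
//   ranks_per_node      number of ranks sharing the node
//   visible_devices     number of devices visible to this rank
//   one_device_per_rank true  -> the 1 device/rank mapping
//                       false -> a rank may drive several devices
std::vector<int> devices_for_rank(int local_rank, int ranks_per_node,
                                  int visible_devices,
                                  bool one_device_per_rank) {
  assert(ranks_per_node > 0 && visible_devices > 0);
  assert(local_rank >= 0 && local_rank < ranks_per_node);

  // Single device chosen round-robin; ranks share devices when there are
  // more ranks than devices.
  const int round_robin_device = local_rank % visible_devices;
  if (one_device_per_rank) return {round_robin_device};

  // Assumed policy: deal the visible devices out to ranks in a strided
  // fashion, so that every device is driven by exactly one rank.
  std::vector<int> devices;
  for (int d = local_rank; d < visible_devices; d += ranks_per_node)
    devices.push_back(d);

  // More ranks than devices: fall back to sharing a single device.
  if (devices.empty()) devices.push_back(round_robin_device);
  return devices;
}
```

With 4 visible devices and a single rank on the node, the multi-device policy hands all 4 devices to that rank, whereas the 1 device/rank policy uses only device 0. With 2 ranks on the node, the ranks drive devices {0, 2} and {1, 3} respectively.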