Putting this in a pull request in case it is worth incorporating here. I'm using Charm4py to distribute tasks to GPUs on a distributed-memory machine. If each PE is assigned a GPU, the GPU assigned to PE 0 sits idle, since PE 0 is busy scheduling. This scheduler avoids that problem by leaving one PE idle on each host: users can then run with Num_GPUs + 1 PEs per host to use all of the GPUs on each host while reserving PE 0 for scheduling.
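The PE-to-GPU mapping this implies can be sketched in plain Python. This is an illustrative sketch only, not the scheduler's actual code, and it assumes PEs are numbered consecutively by host with Num_GPUs + 1 PEs per host; the function name `gpu_for_pe` is made up for this example:

```python
def gpu_for_pe(pe, num_gpus_per_host):
    """Map a global PE number to a local GPU index.

    The first PE on each host is reserved for scheduling and gets
    no GPU; each remaining PE on the host drives one GPU.
    Sketch under the assumption of num_gpus_per_host + 1 PEs per
    host, numbered consecutively by host.
    """
    pes_per_host = num_gpus_per_host + 1
    local_rank = pe % pes_per_host
    if local_rank == 0:
        return None  # reserved scheduling PE, no GPU assigned
    return local_rank - 1  # worker PEs map to GPUs 0..num_gpus_per_host-1

# Example: 4 GPUs per host -> run with 5 PEs per host.
# PEs 0 and 5 (the first PE on each host) get no GPU; the rest
# cover GPUs 0-3 on their host.
assignments = [gpu_for_pe(pe, 4) for pe in range(10)]
# -> [None, 0, 1, 2, 3, None, 0, 1, 2, 3]
```

With this layout every GPU on every host has a dedicated worker PE, at the cost of one extra (mostly idle) PE per host for scheduling.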