EiffL opened 3 years ago
@kimchitsigai I'm looking at your proposed modifications, but I don't understand the motivation for these changes:
```cpp
//message.set_device(device);
message.set_device(horovod_global.controller[0]->GetRank());
```
Does this make a difference? I was thinking all the devices should be 0, since they refer to the GPU id that a given process can access, and each process can only see one device.
@EiffL The device set in `message.set_device()` is used in the `NCCLAlltoall::Execute` function to populate `response.devices` when calling `nccl_op_context_.InitNCCLComm(entries, response.devices(), communicator_id);`. In `InitNCCLComm`, `response.devices` is used as the `device_map` for the sub-communicator.
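A rough illustration of why the change matters (plain Python, hypothetical names; this only mimics how `response.devices` is consumed as a device map, it is not the Horovod implementation): if every process reports device 0, the map collapses to duplicate entries, whereas reporting the global rank keeps them distinct.

```python
def build_device_map(reported_devices):
    """Map communicator rank -> reported device id, loosely mimicking
    how response.devices is consumed as a device_map."""
    return {rank: dev for rank, dev in enumerate(reported_devices)}

# With CUDA_VISIBLE_DEVICES isolating one GPU per process, every
# process sees its own GPU as local device 0:
naive = build_device_map([0, 0, 0, 0])

# Reporting the global rank instead keeps the entries distinguishable:
by_rank = build_device_map([0, 1, 2, 3])

# All four ranks collapse onto one device id in the naive map.
assert len(set(naive.values())) == 1
assert len(set(by_rank.values())) == 4
```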
Regarding the `hvd.shutdown()` line that I proposed to insert before `exit(0)` at the end of fft_benchmark.py: I now think it is unnecessary. In basics.py, the `init()` function already registers `atexit.register(self.shutdown)`, so shutdown runs automatically at interpreter exit. There must be another reason for the deadlocks.
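A minimal sketch of the `atexit` pattern referenced above (standalone Python, not the actual basics.py code): a shutdown hook registered in `init()` still fires on `sys.exit(0)`, which is why an explicit call before exiting is redundant.

```python
import subprocess
import sys
import textwrap

# Child script mimicking the init()/shutdown() pattern in basics.py.
script = textwrap.dedent("""
    import atexit, sys

    def shutdown():
        print("shutdown called")

    def init():
        # Registered once; Python runs it when the interpreter exits.
        atexit.register(shutdown)

    init()
    sys.exit(0)   # shutdown still runs via the atexit hook
""")

result = subprocess.run([sys.executable, "-c", script],
                        capture_output=True, text=True)
print(result.stdout.strip())  # -> shutdown called
```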
There seems to be (sometimes) a deadlock scenario with 1 node / 4 GPUs, no nsys:
- Device 0 belongs to subcommunicators [0,1] and [0,2]
- Device 1 belongs to subcommunicators [0,1] and [1,3]
- Device 2 belongs to subcommunicators [2,3] and [0,2]
- Device 3 belongs to subcommunicators [2,3] and [1,3]
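The membership pattern above is consistent with a 2x2 process grid (an inferred layout, not stated explicitly in the thread), where each device joins one row subcommunicator and one column subcommunicator. A small sketch reproducing the listed pairs:

```python
def grid_subcommunicators(n_rows, n_cols):
    """Return {device: (row_comm, col_comm)} for a row-major grid of
    n_rows x n_cols devices."""
    membership = {}
    for dev in range(n_rows * n_cols):
        r, c = divmod(dev, n_cols)
        row_comm = [r * n_cols + j for j in range(n_cols)]
        col_comm = [i * n_cols + c for i in range(n_rows)]
        membership[dev] = (row_comm, col_comm)
    return membership

comms = grid_subcommunicators(2, 2)
# Matches the pairs listed above:
assert comms[0] == ([0, 1], [0, 2])
assert comms[1] == ([0, 1], [1, 3])
assert comms[2] == ([2, 3], [0, 2])
assert comms[3] == ([2, 3], [1, 3])
```

Overlapping subcommunicators like these are a classic deadlock risk with NCCL if two collectives on different communicators are launched in different orders on different ranks.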
This PR is meant to address the problems identified in #5