DifferentiableUniverseInitiative / horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
https://eng.uber.com/horovod/

PR for NCCL fixes #6

Open EiffL opened 3 years ago

EiffL commented 3 years ago

This PR is meant to address the problems identified in #5

EiffL commented 3 years ago

@kimchitsigai I'm looking at your proposed modifications, but I don't understand the motivation for these changes:

  //message.set_device(device);
  message.set_device(horovod_global.controller[0]->GetRank());

Does this make a difference? I was thinking all the devices should be 0, as they would refer to the GPU id that a given process can access, and each process can only see one device.
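
To illustrate what I mean, here is a minimal sketch (assuming each process is pinned to a single GPU, e.g. via CUDA_VISIBLE_DEVICES; this is illustrative code, not Horovod's):

  // Illustration only: when each process is pinned to one GPU, the locally
  // visible device id is 0 for every process, while the global rank differs.
  #include <cstdio>
  #include <cuda_runtime.h>
  #include <mpi.h>

  int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int visible = 0;
    cudaGetDeviceCount(&visible);  // typically 1 when CUDA_VISIBLE_DEVICES pins one GPU per process

    // Every process reports local device 0, even though the ranks differ.
    std::printf("rank %d sees %d device(s); its GPU is local device 0\n", rank, visible);

    MPI_Finalize();
    return 0;
  }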

kimchitsigai commented 3 years ago

@EiffL The device set in message.set_device() is used in the NCCLAlltoall::Execute function to populate the response.devices variable when calling nccl_op_context_.InitNCCLComm(entries, response.devices(), communicator_id). In InitNCCLComm, response.devices is used as the device_map for the sub-communicator.
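
To make the role of that device list concrete, here is a simplified sketch (illustrative only, not Horovod's actual InitNCCLComm; the function and parameter names are made up) of how a per-rank device list can serve as a device_map when an NCCL sub-communicator is created:

  // Simplified illustration (not Horovod's code): the values collected through
  // message.set_device() form a rank-ordered device list; when the NCCL
  // sub-communicator is created, that list tells each rank which GPU to use.
  #include <vector>
  #include <cuda_runtime.h>
  #include <nccl.h>

  ncclComm_t InitSubComm(const std::vector<int>& device_map,  // one device per sub-communicator rank
                         int my_subcomm_rank,                 // this process's rank in the sub-communicator
                         ncclUniqueId id) {
    // Pick the GPU recorded for this rank in the map.
    cudaSetDevice(device_map[my_subcomm_rank]);

    ncclComm_t comm = nullptr;
    ncclCommInitRank(&comm, static_cast<int>(device_map.size()), id, my_subcomm_rank);
    return comm;
  }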

kimchitsigai commented 3 years ago

Regarding the hvd.shutdown() line of code that I proposed to insert before exit(0) at the end of the fft_benchmark.py file, I think it is unnecessary: in basics.py, the init() function already calls atexit.register(self.shutdown). So there must be another reason for the deadlocks.
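
For reference, the registration pattern I am referring to, sketched in C++ for illustration (Horovod's basics.py does the equivalent in Python with atexit.register(self.shutdown) inside init()):

  // Illustrative sketch of the pattern: shutdown is registered once at init
  // time, so calling it explicitly right before exiting is redundant -- the
  // registered handler runs automatically at normal process exit.
  #include <cstdio>
  #include <cstdlib>

  static void shutdown() {
    std::puts("shutting down (runs automatically at exit)");
  }

  static void init() {
    std::atexit(shutdown);  // analogous to atexit.register(self.shutdown) in basics.py
  }

  int main() {
    init();
    // ... work ...
    return 0;  // shutdown() runs here without an explicit call
  }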

kimchitsigai commented 3 years ago

There sometimes seems to be a deadlock scenario with 1 node / 4 GPUs, no nsys.

Device 0 belongs to subcommunicators [0,1] and [0,2].
Device 1 belongs to subcommunicators [0,1] and [1,3].
Device 2 belongs to subcommunicators [2,3] and [0,2].
Device 3 belongs to subcommunicators [2,3] and [1,3].

  1. Process 0 is the first one to receive the shutdown signal from its fft_benchmark.py.
  2. It sends MPI messages on subcommunicator 2 ([0,1]). After process 0 receives the response, it starts its Background Thread shutdown process (one subcommunicator ready to shut down is sufficient to stop the Background Thread).
  3. By coincidence, process 1 gets its own shutdown signal from fft_benchmark.py just before entering ComputeResponseList for subcommunicator 2 ([0,1]). The result is that processes 0 and 1 shut down their Background Threads.
  4. At some point between the reception of the responses and the end of the Background Thread of processes 0 and 1, process 2 receives the shutdown signal from its fft_benchmark.py.
  5. Process 2 then tries to send an MPI message on communicator 0, which includes all the processes (by coincidence, ComputeResponseList is called for communicator 0 at this moment), but the message is not sent and the MPI_Allreduce() never returns, as seen in the logs. A minimal sketch of this kind of hang follows after this list.
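
To make step 5 concrete, here is a minimal MPI sketch of this kind of hang (illustration only, not Horovod code; run with 4 ranks, it deadlocks on purpose):

  // Intentional deadlock demo: a collective on a communicator only completes
  // once every member of that communicator enters it. Here ranks 0 and 1 skip
  // the collective (analogous to having already shut down their Background
  // Thread), so ranks 2 and 3 block in MPI_Allreduce forever.
  #include <mpi.h>

  int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int in = rank, out = 0;
    if (rank >= 2) {
      // Only ranks 2 and 3 enter the collective on the global communicator:
      // it can never complete, mirroring the MPI_Allreduce() that never returns.
      MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
  }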