DifferentiableUniverseInitiative / horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
https://eng.uber.com/horovod/

Draft PR for multi-communicator support #2

Open · EiffL opened 3 years ago

EiffL commented 3 years ago

I'm opening this draft PR to facilitate reviewing/commenting on our efforts to integrate support for multiple communicators in Horovod. It is heavily inspired by your code, @kimchitsigai and @mypey (in branches comms-idris-MP and comms-idris-JS); I've made a few cleanups and proposed a few simplifications, trying to follow the recommendations in https://github.com/horovod/horovod/issues/2139 .

Here is a summary of the approach:

Most of the code modifications turn references to global_state.controller into global_state.controller[communicator_id], where the communicator_id is usually accessible through the message or request in internal functions. I've also assumed that we'll use global_state.controller[0] for COMM_WORLD, but nothing enforces that so far.
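
For illustration only, here is a conceptual sketch of the indexing scheme described above. It is written as Python pseudocode (the code actually touched by this PR is Horovod's C++ core), and the names GlobalState, Request and the placement of communicator_id are simplifications rather than the real classes:

class GlobalState:
    def __init__(self, controllers):
        # One controller per communicator; index 0 is assumed to be COMM_WORLD,
        # although nothing enforces that yet.
        self.controller = list(controllers)

def dispatch(global_state, request):
    # The request/message carries the id of the communicator it targets,
    # which selects the matching controller instead of the single global one.
    return global_state.controller[request.communicator_id]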

Note that the implementation is not yet complete: I haven't touched the Adasum operations yet, so they ignore the multiple potential communicators and always use the one with index 0. Also, only a few TensorFlow Python operations support specifying which communicator to use for now.

Open questions

Mostly about whether to duplicate some global state variables. Remember that each controller produces a response list that is then processed BEFORE running the next controller. So, if we assume that one Horovod loop leaves these variables in a clean state, it should be OK not to duplicate them. See the sketch below for the ordering argument.
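
To make that ordering argument concrete, here is a rough sketch in Python pseudocode (not the actual C++ background loop, and the function names are placeholders) of the assumption being made: as long as the shared scratch state is back to a clean state at the end of each controller's turn, the controllers can safely share it.

def run_one_horovod_cycle(global_state):
    for communicator_id, controller in enumerate(global_state.controller):
        # Each controller computes its response list...
        responses = controller.compute_response_list()
        # ...and those responses are fully processed BEFORE moving on to the
        # next controller, so shared scratch variables are only touched by
        # one controller at a time.
        process_responses(controller, responses)
        # Assumption: at this point the shared state is clean again.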

The next set of questions is about how well optimized this approach is.

Example

Here is how one would use this:

from mpi4py import MPI

# This will be our baseline world communicator
comm = MPI.COMM_WORLD
# Split COMM_WORLD into subcommunicators
subcomm = MPI.COMM_WORLD.Split(color=MPI.COMM_WORLD.rank % 2,
                               key=MPI.COMM_WORLD.rank)

# And here is our array of communicators
comms = [comm, subcomm]

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init(comm=comms)

# Let's try to operate on some tensors
r = hvd.rank()
a = (r + 1) * tf.ones(10)

print("AlltoAll on WORLD", hvd.alltoall(a, communicator_id=0))

print("AlltoAll on sub communicator", hvd.alltoall(a, communicator_id=1))

To run this, for instance, on my little machine with 2 GPUs:

$ horovodrun -np 2  --timeline-filename my_timeline.json --timeline-mark-cycles python test_hvd.py
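
If Horovod was built with MPI support, the same script can also be launched directly with mpirun (this is the standard Horovod launch path, nothing specific to this PR); in that case the timeline is requested through environment variables rather than horovodrun flags, assuming I have the variable names right:

$ HOROVOD_TIMELINE=my_timeline.json HOROVOD_TIMELINE_MARK_CYCLES=1 mpirun -np 2 python test_hvd.py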

Note that this outputs a timeline, in which you can see whether the operations go through NCCL or not (more info in the Horovod timeline documentation).

To compile with NCCL and MPI support I'm using the following command line:

$ HOROVOD_WITHOUT_MXNET=1 HOROVOD_WITH_MPI=1 HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITHOUT_PYTORCH=1 python setup.py develop --user
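
Once built, a quick way to double check that the MPI and NCCL code paths were actually compiled in is to query Horovod's build flags from Python (these helpers are part of the stock Horovod API, not something added by this PR):

import horovod.tensorflow as hvd

hvd.init()
print("MPI built:", hvd.mpi_built(), "- MPI enabled:", hvd.mpi_enabled())
print("NCCL built:", hvd.nccl_built())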