DifferentiableUniverseInitiative / horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
https://eng.uber.com/horovod/

Calling hvd.init() with lists of ranks #4

Closed kimchitsigai closed 3 years ago

kimchitsigai commented 3 years ago

Environment:

Framework: TensorFlow
Framework version: 2.3.0
Horovod version: 0.21.3
MPI version: 4.0.2
CUDA version: 10.1.2
NCCL version: 2.7.8-1
Python version: 3.7.6
Spark / PySpark version:
Ray version:
OS and version: RHEL 8.1
GCC version: 7.3.0
CMake version: 3.18.0

Bug report:

I'm calling HorovodBasics.init(comm=[[0,1],[2,3]]) because the code at https://github.com/DifferentiableUniverseInitiative/horovod/blob/multiple_communicators/horovod/common/basics.py#L68 appears to be designed to accept lists of ranks. The call raises an exception in MPI._addressof() at https://github.com/DifferentiableUniverseInitiative/horovod/blob/multiple_communicators/horovod/common/basics.py#L76

The same exception occurs with comm=[[0,1]].
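
For comparison, here is a minimal sketch of turning the same nested rank lists into mpi4py communicator objects, which is the kind of argument MPI._addressof() is meant to receive. It assumes mpi4py is installed and the script is launched under MPI (for example with 4 processes). The helper names color_and_key and split_world are hypothetical and are not part of Horovod or of this branch; whether init() on this branch accepts the resulting objects is not something I have confirmed.

```python
def color_and_key(rank_lists, rank):
    """Return (index of the list containing rank, position in that list).

    Returns None when the rank appears in none of the lists.
    Hypothetical helper, not part of Horovod.
    """
    for color, ranks in enumerate(rank_lists):
        if rank in ranks:
            return color, ranks.index(rank)
    return None


def split_world(rank_lists):
    """Build the calling process's sub-communicator from nested rank lists.

    Hypothetical helper. Every process in COMM_WORLD must call it,
    because Split is a collective operation.
    """
    # Imported lazily so the pure-Python helper above works without MPI.
    from mpi4py import MPI

    world = MPI.COMM_WORLD
    found = color_and_key(rank_lists, world.Get_rank())
    if found is None:
        # Ranks outside every list still take part in the collective call
        # and receive MPI.COMM_NULL.
        return world.Split(MPI.UNDEFINED, 0)
    color, key = found
    return world.Split(color, key)
```

With this in place, sub = split_world([[0,1],[2,3]]) gives each process an mpi4py communicator for its own group, and MPI._addressof(sub) can be called on that object rather than on a Python list.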

Thanks a lot, Kimchi