DifferentiableUniverseInitiative / horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
https://eng.uber.com/horovod/

Problem when calling hvd.init() with a list of ranks #3

Closed: kimchitsigai closed this issue 3 years ago

kimchitsigai commented 3 years ago

Environment:

  1. Framework: TensorFlow
  2. Framework version: 2.3.0
  3. Horovod version: 0.21.3
  4. MPI version: 4.0.2
  5. CUDA version: 10.1.2
  6. NCCL version: 2.7.8-1
  7. Python version: 3.7.6
  8. Spark / PySpark version:
  9. Ray version:
  10. OS and version: RHEL 8.1
  11. GCC version: 7.3.0
  12. CMake version: 3.18.0

Bug report: I'm calling HorovodBasics.init(comm=[[0,1],[2,3]]), since the code at https://github.com/DifferentiableUniverseInitiative/horovod/blob/multiple_communicators/horovod/common/basics.py#L68 appears to be designed to accept a list of rank lists. The call raises an exception in MPI._addressof() at https://github.com/DifferentiableUniverseInitiative/horovod/blob/multiple_communicators/horovod/common/basics.py#L76

The same exception occurs with comm=[[0,1]].
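
For context, mpi4py's MPI._addressof() expects an MPI object such as a communicator, so a plain Python list reaching that call would likely fail. The sketch below shows one possible way to turn rank lists like [[0,1],[2,3]] into mpi4py sub-communicators. The helper names color_for_rank and split_by_groups are hypothetical, and nothing in this thread confirms what argument types the fork's init() accepts on any branch.

```python
def color_for_rank(rank, groups):
    """Return the index of the rank list containing `rank`, or None.

    Hypothetical helper: `groups` is a list of rank lists, e.g. [[0, 1], [2, 3]].
    """
    for color, ranks in enumerate(groups):
        if rank in ranks:
            return color
    return None


def split_by_groups(groups):
    """Return this process's sub-communicator for `groups`.

    Hypothetical helper. mpi4py is imported lazily, so the function above
    can be used without an MPI runtime installed.
    """
    from mpi4py import MPI

    world = MPI.COMM_WORLD
    rank = world.Get_rank()
    color = color_for_rank(rank, groups)
    # Ranks outside every group pass MPI.UNDEFINED and receive MPI.COMM_NULL.
    return world.Split(MPI.UNDEFINED if color is None else color, rank)
```

Under mpirun with four processes, split_by_groups([[0, 1], [2, 3]]) would give ranks 0 and 1 one communicator and ranks 2 and 3 another.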

Thanks a lot, Kimchi

kimchitsigai commented 3 years ago

Sorry, I was on the wrong branch.