aws / sagemaker-mxnet-training-toolkit

Toolkit for running MXNet training scripts on SageMaker. Dockerfiles used for building SageMaker MXNet Containers are at https://github.com/aws/deep-learning-containers.
Apache License 2.0

Add horovod/mpi support to generic container #180

Closed ChaiBapchya closed 4 years ago

ChaiBapchya commented 4 years ago

Currently, the generic container is bare-bones and has no MPI/SSH/Horovod setup, so we skip the Horovod tests on mxnet.cpu (the generic container).

The TF toolkit handles this the same way: https://github.com/aws/sagemaker-tensorflow-training-toolkit/blob/a22e3df0faf66b215c24c1bff6f334e14c39d5cf/test/integration/local/test_horovod.py#L26-L29

https://github.com/aws/sagemaker-tensorflow-training-toolkit/blob/a22e3df0faf66b215c24c1bff6f334e14c39d5cf/test/integration/local/test_horovod.py#L36-L42
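For illustration, the kind of skip described above could look roughly like the sketch below. This is not the code from either toolkit: `mpi_available` and `HorovodSmokeTest` are hypothetical names, and checking for `mpirun` on the PATH is only an assumed proxy for "this image has the MPI/Horovod setup". The actual tests decide which images to skip in their own way.

```python
import shutil
import unittest


def mpi_available() -> bool:
    """Hypothetical helper: True if an MPI launcher (mpirun) is on PATH.

    Assumption: a missing launcher means the image is the bare-bones
    generic container without MPI/SSH/Horovod setup.
    """
    return shutil.which("mpirun") is not None


class HorovodSmokeTest(unittest.TestCase):
    # Skip instead of failing when the image cannot run MPI jobs.
    @unittest.skipUnless(mpi_available(), "no MPI launcher in image; skipping Horovod test")
    def test_distributed_training(self):
        # Placeholder body: the real integration test would launch a
        # Horovod training job here.
        self.assertTrue(mpi_available())
```

Once the generic container ships with MPI and Horovod, a guard like this would stop skipping without any change to the test itself.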

ChaiBapchya commented 4 years ago

@ChuyangDeng as discussed offline, supporting this would require adding the OpenMPI/Horovod-related lines to the generic container so that it can run Horovod/MPI jobs. For reference, see: https://github.com/aws/deep-learning-containers/blob/master/mxnet/training/docker/1.6.0/py3/Dockerfile.gpu