Closed by ChaiBapchya 4 years ago
@ChuyangDeng, as discussed offline: this functionality would require adding the OpenMPI/Horovod-related lines to the generic container so that it can run Horovod/MPI jobs. For reference, see: https://github.com/aws/deep-learning-containers/blob/master/mxnet/training/docker/1.6.0/py3/Dockerfile.gpu
Currently, the generic container is bare-bones and has no MPI/SSH/Horovod setup. Hence we are skipping the Horovod tests on mxnet.cpu [generic container].
The same applies to the TensorFlow training toolkit:
https://github.com/aws/sagemaker-tensorflow-training-toolkit/blob/a22e3df0faf66b215c24c1bff6f334e14c39d5cf/test/integration/local/test_horovod.py#L26-L29
https://github.com/aws/sagemaker-tensorflow-training-toolkit/blob/a22e3df0faf66b215c24c1bff6f334e14c39d5cf/test/integration/local/test_horovod.py#L36-L42