riyijiye closed this issue 5 years ago
The Docker image was built from this Dockerfile: https://github.com/horovod/horovod/blob/master/Dockerfile
Did you set num_gpus=1 in the config file when you switched to Horovod?
No, I did not. I thought num_gpus was not used at all with Horovod. I will retry, thanks! By the way, what happens if num_gpus is not set to 1 when using Horovod?
Then each worker will try to use num_gpus GPUs, and they will run out of memory.
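To illustrate the one-GPU-per-worker idea, here is a minimal sketch of pinning each process to a single device. It assumes the job is launched with Open MPI's mpirun, which exports OMPI_COMM_WORLD_LOCAL_RANK. The helper name pin_gpu_for_worker is hypothetical and not part of Horovod or the toolkit discussed here. Horovod's own TensorFlow examples typically do the equivalent with hvd.local_rank().

```python
import os


def pin_gpu_for_worker(env=None):
    """Return the CUDA_VISIBLE_DEVICES value for this worker (one GPU per process).

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    # OMPI_COMM_WORLD_LOCAL_RANK is exported by Open MPI's mpirun for each process.
    # Assumption: fall back to GPU 0 when the variable is absent (single-process run).
    local_rank = int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
    return str(local_rank)


if __name__ == "__main__":
    # Must be set before the deep learning framework initializes CUDA.
    os.environ["CUDA_VISIBLE_DEVICES"] = pin_gpu_for_worker()
    print("This worker sees GPU:", os.environ["CUDA_VISIBLE_DEVICES"])
```

With this kind of pinning, worker 0 only sees GPU 0, worker 1 only sees GPU 1, and so on, so no two workers on the same node compete for the same device memory.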
I set num_gpus=1 in the config, but it still ends up with the same error. Any other suggestions?
Getting everything set up can be a bit tricky. NVIDIA publishes containers in which CUDA, TF, and Horovod are tested together. Could you please try the container from here: https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow ?
I am using Horovod in Docker. When I configure multi-GPU training without Horovod, training runs without issue. When I turn on use_horovod (all other config settings are exactly the same), I end up with the error below at the very beginning:
E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 1073741824 bytes on host: CUDA_ERROR_INVALID_VALUE: invalid argument
Can anyone help? Thanks!