multi-gpu training with horovod

NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP

https://nvidia.github.io/OpenSeq2Seq

Apache License 2.0


multi-gpu training with horovod #428

Closed. riyijiye closed this issue 5 years ago

riyijiye commented 5 years ago

I am using Horovod in Docker. When I configure multi-GPU training without Horovod, training runs without any issue. When I turn on use_horovod (all other settings are exactly the same), I end up with the error below at the very beginning:

E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 1073741824 bytes on host: CUDA_ERROR_INVALID_VALUE: invalid argument

Can anyone help? Thanks!

riyijiye commented 5 years ago

The Docker image was built from this Dockerfile: https://github.com/horovod/horovod/blob/master/Dockerfile

borisgin commented 5 years ago

Did you set num_gpus=1 in the config file when you switched to Horovod?

riyijiye commented 5 years ago

No, I did not. I thought num_gpus was not used at all with Horovod. I will retry, thanks! By the way, what happens if num_gpus is not set to 1 when using Horovod?

borisgin commented 5 years ago

Then each worker will try to use num_gpus GPUs, and they will run out of memory.
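To illustrate the point above, here is a minimal sketch of the two settings discussed in this thread, written in the Python-dict style that OpenSeq2Seq config files use. Only `use_horovod` and `num_gpus` come from the thread; the helper `check_horovod_gpu_setting` and the worker count of 4 are hypothetical and are not part of OpenSeq2Seq.

```python
# Fragment of an OpenSeq2Seq-style config. Only the two keys discussed in
# this thread are shown; every other setting is omitted.
base_params = {
    "use_horovod": True,  # Horovod runs one worker process per GPU
    "num_gpus": 1,        # so each worker should claim exactly one GPU
}


def check_horovod_gpu_setting(params, num_workers):
    """Return a warning string if the settings conflict, else None.

    Hypothetical helper for illustration only, not an OpenSeq2Seq function.
    num_workers is the number of Horovod processes started by the launcher.
    """
    num_gpus = params.get("num_gpus", 1)
    if params.get("use_horovod") and num_gpus != 1:
        # Every worker would try to claim num_gpus devices.
        claims = num_workers * num_gpus
        return (
            f"{num_workers} workers x {num_gpus} GPUs each = {claims} GPU "
            f"claims; set num_gpus=1 when use_horovod is enabled"
        )
    return None


# Assumed example: 4 Horovod workers on a 4-GPU machine.
ok = check_horovod_gpu_setting(base_params, num_workers=4)
bad = check_horovod_gpu_setting(
    {"use_horovod": True, "num_gpus": 4}, num_workers=4
)
print(ok)
print(bad)
```

With `num_gpus=4` and 4 workers, the sketch reports 16 GPU claims on a machine that has only 4 GPUs, which is the over-subscription described above.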

riyijiye commented 5 years ago

I set num_gpus=1 in the config, but I still end up with the same error. Any other suggestions?

okuchaiev commented 5 years ago

Getting everything set up can be a bit tricky. NVIDIA publishes containers which have CUDA + TF + Horovod tested together. Could you please try the container from here: https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow ?