distributed training with horovod on multiple machines

NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP

https://nvidia.github.io/OpenSeq2Seq

Apache License 2.0

distributed training with horovod on multiple machines #427

Closed by riyijiye 5 years ago

riyijiye commented 5 years ago

Hi,

The command below, given in the instructions for training with Horovod, is for a single machine with multiple GPUs:

mpiexec -np python run.py --config_file=... --mode=train_eval --use_horovod=True --enable_logs

How can I run distributed training with Horovod across multiple machines, each with several GPUs?

Thanks
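
For reference, here is a minimal sketch of how a multi-machine launch command might be assembled. It assumes Open MPI (whose `-H host:slots` syntax lists the machines and the number of processes per machine), passwordless SSH between the machines, and the same code and environment on every machine. The hostnames `server1` and `server2`, the GPU counts, and the helper `build_mpirun_command` are hypothetical and only illustrate the idea; the `run.py` arguments are copied from the single-machine command above, with the config path left elided.

```python
import shlex


def build_mpirun_command(hosts, script_args):
    """Build an Open MPI launch command with one process per GPU.

    hosts: dict mapping hostname -> number of GPUs on that host (hypothetical values).
    script_args: list of arguments passed through to run.py.
    """
    # Total process count is the sum of GPUs across all machines.
    total_procs = sum(hosts.values())
    # Open MPI expects hosts as "name:slots" pairs separated by commas.
    host_list = ",".join(f"{name}:{slots}" for name, slots in hosts.items())
    return [
        "mpirun",
        "-np", str(total_procs),
        "-H", host_list,
        # Options commonly suggested for Horovod with Open MPI (assumed here).
        "-bind-to", "none",
        "-map-by", "slot",
        # Forward environment variables to the remote processes.
        "-x", "NCCL_DEBUG=INFO",
        "-x", "LD_LIBRARY_PATH",
        "-x", "PATH",
        "python", "run.py", *script_args,
    ]


# Hypothetical cluster: two machines with four GPUs each.
hosts = {"server1": 4, "server2": 4}
script_args = [
    "--config_file=...",  # path elided in the original command
    "--mode=train_eval",
    "--use_horovod=True",
    "--enable_logs",
]
command = build_mpirun_command(hosts, script_args)
print(shlex.join(command))
```

The printed command would be run from one of the machines; whether these exact options suit a given cluster depends on the MPI installation and network setup.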