NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

Is the distributed training more than simple data parallelization? #449

Closed wanglouis49 closed 5 years ago

wanglouis49 commented 5 years ago

Something noticeable when running ds2_large_8gpus.py with Horovod: batch_size=2 on num_gpus=8 produces better results than batch_size=16 on num_gpus=1. I should dig into the codebase myself, but first I'd like to ask: is the distributed training more than just data parallelization? It seems like it should be... and if so, what's the magic?
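For reference, the only intended difference between the two runs is in these base_params entries (by "batch_size" above I mean batch_size_per_gpu in the config; this is just a sketch, everything else was left at the config's defaults):

```python
# 8-GPU Horovod run, launched along the lines of:
#   mpiexec -np 8 python run.py --config_file=example_configs/speech2text/ds2_large_8gpus.py --mode=train_eval
base_params = {
    "use_horovod": True,
    "num_gpus": 8,               # I believe with Horovod the worker count actually comes from mpiexec -np
    "batch_size_per_gpu": 2,     # 2 x 8 = 16 samples per optimizer step in total
    # ... rest of the ds2_large_8gpus.py defaults ...
}

# Single-GPU run: same config except
#   "use_horovod": False,
#   "num_gpus": 1,
#   "batch_size_per_gpu": 16,    # also 16 samples per optimizer step, but all on one GPU
```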

PS: I used the default parameters in ds2_large_8gpus.py, i.e., trained on all of LibriSpeech (train-clean-100 + train-clean-360 + train-other-500), etc., except that the batch_size was changed as described above and the optimizer was changed to Adam with gradient clipping (no LARC).
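Concretely, the optimizer-related part of base_params after my edits looks roughly like this (if I remember the option name right, gradient clipping is the max_grad_norm entry; the learning rate and clipping threshold shown here are placeholders, not my exact values):

```python
base_params = {
    # ... data and model parameters left at the ds2_large_8gpus.py defaults ...
    "optimizer": "Adam",
    "optimizer_params": {},
    "lr_policy": poly_decay,          # schedule unchanged from the default config
    "lr_policy_params": {
        "learning_rate": 0.0001,      # placeholder value
        "power": 0.5,
    },
    "max_grad_norm": 1.0,             # gradient clipping by global norm (placeholder threshold)
    # no "larc_params" entry, so LARC is not used
}
```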