Something noticeable when running `ds2_large_8gpus.py` with Horovod: `batch_size=2` on `num_gpus=8` produces better results than `batch_size=16` on `num_gpus=1`. I should have dug into the codebase first, but I'd like to ask here: is distributed training more than just data parallelization? It seems it must be... and if so, what's the magic?
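For context, my mental model of plain data parallelism with Horovod is roughly the following TF 1.x sketch (not the actual OpenSeq2Seq code; the toy model, learning rate, and batch shapes are placeholders of mine): each worker computes gradients on its own per-GPU batch, and `hvd.DistributedOptimizer` averages them via allreduce, so `batch_size=2` on `num_gpus=8` and `batch_size=16` on `num_gpus=1` should both see an effective batch size of 16.

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin this worker to its own GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy model standing in for DeepSpeech2 (placeholder, not the real network).
x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
loss = tf.losses.mean_squared_error(y, tf.layers.dense(x, 1))

opt = tf.train.AdamOptimizer(learning_rate=1e-4)  # placeholder learning rate
opt = hvd.DistributedOptimizer(opt)               # allreduce-averages gradients
train_op = opt.minimize(loss)

# Rank 0 broadcasts initial weights so all workers start identical.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    for _ in range(10):
        batch_x = np.random.rand(2, 10).astype(np.float32)  # per-GPU batch of 2
        batch_y = np.random.rand(2, 1).astype(np.float32)
        sess.run(train_op, feed_dict={x: batch_x, y: batch_y})
```

If that picture is right, I'd expect the two runs to be roughly equivalent up to data ordering and per-GPU batch statistics, which is why the gap surprises me.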
PS: I used the default parameters in `ds2_large_8gpus.py`, i.e., trained on all of LibriSpeech (train-clean-100 + train-clean-360 + train-other-500), etc., except that the batch_size was changed as above and the optimizer was changed to Adam with gradient clipping (no LARC).
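Concretely, the optimizer change amounts to something like this in TF 1.x (illustrative values only; the learning rate and clip norm are placeholders, and the toy loss stands in for the DS2 CTC loss):

```python
import tensorflow as tf

# Toy loss standing in for the DS2 CTC loss.
x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
loss = tf.losses.mean_squared_error(y, tf.layers.dense(x, 1))

optimizer = tf.train.AdamOptimizer(learning_rate=1e-4)     # placeholder lr
grads, variables = zip(*optimizer.compute_gradients(loss))
clipped, _ = tf.clip_by_global_norm(grads, clip_norm=5.0)  # placeholder norm
train_op = optimizer.apply_gradients(zip(clipped, variables))
```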