Something noticeable when running `ds2_large_8gpus.py` with Horovod: `batch_size=2` on `num_gpus=8` produces better results than `batch_size=16` on `num_gpus=1`. I should have dug into the codebase first, but I'd like to ask here: is distributed training more than just data parallelization? It seems it must be... and if so, what's the magic?
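For context, my mental model of plain data parallelism with Horovod is roughly the following TF 1.x sketch (not the actual OpenSeq2Seq code; the toy model, learning rate, and batch shapes are placeholders of mine): each worker computes gradients on its own per-GPU batch, and `hvd.DistributedOptimizer` averages them via allreduce, so `batch_size=2` on `num_gpus=8` and `batch_size=16` on `num_gpus=1` should both see an effective batch size of 16.

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin this worker to its own GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy model standing in for DeepSpeech2 (placeholder, not the real network).
x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
loss = tf.losses.mean_squared_error(y, tf.layers.dense(x, 1))

opt = tf.train.AdamOptimizer(learning_rate=1e-4)  # placeholder learning rate
opt = hvd.DistributedOptimizer(opt)               # allreduce-averages gradients
train_op = opt.minimize(loss)

# Rank 0 broadcasts initial weights so all workers start identical.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    for _ in range(10):
        batch_x = np.random.rand(2, 10).astype(np.float32)  # per-GPU batch of 2
        batch_y = np.random.rand(2, 1).astype(np.float32)
        sess.run(train_op, feed_dict={x: batch_x, y: batch_y})
```

If that picture is right, I'd expect the two runs to be roughly equivalent up to data ordering and per-GPU batch statistics, which is why the gap surprises me.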
PS: I used the default parameters in `ds2_large_8gpus.py`, i.e., trained on all of LibriSpeech (train-clean-100 + train-clean-360 + train-other-500), etc., except that the batch_size was changed as above and the optimizer was changed to Adam with gradient clipping (no LARC).
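Concretely, the optimizer change amounts to something like this in TF 1.x (illustrative values only; the learning rate and clip norm are placeholders, and the toy loss stands in for the DS2 CTC loss):

```python
import tensorflow as tf

# Toy loss standing in for the DS2 CTC loss.
x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
loss = tf.losses.mean_squared_error(y, tf.layers.dense(x, 1))

optimizer = tf.train.AdamOptimizer(learning_rate=1e-4)     # placeholder lr
grads, variables = zip(*optimizer.compute_gradients(loss))
clipped, _ = tf.clip_by_global_norm(grads, clip_norm=5.0)  # placeholder norm
train_op = optimizer.apply_gradients(zip(clipped, variables))
```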