The data parallel training algorithm implemented here uses all_reduce to obtain the global weight updates (by summing and averaging them). Because all_reduce makes the result available on every rank, not only on the root rank (as reduce would), there is no need to broadcast the global weights after each update iteration. We do, however, need to broadcast the initial weights to all workers once, to ensure an identical starting point for training.
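A minimal sketch of this pattern is shown below, assuming PyTorch's torch.distributed API (an assumption; the original implementation may use a different communication library such as mpi4py or Horovod). The function and parameter names here are illustrative: initial weights are broadcast once from the root rank, and gradients are averaged with all_reduce each step, so every rank ends up with identical weights without any further broadcast.

```python
# Hypothetical sketch using torch.distributed; not the original code.
import torch.distributed as dist


def broadcast_initial_weights(model, root=0):
    # One-time broadcast so every rank starts from identical weights.
    for param in model.parameters():
        dist.broadcast(param.data, src=root)


def all_reduce_gradients(model):
    # Sum gradients across all ranks, then divide by the world size to
    # average them. all_reduce leaves the result on every rank, so no
    # extra broadcast is needed after the optimizer step.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```

In a training loop, broadcast_initial_weights would be called once after model construction, and all_reduce_gradients after each backward pass, just before the optimizer step.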