bytedance / byteps

A high performance and generic framework for distributed DNN training

Bytescheduler global barrier in Tensorflow and Pytorch #381

Open offthewall123 opened 3 years ago

offthewall123 commented 3 years ago

The paper https://i.cs.hku.hk/~cwu/papers/yhpeng-sosp19.pdf mentions the concept of a global barrier in TensorFlow between successive iterations: the global barrier waits for all communication operations to finish before moving on to the next iteration.

But I have not found any discussion of a global barrier in TensorFlow or PyTorch. I want to confirm: is there really a global barrier in TensorFlow, and is there any code reference for it?

bobzhuyb commented 3 years ago

These are the "global barriers":

TF: optimizer.apply_gradients()
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/optimizer.py#L539

PyTorch: optimizer.step(), e.g., SGD's https://github.com/pytorch/pytorch/blob/master/torch/optim/sgd.py#L77
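
To show where that call sits relative to the gradient ops, here is a minimal TF1-style sketch (the same graph API as the linked `optimizer.py`; under TF2 this lives in `tf.compat.v1`). The toy model is hypothetical; in distributed training the optimizer is typically wrapped (e.g., by a DistributedOptimizer) so the gradients fed to `apply_gradients()` are already all-reduced across workers.

```python
import tensorflow as tf  # TF1-style graph API (tf.compat.v1 under TF2)

# Hypothetical toy model, just to have a loss to differentiate.
x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
w = tf.Variable(tf.zeros([10, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
grads_and_vars = optimizer.compute_gradients(loss)    # per-variable gradient ops
train_op = optimizer.apply_gradients(grads_and_vars)  # depends on every gradient,
                                                      # so it acts as the barrier

# Each session.run(train_op) has to wait for all gradient (and, when the
# optimizer is wrapped, all communication) ops before the variables are
# updated and the next iteration's forward pass can start.
```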

"global barrier" is a conceptual name. It just means that the framework would synchronize all the communication before moving on to the forward propagation in the next iteration.