bytedance / byteps

A high performance and generic framework for distributed DNN training

[Asynchronous] Why is asynchronous slower than synchronous? #271

Open idoh opened 4 years ago

idoh commented 4 years ago

After running the BytePS benchmark, I found that asynchronous training was slower than synchronous training: https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md

The asynchronous training was around 144 images/sec and the synchronous training was around 176 images/sec. In both cases, I followed the instructions in the distributed section.

Setup
I used an 8 x RTX 2080 Ti server with 64 CPU threads and the latest BytePS-PyTorch Docker images. I increased the thread count of the parameter server to 32 and of the scheduler to 16 to check whether they were the bottleneck.

Expected behavior
I expected asynchronous training to be at least as fast as, if not faster than, synchronous training.

ymjiang commented 4 years ago

Can you share your detailed setup? For example, how many workers & servers do you use?

One reason I can think of is that the asynchronous design of BytePS involves an extra memory copy before sending out the tensors, while the synchronous implementation does not have such a copy. If the speeds of all workers are similar and the network is fast, the copy overhead might dominate, so you won't see the benefits of asynchrony in such cases.
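
As a rough back-of-envelope illustration (all numbers below are assumptions for a ResNet-50-sized gradient, not measurements):

```python
# Back-of-envelope: when does the extra async memory copy matter?
# Assumed figures: ~25.6M fp32 parameters (ResNet-50), ~10 GB/s effective
# memcpy bandwidth, ~10 GB/s effective network bandwidth (fast local links).
grad_bytes = 25.6e6 * 4            # ~102 MB of gradients per iteration
memcpy_bw = 10e9                   # bytes/s, assumed
network_bw = 10e9                  # bytes/s, assumed

copy_ms = grad_bytes / memcpy_bw * 1e3   # extra cost, async path only
send_ms = grad_bytes / network_bw * 1e3

print(f"copy ~{copy_ms:.1f} ms vs send ~{send_ms:.1f} ms per iteration")
# On a fast network the copy is comparable to the transfer itself, so asynchrony
# buys little; on a slow or congested network send_ms grows and the copy
# becomes negligible.
```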

idoh commented 4 years ago

Thanks for the explanation and the fast response. I'm running everything on a single machine: 2 workers, 1 parameter server, and 1 scheduler. I know that training locally would be faster, but I wanted to try the distributed setup before launching on real servers. I'm using the BytePS ResNet-50 benchmark script (byteps/example/pytorch/benchmark_byteps.py), and I trained both synchronously and asynchronously in this setup.
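
For reference, this is roughly how the four processes were laid out on the machine (a sketch using the DMLC_* environment variables from the tutorial; the address/port values are placeholders, and if I read the docs right, BYTEPS_ENABLE_ASYNC=1 is what switched the runs to asynchronous mode):

```python
# Sketch of the per-process environment on the single machine:
# 1 scheduler, 1 server, 2 workers, all pointing at the same root URI.
# Address/port are placeholders, not the values I actually used.
common = {
    "DMLC_NUM_WORKER": "2",
    "DMLC_NUM_SERVER": "1",
    "DMLC_PS_ROOT_URI": "127.0.0.1",   # placeholder scheduler address
    "DMLC_PS_ROOT_PORT": "1234",       # placeholder scheduler port
    # "BYTEPS_ENABLE_ASYNC": "1",      # set on every process for the async runs
}

roles = [
    {"DMLC_ROLE": "scheduler"},
    {"DMLC_ROLE": "server"},
    {"DMLC_ROLE": "worker", "DMLC_WORKER_ID": "0", "NVIDIA_VISIBLE_DEVICES": "0,1,2,3"},
    {"DMLC_ROLE": "worker", "DMLC_WORKER_ID": "1", "NVIDIA_VISIBLE_DEVICES": "4,5,6,7"},
]

for extra in roles:
    # each merged dict is exported to one process before running the launcher
    print({**common, **extra})
```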

What I don't understand is: what additional copy overhead does asynchronous training have if both modes run in the distributed setup?

ymjiang commented 4 years ago

See the memory copy operation in PyTorch: https://github.com/bytedance/byteps/blob/master/byteps/torch/__init__.py#L191. (We also have a similar implementation for TF and MXNet.)
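
Schematically, the difference is something like this (just a sketch, not the actual source; push_pull below is a stand-in for the BytePS push-pull op):

```python
import torch

def push_pull(tensor, name):
    """Stand-in for the BytePS push-pull communication op (not the real API)."""
    return tensor

# Synchronous path: the gradient tensor itself is handed to the communication op.
def send_sync(param, name):
    push_pull(param.grad, name)

# Asynchronous path: the gradient is first copied into a persistent staging
# buffer and the buffer is what gets sent; that copy is the extra per-tensor
# overhead mentioned above.
_staging = {}

def send_async(param, name):
    buf = _staging.setdefault(name, torch.zeros_like(param.grad))
    buf.copy_(param.grad)              # the extra memory copy, async mode only
    push_pull(buf, name)
```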

As I said, you may expect to see the advantages of asynchrony in a real distributed setup, where the workers have different training speeds.

idoh commented 3 years ago

After a bit of digging, I believe the problem is that in asynchronous training the communication does not overlap with the backward computation: https://github.com/bytedance/byteps/blob/master/byteps/torch/__init__.py#L132

After changing asynchronous training to send gradients during the backward pass, it trains slightly faster than synchronous training.
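
Conceptually the change looks like this (a minimal sketch of the idea rather than my actual patch; push_pull_async and wait are hypothetical stand-ins for the BytePS push-pull op and its synchronize call):

```python
import torch

# Hypothetical stand-ins for the BytePS ops (not the real API):
# push_pull_async(tensor, name) returns a handle, wait(handle) blocks until done.
def push_pull_async(tensor, name):
    return name

def wait(handle):
    pass

def hook_gradients(model, handles):
    """Launch communication per parameter as soon as its gradient is ready,
    so it overlaps with the rest of the backward pass instead of starting
    only after loss.backward() returns."""
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        def make_hook(name=name):
            def hook(grad):
                # fires during backward, right after this parameter's grad is computed
                handles[name] = push_pull_async(grad, name)
                return grad
            return hook
        p.register_hook(make_hook())

# Training loop sketch:
# handles = {}
# hook_gradients(model, handles)      # once, before training
# loss.backward()                     # communication now overlaps computation
# for h in handles.values():
#     wait(h)
# optimizer.step()
```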