bytedance / byteps

A high performance and generic framework for distributed DNN training

Performance slightly lower than horovod #188

Open bengineml opened 4 years ago

bengineml commented 4 years ago

Describe the bug
We've been benchmarking Horovod vs. BytePS on AWS. We were hoping to see a performance improvement from using BytePS for 64-GPU jobs, but so far BytePS gives us about the same performance, perhaps even a few percent lower.

To Reproduce
Training a FasterRCNN model from open-mmlab on the open-images dataset on AWS.
Horovod setup: 8x p3.16xlarge workers.
BytePS setup: 8x p3.16xlarge workers, 8x c5n.4xlarge servers with the following env vars:

```bash
MXNET_OMP_MAX_THREADS=10
MXNET_CPU_WORKER_NTHREADS=32
```
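
For reference, the distributed launch follows the usual BytePS/DMLC environment-variable convention; the sketch below is illustrative rather than our exact scripts (the scheduler address, port, worker IDs, and train.py are placeholders, and the launcher script name differs between BytePS versions):

```bash
# Shared settings on every machine (placeholder scheduler address/port):
export DMLC_NUM_WORKER=8            # number of GPU worker machines
export DMLC_NUM_SERVER=8            # number of server processes
export DMLC_PS_ROOT_URI=10.0.0.1    # scheduler IP (placeholder)
export DMLC_PS_ROOT_PORT=1234       # scheduler port (placeholder)

# Scheduler (one process in total):
DMLC_ROLE=scheduler bpslaunch

# Server (one process per c5n.4xlarge):
DMLC_ROLE=server bpslaunch

# Worker (one per p3.16xlarge; DMLC_WORKER_ID ranges over 0..7):
DMLC_ROLE=worker DMLC_WORKER_ID=0 \
NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
bpslaunch python3 train.py
```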

Expected behavior
We expected to see a performance increase from using BytePS over Horovod. Are there other tweaks we can apply to try to increase the speed of BytePS? Perhaps the model-training code itself is the bottleneck and won't benefit from better distributed-training infrastructure.

Right now we're seeing the following time/step

Horovod: approx 0.591 s / step
BytePS: approx 0.615 s / step


bobzhuyb commented 4 years ago

Hello, would you tell us which version of BytePS you are using? Are you using the latest master branch / latest Docker image, or an earlier version such as the v0.1 tag (https://github.com/bytedance/byteps/tree/v0.1), or even earlier? If you are using something earlier, we strongly recommend that you move to v0.1 or master, because they have better performance. With these recent versions you can also get rid of the two env vars you mention, since BytePS no longer relies on MXNet.
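
If it helps, a typical way to get onto master is to build from source. This is just a sketch; adjust paths and the Python version to your environment:

```bash
# Build and install BytePS from the master branch (sketch only).
# --recursive pulls in the third-party submodules (e.g. ps-lite) that the build needs.
git clone --recursive https://github.com/bytedance/byteps
cd byteps
python3 setup.py install
```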

After that, you may try increasing the number of server instances, e.g., start two server instances per c5n.4xlarge. The reason is that TCP has difficulty saturating your 25Gbps network bandwidth; more server instances mean more TCP connections, and hopefully you'll get better performance.
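
For example, running two server instances on one c5n.4xlarge could look like the sketch below (the scheduler address is a placeholder; note that DMLC_NUM_SERVER must count server processes, not machines, so it becomes 16 here):

```bash
# Two BytePS server processes on the same c5n.4xlarge (sketch only).
export DMLC_ROLE=server
export DMLC_NUM_WORKER=8
export DMLC_NUM_SERVER=16           # 8 machines x 2 server processes each
export DMLC_PS_ROOT_URI=10.0.0.1    # scheduler IP (placeholder)
export DMLC_PS_ROOT_PORT=1234       # scheduler port (placeholder)
bpslaunch &                         # first server instance
bpslaunch &                         # second server instance
wait
```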

bengineml commented 4 years ago

Thanks for the pointers. The numbers in my previous comment were from an older version of BytePS. After upgrading to a more recent version (commit bb91039) and running two server instances per c5n.4xlarge, I'm getting better performance, but still about even with Horovod.

| Setup | Seconds / Step |
| --- | --- |
| Horovod | ≈0.591 |
| Old BytePS | ≈0.615 |
| New BytePS, 2x servers | ≈0.590 |

bobzhuyb commented 4 years ago

@bengineml It seems that cross-worker communication is not a bottleneck. Have you done any profiling?

We can also do a quick estimate of the communication overhead. Could you tell us the following numbers:

  1. the training speed on a single V100, using the same per-GPU batch size
  2. the training speed on one p3.16xlarge with 8 V100s, using the same per-GPU batch size
  3. the total size of the model (number of parameters × 2 or 4 bytes, depending on whether you are using fp16 or fp32)

First, we know that you can never run faster than the numbers in (1) and (2). With (3), we can approximate the per-step communication time as (total model size in bits / the 25Gbps bandwidth of a p3.16xlarge). If this communication time is much less than 100ms, you may not see much improvement from using a different distributed framework.
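
To make that concrete with made-up numbers (your actual parameter count will differ), the back-of-the-envelope calculation looks like this:

```bash
# Rough per-step communication estimate (all numbers below are placeholders).
PARAMS=41000000                          # e.g. ~41M parameters (hypothetical)
BYTES_PER_PARAM=4                        # 4 for fp32, 2 for fp16
LINK_BPS=25000000000                     # 25 Gbps on p3.16xlarge, in bits/s
# time ≈ model size in bits / link bandwidth
echo "scale=3; $PARAMS * $BYTES_PER_PARAM * 8 / $LINK_BPS" | bc   # ≈ 0.052 s, i.e. ~52 ms
```

If the real number comes out in that range, communication is only a small fraction of your ~590ms step time, which would explain why switching frameworks makes little difference.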