Much slower than Horovod on TCP

azuresol commented 4 years ago

Describe the bug I benchmarked BytePS and Horovod's performance using this script using 4VM * 8 V100 on TCP. It turned out that the performance I got from BytePS is much lower than Horovod. I wonder if this is expected?

Horovod (4 worker VMs):

# Message size (MB) / Throughput (MB/s)
{ 64: 628.1127256680329,
  128: 1637.494250997945,
  256: 1606.363227578568,
  512: 2162.496918009337,
  1024: 2210.063743908007}

BytePS (4 worker VMs + 4 server VMs):

{ 64: 188.61641825541528,
  128: 206.0769111278037,
  256: 216.58532511299617,
  512: 216.87162749269987,
  1024: 224.94936582978187}

I also ran BytePS on a single worker setting (without server and scheduler) in order to pinpoint the issue, but the result is also lower than expected:

{ 64: 665.3096975939151,
  128: 768.8773438853825,
  256: 787.3222907256899,
  512: 772.3518927654511,
  1024: 768.8109970613676}

Environment (please complete the following information):

OS: Debian 9
GCC version: GCC 7
CUDA and NCCL version: CUDA 10 and NCCL 2.5.6
Framework (TF, PyTorch, MXNet): TF 1.15

Additional context The VM type is n1-highmem-96 on GCP with 100G networking. GPUs are connected by NVLinks (similar to DGX-1).

ymjiang commented 4 years ago

Your benchmark code sends tensors one by one. Once BytePS saws a new tensor, it requires expensive initialization of that tensor buffer. So your performance is expected. We recommend using the end-to-end benchmark to compare the performance.

Here are some other tuning tips, which should be helpful for TCP: https://github.com/bytedance/byteps/issues/230#issuecomment-611515746.

bobzhuyb commented 4 years ago

@azuresol As @ymjiang said, for each tensor, when you send for the first time, BytePS would need additional time for initializing it on both the worker side and server side. Horovod does not need this because all-reduce is stateless (while PS is stateful). However, this is just for the first round.

In addition to what @ymjiang suggests, you may also refactor your code to push_pull the same tensors for multiple times, and measure the time for the second or later rounds.

azuresol commented 4 years ago

Thanks, I will try again with the refactored benchmark. Does BytePS need expensive initialization even without servers? I assumed it was equivalent to plain NCCL operations, but I might be misunderstanding.

azuresol commented 4 years ago

A quick update: so I added a round of session.run() before measuring the actual execution. The performance number of BytePS improved, but is still lower than Horovod. Any ideas?

Horovod:
{ 64: 1776.7572510191353,
  128: 1720.4674884248109,
  256: 2045.2847935221362,
  512: 2121.0158884188995,
  1024: 2309.7685065791807}

BytePS:
{ 64: 1296.940501910737,
  128: 1457.7873321558939,
  256: 1568.298796971928,
  512: 1618.0508589138265,
  1024: 1677.4659052153513}

bobzhuyb commented 4 years ago

@azuresol Can you increase the number of server processes? You can keep using those 4 server VMs, but start more server processes on each VM.

For TCP, a single connection cannot saturate the 100Gbps bandwidth. NCCL establishes multiple rings to overcome that. Correspondingly, You need more server instances for BytePS.

Anyways, it does not make too much sense running TCP on 100Gbps networks, though it's a shame that some cloud providers do not offer RDMA on high speed network. The BytePS support for AWS EFA is under development, and we are also looking at ways to optimize BytePS TCP. It may take some time though.

azuresol commented 4 years ago

Hi @bobzhuyb, agree that RDMA could offer better performance, but it is still interesting to find out why BytePS underperforms on TCP -- IIUC, given the same bus bandwidth (even if capped by TCP at a BW smaller than 100Gbps), the theoretical algorithm bandwidth of BytePS should be twice as NCCL's, according to BytePS's docs and benchmark.

If I remember correctly, each NCCL ring opens multiple threads and TCP connections. Does BytePS do the same, or there is only one TCP connection between a pair of worker and server? Anyway, I would try adding more servers and see if things improve. Thanks.

azuresol commented 4 years ago

I used 16 servers (4 VMs in total) and 4 workers. It turns out that with more servers BytePS still underperforms :(

16 servers
{ 64: 645.2358540538997,
  128: 1114.3372700397051,
  256: 1446.9432939826472,
  512: 1890.3294603886657,
  1024: 2102.646180203336}

8 servers:
{ 64: 1002.1344917894053,
  128: 1226.0040569054331,
  256: 1526.4048331427127,
  512: 1767.0737988043104,
  1024: 1960.293212291152}

@ymjiang: OMP_WAIT_POLICY=PASSIVE trick does not work for me either.

bobzhuyb commented 4 years ago

@azuresol Your results show that more servers lead to slower throughput? It does not make sense to me.. Can you share your latest code?

Between a pair of worker/server, there is only on TCP connection. Even with more servers, there still exists bottlenecks at the ZeroMQ, which BytePS inherits from the original ps-lite. ZeroMQ has serious throughput problem (see https://github.com/zeromq/libzmq/issues/3525 that I asked libzmq a while ago). We thought we had a workaround, but it later seemed that it only solved a part of the problem. To make BytePS+TCP really work well beyond 25Gbps, we probably need to re-implement the TCP part without ZeroMQ.

azuresol commented 4 years ago

To be precise, I used 4 server VMs and increased the number of server processes by repeating the hosts in server-hostfile. My benchmark code is pretty much same as before. For completeness, here is the command I run:

/opt/conda/bin/python byteps/launcher/dist_launcher.py --worker-hostfile=WORKER_HOSTS --server-hostfile=SERVER_HOSTS --scheduler-ip=10.128.0.7 --scheduler-port=12345 --username=root --env=NVIDIA_VISIBLE_DEVICES:0,1,2,3,4,5,6,7 'echo this is $DMLC_ROLE; /opt/conda/bin/python byteps/launcher/launch.py /opt/conda/bin/python allreduce_bench.py'

I would like to make sure that it is not the issue of my setup. Let me know whether you are able to reproduce this result in 100Gbps TCP network.

bobzhuyb commented 4 years ago

@azuresol I kind of understand the problem you have with more servers. Can you set BYTEPS_PARTITION_BYTES=1024000, or maybe 512000? This should significantly improve the performance of 16 servers case. The default value 4096000 is good for 100G RDMA, but not for your case.

That said, even with this, I am afraid that ZeroMQ would still limit the BytePS throughput to no more than a single thread memcpy

azuresol commented 4 years ago

Setting BYTEPS_PARTITION_BYTES=1024000 did improve the performance of 16 servers. However, 512000 led to worse results.. What's the rationale of tuning BYTEPS_PARTITION_BYTES?

BYTEPS_PARTITION_BYTES=1024000:
{ 64: 1117.1740602953446,
  128: 1514.1138371096351,
  256: 1697.4191211669086,
  512: 2032.568272083459,
  1024: 2190.6864842206014}

BYTEPS_PARTITION_BYTES=512000:
{ 64: 694.2803987107301,
  128: 1139.2819746015773,
  256: 1525.4919535014435,
  512: 1900.7866386294816,
  1024: 2013.8684039364014}

It is true that the performance gap still exists, let alone the theoretical 2x improvements. What's the best approach to confirm that ZeroMQ is the bottleneck?

bobzhuyb commented 4 years ago

BYTEPS_PARTITION_BYTES defines how byteps partition large tensors. Smaller partition sizes give you better push/pull pipelining, but also hurt the network stack performance..

If you really want to see the performance improvement on GCP, you can use VMs with 25Gbps or even 10Gbps network. You can also use iftop to confirm that BytePS indeed saves bandwidth compared with all-reduce.

If your target scenario is only 100Gbps TCP, you may have to wait for a few weeks until we re-implement the TCP part.

azuresol commented 4 years ago

Hi @bobzhuyb: What is the typical throughput number for the test_benchmark test in 100Gbps (or your typical setup) RDMA and TCP networks? Just wanted to help myself understand how to interpret my own result. Thanks.

bobzhuyb commented 4 years ago

@azuresol If you have 100G RDMA network, test_benchmark should get you >85Gbps application throughput. For TCP, it should be similar to your iperf single connection performance.

We do have something coming very soon. https://github.com/bytedance/ps-lite/pull/31 Hopefully, we can get better performance with TCP. Stay tuned..

bytedance / byteps

Much slower than Horovod on TCP #237