bytedance / byteps

A high performance and generic framework for distributed DNN training

worse performance than ps #29

Open wangshuaizs opened 5 years ago

wangshuaizs commented 5 years ago

Hi,

I ran an experiment following the guide at https://github.com/bytedance/byteps/blob/master/docs/best-practice.md#multi-machine-distributed-mode. I have 4 machines, each with one ConnectX-3 NIC and two K40c GPUs. I set the NIC speed to 10Gbps and use TCP only, not RoCE. Two machines are workers, the other two are parameter servers, and the scheduler runs on server-0. Only the workers use GPUs (1 or 2). I use the docker images you provide to run the experiment.

For single-GPU training (non-distributed), the training speed is 54 imgs/sec. For distributed training, the total training speed is 51.2 imgs/sec and 103.1 imgs/sec with 1-GPU workers and 2-GPU workers, respectively. In contrast, training with the regular PS architecture achieves 63.08 imgs/sec and 121.51 imgs/sec, respectively. Why is byteps worse than PS? Is there something wrong with my experiment?

The following are the specific commands:

scheduler:

docker pull bytepsimage/byteps_server
docker run -it --net=host bytepsimage/byteps_server bash
export DMLC_NUM_WORKER=2 
export DMLC_ROLE=scheduler
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=12.12.11.12
export DMLC_PS_ROOT_PORT=9000
export DMLC_INTERFACE=eth3.10
python /usr/local/byteps/launcher/launch.py

server-0 and server-1:

docker pull bytepsimage/byteps_server
docker run -it --net=host bytepsimage/byteps_server bash
export DMLC_NUM_WORKER=2 
export DMLC_ROLE=server  
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=12.12.11.12
export DMLC_PS_ROOT_PORT=9000
export DMLC_INTERFACE=eth3.10
python /usr/local/byteps/launcher/launch.py

worker-0:

docker pull bytepsimage/worker_tensorflow
nvidia-docker run -it --net=host --shm-size=32768m bytepsimage/worker_tensorflow bash
export NVIDIA_VISIBLE_DEVICES=0   # or 0,1 for the 2-GPU worker
export DMLC_WORKER_ID=0 
export DMLC_NUM_WORKER=2
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=2 
export DMLC_PS_ROOT_URI=12.12.11.12 
export DMLC_PS_ROOT_PORT=9000
export DMLC_INTERFACE=eth3.10
export EVAL_TYPE=benchmark
python /usr/local/byteps/launcher/launch.py /usr/local/byteps/example/tensorflow/run_tensorflow_byteps.sh  --model ResNet50 --num-iters 10

worker-1:

docker pull bytepsimage/worker_tensorflow
nvidia-docker run -it --net=host --shm-size=32768m bytepsimage/worker_tensorflow bash
export NVIDIA_VISIBLE_DEVICES=0 
export DMLC_WORKER_ID=1
export DMLC_NUM_WORKER=2
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=2 
export DMLC_PS_ROOT_URI=12.12.11.12 
export DMLC_PS_ROOT_PORT=9000
export DMLC_INTERFACE=eth3.10
export EVAL_TYPE=benchmark
python /usr/local/byteps/launcher/launch.py /usr/local/byteps/example/tensorflow/run_tensorflow_byteps.sh  --model ResNet50 --num-iters 10

The interconnect topology of GPUs and the NIC is as follows:

worker-0:

$ nvidia-smi topo -m
        GPU0    GPU1    mlx4_0  CPU Affinity
GPU0     X      PHB     PHB     8-15,24-31
GPU1    PHB      X      PHB     8-15,24-31
mlx4_0  PHB     PHB      X

worker-1:

$ nvidia-smi topo -m
        GPU0    GPU1    mlx5_0  mlx4_0  CPU Affinity
GPU0     X      SYS     PHB     SYS     0-7,16-23
GPU1    SYS      X      SYS     PHB     8-15,24-31
mlx5_0  PHB     SYS      X      SYS
mlx4_0  SYS     PHB     SYS      X
ymjiang commented 5 years ago

Could you please try adding more PS instances on server machines? That is, put two PS instances on one physical machine. This should help improve the performance.
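For illustration, a rough sketch of what two server instances per machine could look like, assuming each launch.py invocation starts one server instance (the DMLC_* values are the ones already used in this thread, with DMLC_NUM_SERVER raised to the new total):

# run this twice on each of the two server machines (two shells or two containers),
# giving 2 machines x 2 instances = 4 server instances in total
export DMLC_NUM_WORKER=2
export DMLC_ROLE=server
export DMLC_NUM_SERVER=4   # total number of server instances, not machines
export DMLC_PS_ROOT_URI=12.12.11.12
export DMLC_PS_ROOT_PORT=9000
export DMLC_INTERFACE=eth3.10
python /usr/local/byteps/launcher/launch.py

The scheduler and workers would need the same DMLC_NUM_SERVER value so that all roles agree on the total number of servers.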

Besides, how do you test the performance of "regular PS architecture"? The script you use for benchmark (byteps/example/tensorflow/run_tensorflow_byteps.sh) is only valid for BytePS or Horovod. I think you are using another script to measure the performance of "regular PS". So I am not sure if this is a fair comparison, since the benchmark code we provide is preliminary and could be further optimized.

bobzhuyb commented 5 years ago

I am also confused about your "regular PS architecture"... Did you modify the script yourself? If so, can you upload the regular PS script here? A lot of detailed configuration could impact the results. For example, the batch size. If possible, would you also upload the outputs? Thank you.

Also, would you run more iterations? Run a very large number of iterations, and check the performance after it stabilizes.

wangshuaizs commented 5 years ago

@ymjiang @bobzhuyb, yes, I used another script provided by the TensorFlow team: https://github.com/tensorflow/benchmarks. For a fair comparison, I tested both on a single GPU: 55.3 imgs/sec for PS and 54 imgs/sec for byteps. So I think the difference between the two models, i.e. the ResNet-50 from Keras that you use and the ResNet-50 from the TensorFlow benchmark, is negligible. The batch size in both trainings is 32 per GPU. I ran 100 iterations; generally, graph initialization finishes within the first few iterations, so I think 100 iterations are enough. Both trainings run 100 iterations, so it is also fair in this respect.

p.s. To illustrate the advantage of byteps over horovod and PS, I think it would be better for you to provide example scripts of the same models to run with horovod or PS.

bobzhuyb commented 5 years ago

@wangshuaizs Our example script is basically the same as Horovod's. https://github.com/horovod/horovod/blob/master/examples/tensorflow_synthetic_benchmark.py

We were always comparing with Horovod.

We do need to check the official examples from TF. Can you provide the detailed commands you use to start the official PS, so that we can reproduce your results?

BTW, are you using GPUs on your PS, when you run the native PS example?

wangshuaizs commented 5 years ago

@bobzhuyb, no. Though there are two GPUs on my server nodes, I didn't use them in any training. That is, I only use the CPU for parameter aggregation.

bobzhuyb commented 5 years ago

Okay. So we have two asks --

  1. Would you try Horovod's example https://github.com/horovod/horovod/blob/master/examples/tensorflow_synthetic_benchmark.py, with 1 GPU x 2 workers and with 2 GPU x 2 workers? If you are using NCCL 2.3.7, we would expect BytePS to match Horovod in the 1 GPU x 2 workers case and to outperform it in the 2 GPU x 2 workers case (a sketch command is given after this list).

  2. Would you share with us the commands you use to run native PS?
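For reference on ask 1, a minimal sketch of a 1 GPU x 2 workers run of Horovod's synthetic benchmark over TCP; the host names and interface (worker-0, worker-1, eth3.10) are placeholders for your environment, and the script's argument names are worth double-checking against the example itself:

# for 2 GPU x 2 workers, use -np 4 -H worker-0:2,worker-1:2
mpirun -np 2 -H worker-0:1,worker-1:1 \
    -bind-to none -map-by slot \
    -x LD_LIBRARY_PATH -x PATH -x NCCL_SOCKET_IFNAME=eth3.10 \
    -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include eth3.10 \
    python tensorflow_synthetic_benchmark.py --model ResNet50 --batch-size 32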

wangshuaizs commented 5 years ago

@bobzhuyb, I didn't try Horovod. The commands I used to run native PS are as follows:

• server-0:

CUDA_VISIBLE_DEVICES='' python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=1 --num_batches=100 --num_warmup_batches=10 --variable_update=parameter_server --optimizer=sgd --batch_size=32 --model=resnet50 --server_protocol=grpc --ps_hosts="10.10.10.2:2222,10.10.10.3:2222" --worker_hosts="10.10.10.4:3333,10.10.10.6:3333" --job_name=ps --task_index=0
• server-1:
CUDA_VISIBLE_DEVICES='' python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=1 --num_batches=100 --num_warmup_batches=10 --variable_update=parameter_server --optimizer=sgd --batch_size=32 --model=resnet50 --server_protocol=grpc --ps_hosts="10.10.10.2:2222,10.10.10.3:2222" --worker_hosts="10.10.10.4:3333,10.10.10.6:3333" --job_name=ps --task_index=1
• worker-0:
CUDA_VISIBLE_DEVICES='0' python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=1 --num_batches=100 --num_warmup_batches=10 --variable_update=parameter_server --optimizer=sgd --batch_size=32 --model=resnet50 --server_protocol=grpc --ps_hosts="10.10.10.2:2222,10.10.10.3:2222" --worker_hosts="10.10.10.4:3333,10.10.10.6:3333" --job_name=worker --task_index=0
• worker-1:
CUDA_VISIBLE_DEVICES='0' python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=1 --num_batches=100 --num_warmup_batches=10 --variable_update=parameter_server --optimizer=sgd --batch_size=32 --model=resnet50 --server_protocol=grpc --ps_hosts="10.10.10.2:2222,10.10.10.3:2222" --worker_hosts="10.10.10.4:3333,10.10.10.6:3333" --job_name=worker --task_index=1
bobzhuyb commented 5 years ago

Thanks. What is the result of iperf -c 10.10.10.2 from 10.10.10.4, and what about iperf -c 10.10.10.2 -P 4?

wangshuaizs commented 5 years ago

@bobzhuyb ,

$ iperf -c 10.10.10.2
------------------------------------------------------------
Client connecting to 10.10.10.2, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 10.10.10.4 port 46532 connected with 10.10.10.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  4.35 GBytes  3.73 Gbits/sec
$ iperf -c 10.10.10.2 -P 4
------------------------------------------------------------
Client connecting to 10.10.10.2, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  5] local 10.10.10.4 port 46540 connected with 10.10.10.2 port 5001
[  4] local 10.10.10.4 port 46534 connected with 10.10.10.2 port 5001
[  3] local 10.10.10.4 port 46536 connected with 10.10.10.2 port 5001
[  6] local 10.10.10.4 port 46538 connected with 10.10.10.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec  1.04 GBytes   890 Mbits/sec
[  3]  0.0-10.0 sec  1.17 GBytes  1.01 Gbits/sec
[  6]  0.0-10.0 sec   998 MBytes   837 Mbits/sec
[  4]  0.0-10.0 sec  1.23 GBytes  1.05 Gbits/sec
[SUM]  0.0-10.0 sec  4.41 GBytes  3.78 Gbits/sec
bobzhuyb commented 5 years ago

@wangshuaizs Wow, even four threads can only get 3.7Gbps?

I agree that even in this case byteps should be as good as native PS. But this bandwidth is absolutely abnormal. What about removing the rate limit on the NIC?

wangshuaizs commented 5 years ago

@bobzhuyb, I think I reserved too much switch buffer for RDMA traffic, so TCP performance was bad. After restoring the normal setting, both one thread and 4 threads can achieve 9.34Gbps.

I re-ran the previous experiments, and both byteps and native PS improved. For byteps, the total training speed is 70.7 imgs/sec and 131.9 imgs/sec with 1-GPU workers and 2-GPU workers, respectively. In contrast, training with native PS achieves 83.36 imgs/sec and 165.62 imgs/sec, respectively.

ymjiang commented 5 years ago

@wangshuaizs Based on the TensorFlow benchmark, we made a Horovod-to-BytePS transformation. Could you please try this code to run BytePS: https://github.com/ymjiang/benchmarks/commit/79a41ffb08d8cc00828f8c8aad6dfc02b9071f70

And don't forget to set --variable_update=horovod to really use BytePS.

wangshuaizs commented 5 years ago

@ymjiang, thank you for the transformation. But how do I run byteps? I tried the command I use to run native PS, replacing --variable_update=parameter_server with --variable_update=horovod, but the output is

[2019-07-01 17:10:36.809049: F byteps/common/communicator.cc:65] Check failed: getenv("BYTEPS_LOCAL_RANK") error: env BYTEPS_LOCAL_RANK not set
Aborted (core dumped)

Though I tried setting this environment variable, other environment variables are still missing. Would you mind uploading the commands you used?

ymjiang commented 5 years ago

@wangshuaizs Launch the script with our launcher. The command should be something like this:

python byteps/launcher/launch.py benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py YOUR_ARG

(The horovod ARG may be a bit different from the parameter_server case, please check)

We just want to make sure BytePS and "regular PS" both use TensorFlow, to exclude the impact of Keras.

Besides, one more thing to confirm: in your iperf test, what is the average bandwidth if you set -P 1 and -P 2?

ymjiang commented 5 years ago

@wangshuaizs just fixed a bug: https://github.com/ymjiang/benchmarks/commit/00e07d21ce1998e1f9d66189c7b25d960baee704

Sorry about my mistakes.

wangshuaizs commented 5 years ago

@ymjiang, sorry for my late response. I ran your code, i.e. byteps, and also Horovod. For byteps there are still 2 servers and 2 1-GPU workers, and for Horovod there are only 2 1-GPU workers. Both experiments ran on physical machines, without docker. The training speed of the former is 73.04 imgs/sec and of the latter 106.24 imgs/sec. In contrast, the training speed of native PS is 83.36 imgs/sec. Here are the detailed outputs of both experiments:

byteps:

$ NVIDIA_VISIBLE_DEVICES='0' DMLC_INTERFACE=eth2 DMLC_ROLE=worker DMLC_PS_ROOT_URI=12.12.11.12 DMLC_PS_ROOT_PORT=9000 DMLC_WORKER_ID=0 DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=2 python /home/shuai/test/byteps/launcher/launch.py python /home/shuai/test/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=1 --num_batches=100 --num_warmup_batches=10 --variable_update=horovod --optimizer=sgd --batch_size=32 --model=resnet50
BytePS launching worker
[00:07:24] src/customer.cc:363: Do not use thread pool for receiving.
[00:07:24] src/./zmq_van.h:285: Start ZMQ recv thread
[00:07:33] src/./zmq_van.h:285: Start ZMQ recv thread
[00:07:33] src/./zmq_van.h:285: Start ZMQ recv thread
[00:07:33] src/./zmq_van.h:285: Start ZMQ recv thread
TensorFlow:  1.10
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  64 global
             32 per device
Num batches: 100
Num epochs:  0.00
Devices:     ['horovod/gpu:0', 'horovod/gpu:1']
Data format: NCHW
Optimizer:   sgd
Variables:   horovod
==========
Generating model
W0702 00:07:39.405329 139738734077696 tf_logging.py:125] From /home/shuai/test/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1920: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-07-02 00:07:40.004248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla K40c major: 3 minor: 5 memoryClockRate(GHz): 0.745
pciBusID: 0000:83:00.0
totalMemory: 11.92GiB freeMemory: 11.80GiB
2019-07-02 00:07:40.004300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-07-02 00:07:40.004346: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-02 00:07:40.004355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0
2019-07-02 00:07:40.004361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N
2019-07-02 00:07:40.005003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11475 MB memory) -> physical GPU (device: 0, name: Tesla K40c, pci bus id: 0000:83:00.0, compute capability: 3.5)
I0702 00:07:40.771152 139738734077696 tf_logging.py:115] Running local_init_op.
I0702 00:07:41.056993 139738734077696 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step    Img/sec total_loss
10      images/sec: 36.7 +/- 1.3 (jitter = 3.5)    8.152
20      images/sec: 37.1 +/- 0.8 (jitter = 3.4)    8.058
30      images/sec: 36.6 +/- 0.7 (jitter = 5.6)    8.231
40      images/sec: 36.5 +/- 0.6 (jitter = 4.4)    8.379
50      images/sec: 36.3 +/- 0.5 (jitter = 4.0)    8.180
60      images/sec: 36.3 +/- 0.5 (jitter = 3.6)    8.495
70      images/sec: 36.5 +/- 0.4 (jitter = 4.8)    8.132
80      images/sec: 36.6 +/- 0.4 (jitter = 5.6)    7.952
90      images/sec: 36.5 +/- 0.4 (jitter = 4.7)    8.245
100     images/sec: 36.5 +/- 0.4 (jitter = 4.5)    8.285
----------------------------------------------------------------
total images/sec: 73.04
----------------------------------------------------------------

horovod:

$ mpirun -np 2 -H 12.12.10.13:1,12.12.10.12:1 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x NCCL_IB_DISABLE=1 -x NCCL_SOCKET_IFNAME=eth2 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include eth2 python /home/shuai/test/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=1 --num_batches=100 --num_warmup_batches=10 --variable_update=horovod --optimizer=sgd --batch_size=32 --model=resnet50 --horovod_device=cpu
TensorFlow:  1.10
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  64 global
             32 per device
Num batches: 100
Num epochs:  0.00
Devices:     ['horovod/gpu:0', 'horovod/gpu:1']
Data format: NCHW
Optimizer:   sgd
Variables:   horovod
Horovod on:  cpu
==========
TensorFlow:  1.10
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  64 global
             32 per device
Num batches: 100
Num epochs:  0.00
Devices:     ['horovod/gpu:0', 'horovod/gpu:1']
Data format: NCHW
Optimizer:   sgd
Variables:   horovod
Horovod on:  cpu
==========
Generating model
Generating model
W0701 23:55:42.762805 139826748643072 tf_logging.py:125] From /home/shuai/test/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1920: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
W0701 23:55:42.841541 139757391841024 tf_logging.py:125] From /home/shuai/test/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1920: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-07-01 23:55:44.173697: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla K40c major: 3 minor: 5 memoryClockRate(GHz): 0.745
pciBusID: 0000:03:00.0
totalMemory: 11.92GiB freeMemory: 11.85GiB
2019-07-01 23:55:44.173743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-07-01 23:55:44.384144: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla K40c major: 3 minor: 5 memoryClockRate(GHz): 0.745
pciBusID: 0000:83:00.0
totalMemory: 11.92GiB freeMemory: 11.85GiB
2019-07-01 23:55:44.384198: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-07-01 23:55:44.547270: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-01 23:55:44.547319: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0
2019-07-01 23:55:44.547329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N
2019-07-01 23:55:44.548071: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11475 MB memory) -> physical GPU (device: 0, name: Tesla K40c, pci bus id: 0000:03:00.0, compute capability: 3.5)
2019-07-01 23:55:44.772633: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-01 23:55:44.772691: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0
2019-07-01 23:55:44.772700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N
2019-07-01 23:55:44.773454: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11475 MB memory) -> physical GPU (device: 0, name: Tesla K40c, pci bus id: 0000:83:00.0, compute capability: 3.5)
I0701 23:55:45.281400 139826748643072 tf_logging.py:115] Running local_init_op.
I0701 23:55:45.386076 139757391841024 tf_logging.py:115] Running local_init_op.
I0701 23:55:45.579767 139826748643072 tf_logging.py:115] Done running local_init_op.
I0701 23:55:45.689666 139757391841024 tf_logging.py:115] Done running local_init_op.
Running warm up
Running warm up
Done warm up
Step    Img/sec total_loss
Done warm up
Step    Img/sec total_loss
10      images/sec: 52.3 +/- 0.7 (jitter = 0.4)    8.502
10      images/sec: 52.3 +/- 0.7 (jitter = 0.1)    8.153
20      images/sec: 52.6 +/- 0.4 (jitter = 0.5)    8.054
20      images/sec: 52.6 +/- 0.4 (jitter = 0.2)    8.057
30      images/sec: 52.9 +/- 0.3 (jitter = 0.3)    8.350
30      images/sec: 52.9 +/- 0.3 (jitter = 0.1)    8.233
40      images/sec: 53.0 +/- 0.2 (jitter = 0.3)    8.282
40      images/sec: 53.0 +/- 0.2 (jitter = 0.2)    8.380
50      images/sec: 53.0 +/- 0.2 (jitter = 0.3)    8.158
50      images/sec: 53.0 +/- 0.2 (jitter = 0.2)    8.181
60      images/sec: 53.1 +/- 0.1 (jitter = 0.3)    8.254
60      images/sec: 53.1 +/- 0.1 (jitter = 0.2)    8.491
70      images/sec: 53.1 +/- 0.1 (jitter = 0.3)    8.572
70      images/sec: 53.1 +/- 0.1 (jitter = 0.2)    8.121
80      images/sec: 53.1 +/- 0.1 (jitter = 0.3)    8.253
80      images/sec: 53.1 +/- 0.1 (jitter = 0.2)    7.952
90      images/sec: 53.1 +/- 0.1 (jitter = 0.3)    8.116
90      images/sec: 53.1 +/- 0.1 (jitter = 0.2)    8.249
100     images/sec: 53.1 +/- 0.1 (jitter = 0.3)    8.064
----------------------------------------------------------------
total images/sec: 106.24
----------------------------------------------------------------
100     images/sec: 53.1 +/- 0.1 (jitter = 0.2)    8.288
----------------------------------------------------------------
total images/sec: 106.24
----------------------------------------------------------------

And the output of iperf is as follows:

$ iperf -c 12.12.10.13 -P 1
------------------------------------------------------------
Client connecting to 12.12.10.13, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 12.12.10.14 port 34844 connected with 12.12.10.13 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  10.4 GBytes  8.90 Gbits/sec
$ iperf -c 12.12.10.13 -P 2
------------------------------------------------------------
Client connecting to 12.12.10.13, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 12.12.10.14 port 34848 connected with 12.12.10.13 port 5001
[  4] local 12.12.10.14 port 34846 connected with 12.12.10.13 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  5.27 GBytes  4.52 Gbits/sec
[  4]  0.0-10.0 sec  4.59 GBytes  3.94 Gbits/sec
[SUM]  0.0-10.0 sec  9.86 GBytes  8.46 Gbits/sec
$ iperf -c 12.12.10.13 -P 4
------------------------------------------------------------
Client connecting to 12.12.10.13, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  5] local 12.12.10.14 port 34856 connected with 12.12.10.13 port 5001
[  4] local 12.12.10.14 port 34850 connected with 12.12.10.13 port 5001
[  3] local 12.12.10.14 port 34852 connected with 12.12.10.13 port 5001
[  6] local 12.12.10.14 port 34854 connected with 12.12.10.13 port 5001
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec  3.64 GBytes  3.13 Gbits/sec
[  6]  0.0-10.0 sec  1.83 GBytes  1.57 Gbits/sec
[  4]  0.0-10.0 sec  3.64 GBytes  3.12 Gbits/sec
[  3]  0.0-10.0 sec  1.84 GBytes  1.58 Gbits/sec
[SUM]  0.0-10.0 sec  10.9 GBytes  9.40 Gbits/sec

Did you ever try running the TF benchmark in your cluster? If you did, how was the performance of byteps/horovod/ps? I think it would be better for you to run byteps/horovod/ps with the TF benchmark code in your cluster.

bobzhuyb commented 5 years ago

@wangshuaizs Thank you for the detailed traces. We are reproducing it now.

bobzhuyb commented 5 years ago

@wangshuaizs I can't reproduce your results. Not sure what is going on here. We need more diagnosis.

Here is my setup: 2 VMs, each with one V100 GPU, as workers; 2 CPU-only VMs as servers. iperf -P 1 gets 13.7Gbps; iperf -P 4 gets 21.1Gbps.

pip-installed tensorflow-gpu 1.13.1 and horovod 0.16.4; byteps is from the master branch. Benchmark code: https://github.com/bobzhuyb/benchmarks/tree/byteps_1.13 (for byteps) and https://github.com/bobzhuyb/benchmarks/tree/cnn_tf_v1.13_compatible (for horovod).

I used the same tf_cnn_benchmarks.py arguments as yours (see below). BytePS gets 389 images/sec, Horovod gets 298.29 images/sec. By the way, native PS gets the same speed (389 images/sec) as BytePS, which is kind of expected in this simple setup.

BytePS (the DMLC_* envs are configured the same as yours):

python /opt/tiger/byteps/launcher/launch.py python /root/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=1 --num_batches=100 --num_warmup_batches=10 --variable_update=horovod --optimizer=sgd --batch_size=32 --model=resnet50
Done warm up
Step    Img/sec total_loss
1       images/sec: 196.7 +/- 0.0 (jitter = 0.0)        8.170
10      images/sec: 188.1 +/- 2.3 (jitter = 5.7)        7.592
20      images/sec: 185.1 +/- 1.8 (jitter = 3.3)        7.694
30      images/sec: 188.4 +/- 2.1 (jitter = 4.4)        7.757
40      images/sec: 191.6 +/- 2.0 (jitter = 10.4)       7.999
50      images/sec: 194.3 +/- 1.8 (jitter = 16.1)       7.516
60      images/sec: 192.3 +/- 2.2 (jitter = 13.7)       7.982
70      images/sec: 193.5 +/- 1.9 (jitter = 10.5)       8.019
80      images/sec: 194.5 +/- 1.7 (jitter = 9.1)        7.931
90      images/sec: 195.9 +/- 1.6 (jitter = 9.4)        7.846
100     images/sec: 194.7 +/- 1.7 (jitter = 8.4)        7.765
----------------------------------------------------------------
total images/sec: 389.24
----------------------------------------------------------------

Horovod (note: your command does not use NCCL, so below I don't use NCCL either):

mpirun --allow-run-as-root -np 2 -H worker-0:1,worker-1:1 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x NCCL_IB_DISABLE=1 -x NCCL_SOCKET_IFNAME=eth0 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include eth0 python /root/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=1 --num_batches=100 --num_warmup_batches=10 --variable_update=horovod --optimizer=sgd --batch_size=32 --model=resnet50 --horovod_device=cpu
Done warm up
Done warm up
Step    Img/sec total_loss
1       images/sec: 146.7 +/- 0.0 (jitter = 0.0)        8.170
1       images/sec: 146.3 +/- 0.0 (jitter = 0.0)        7.906
10      images/sec: 149.1 +/- 0.6 (jitter = 1.0)        8.035
10      images/sec: 149.1 +/- 0.7 (jitter = 1.3)        7.591
20      images/sec: 149.4 +/- 0.4 (jitter = 1.4)        7.786
20      images/sec: 149.3 +/- 0.4 (jitter = 1.1)        7.694
30      images/sec: 149.2 +/- 0.4 (jitter = 1.6)        7.583
30      images/sec: 149.2 +/- 0.4 (jitter = 1.3)        7.758
40      images/sec: 149.7 +/- 0.3 (jitter = 1.3)        8.323
40      images/sec: 149.7 +/- 0.3 (jitter = 1.1)        8.004
50      images/sec: 150.0 +/- 0.3 (jitter = 1.7)        8.066
50      images/sec: 150.0 +/- 0.3 (jitter = 1.4)        7.515
60      images/sec: 150.1 +/- 0.3 (jitter = 1.7)        8.024
60      images/sec: 150.1 +/- 0.3 (jitter = 1.3)        7.981
70      images/sec: 150.2 +/- 0.2 (jitter = 1.8)        8.079
70      images/sec: 150.2 +/- 0.2 (jitter = 1.5)        8.011
80      images/sec: 148.8 +/- 0.9 (jitter = 1.5)        7.935
80      images/sec: 148.8 +/- 0.9 (jitter = 1.5)        7.905
90      images/sec: 149.1 +/- 0.8 (jitter = 1.4)        7.852
90      images/sec: 149.1 +/- 0.8 (jitter = 1.6)        7.800
100     images/sec: 149.2 +/- 0.7 (jitter = 1.7)        7.973
----------------------------------------------------------------
total images/sec: 298.29
----------------------------------------------------------------
100     images/sec: 149.2 +/- 0.7 (jitter = 1.4)        7.801
----------------------------------------------------------------
total images/sec: 298.29
----------------------------------------------------------------
wangshuaizs commented 5 years ago

@bobzhuyb, thank you for reproducing the experiment. I will dig into the problem with my environment.

By the way, what is the speed of single-GPU training (i.e., without distributed mode) on the V100?

bobzhuyb commented 5 years ago

@wangshuaizs I tried the Horovod command with -np 1; it runs at 301 images/sec.

bobzhuyb commented 4 years ago

@wangshuaizs

If you still have performance problems, there are a few things you can try. If any of the following works for you, please let us know. Though the following env variables start with MXNET, they apply to all worker frameworks (TF/MXNet/PyTorch), because the parameter server is based on MXNet.

For the parameter servers, set export MXNET_OMP_MAX_THREADS=10 if you have 16 CPU cores per server. Set export MXNET_OMP_MAX_THREADS=4 if you only have 8 CPU cores.

Set export MXNET_CPU_WORKER_NTHREADS=32. This may speed up the parameter server.

Start more parameter server instances. For example, when you have two physical machines to run the servers, you can start 4 instances (DMLC_NUM_SERVER=4), i.e. two server instances per physical machine. This will increase network bandwidth utilization, especially when a single TCP flow cannot saturate your bandwidth. A rough sketch is given below.
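Putting these tuning suggestions together, a hedged sketch of the extra environment on each parameter-server machine (the thread counts assume 16 CPU cores per machine, and the remaining DMLC_* variables stay as in the server commands earlier in this thread):

export MXNET_OMP_MAX_THREADS=10       # ~10 for 16 CPU cores, ~4 for 8 cores
export MXNET_CPU_WORKER_NTHREADS=32   # may speed up the parameter server
export DMLC_NUM_SERVER=4              # total instances: 2 per physical machine x 2 machines
# keep the other DMLC_* variables from the earlier server commands
python /usr/local/byteps/launcher/launch.py   # start this twice per machine, once per instance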