wangshuaizs opened this issue 5 years ago
Could you please try adding more PS instances on server machines? That is, put two PS instances on one physical machine. This should help improve the performance.
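For reference (ignoring the Docker wrapping), a rough sketch of one extra server instance started with the launcher; the address, port, and DMLC_NUM_SERVER value below are only illustrative, and DMLC_NUM_SERVER counts all instances across both server machines:
$ # launch this twice on the same server machine to get two instances there
$ DMLC_ROLE=server DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=4 \
    DMLC_PS_ROOT_URI=SCHEDULER_IP DMLC_PS_ROOT_PORT=9000 \
    python byteps/launcher/launch.py &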
Besides, how did you measure the performance of the "regular PS architecture"? The benchmark script you used (byteps/example/tensorflow/run_tensorflow_byteps.sh) is only valid for BytePS or Horovod, so I assume you are using another script to measure the "regular PS". I am therefore not sure whether this is a fair comparison, since the benchmark code we provide is preliminary and could be further optimized.
I am also confused about your "regular PS architecture"... Did you modify the script yourself? If so, can you upload the regular PS script here? Many configuration details can affect the results, for example the batch size. If possible, would you also upload the outputs? Thank you.
Also, would you run more iterations? Run a very large number of iterations, and check the performance after it stabilizes.
@ymjiang @bobzhuyb , yes, I used another script, provided by the TensorFlow team: https://github.com/tensorflow/benchmarks. For a fair comparison, I tested their performance on a single GPU: native PS reaches 55.3 imgs/sec and BytePS reaches 54 imgs/sec, so I think the difference between the two models, i.e. the ResNet-50 from Keras that you used and the ResNet-50 from the TensorFlow benchmarks, is negligible. The batch size in both trainings is 32 per GPU, and I ran 100 iterations. Generally speaking, graph initialization finishes within the first few iterations, so I think 100 iterations are enough; since both trainings run 100 iterations, it is also fair from this point of view.
P.S. To illustrate the advantage of BytePS over Horovod and PS, I think it would be better for you to provide example scripts for the same models that can run with Horovod or PS.
@wangshuaizs Our example script is basically the same as Horovod's. https://github.com/horovod/horovod/blob/master/examples/tensorflow_synthetic_benchmark.py
We were always comparing with Horovod.
We do need to check the official examples from TF. Can you provide the exact commands you used to start the official PS, so that we can reproduce your results?
BTW, are you using GPUs on your PS, when you run the native PS example?
@bobzhuyb , no. Although there are two GPUs on my server nodes, I didn't use them in any of the trainings. That is, I only use the CPU for parameter aggregation.
Okay. So we have two asks:
• Would you try Horovod's example https://github.com/horovod/horovod/blob/master/examples/tensorflow_synthetic_benchmark.py, with 1 GPU x 2 workers, and 2 GPUs x 2 workers? (A command sketch follows this list.) If you are using NCCL 2.3.7, we would expect BytePS to match Horovod in the 1 GPU x 2 workers case, and to outperform Horovod in the 2 GPUs x 2 workers case.
• Would you share with us the commands you used to run the native PS?
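For the first ask, the command could look roughly like this (host names and the script path are placeholders; adjust them to your environment, and the script should default to ResNet50 with batch size 32):
$ # 1 GPU x 2 workers; for 2 GPUs x 2 workers use -np 4 -H worker-0:2,worker-1:2
$ mpirun -np 2 -H worker-0:1,worker-1:1 -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python tensorflow_synthetic_benchmark.py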
@bobzhuyb , I didn't try Horovod. The commands I used to run the native PS are as follows:
• server-0:
CUDA_VISIBLE_DEVICES='' python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=1 --num_batches=100 --num_warmup_batches=10 --variable_update=parameter_server --optimizer=sgd --batch_size=32 --model=resnet50 --server_protocol=grpc --ps_hosts="10.10.10.2:2222,10.10.10.3:2222" --worker_hosts="10.10.10.4:3333,10.10.10.6:3333" --job_name=ps --task_index=0
• server-1:
CUDA_VISIBLE_DEVICES='' python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=1 --num_batches=100 --num_warmup_batches=10 --variable_update=parameter_server --optimizer=sgd --batch_size=32 --model=resnet50 --server_protocol=grpc --ps_hosts="10.10.10.2:2222,10.10.10.3:2222" --worker_hosts="10.10.10.4:3333,10.10.10.6:3333" --job_name=ps --task_index=1
• worker-0:
CUDA_VISIBLE_DEVICES='0' python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=1 --num_batches=100 --num_warmup_batches=10 --variable_update=parameter_server --optimizer=sgd --batch_size=32 --model=resnet50 --server_protocol=grpc --ps_hosts="10.10.10.2:2222,10.10.10.3:2222" --worker_hosts="10.10.10.4:3333,10.10.10.6:3333" --job_name=worker --task_index=0
• worker-1:
CUDA_VISIBLE_DEVICES='0' python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=1 --num_batches=100 --num_warmup_batches=10 --variable_update=parameter_server --optimizer=sgd --batch_size=32 --model=resnet50 --server_protocol=grpc --ps_hosts="10.10.10.2:2222,10.10.10.3:2222" --worker_hosts="10.10.10.4:3333,10.10.10.6:3333" --job_name=worker --task_index=1
Thanks. What is the result of iperf -c 10.10.10.2 run from 10.10.10.4, and what about iperf -c 10.10.10.2 -P 4?
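(These tests assume an iperf server is already listening on 10.10.10.2, e.g.:)
$ iperf -s    # run on 10.10.10.2 before starting the client-side tests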
@bobzhuyb ,
$ iperf -c 10.10.10.2
------------------------------------------------------------
Client connecting to 10.10.10.2, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.10.10.4 port 46532 connected with 10.10.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 4.35 GBytes 3.73 Gbits/sec
$ iperf -c 10.10.10.2 -P 4
------------------------------------------------------------
Client connecting to 10.10.10.2, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 5] local 10.10.10.4 port 46540 connected with 10.10.10.2 port 5001
[ 4] local 10.10.10.4 port 46534 connected with 10.10.10.2 port 5001
[ 3] local 10.10.10.4 port 46536 connected with 10.10.10.2 port 5001
[ 6] local 10.10.10.4 port 46538 connected with 10.10.10.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 5] 0.0-10.0 sec 1.04 GBytes 890 Mbits/sec
[ 3] 0.0-10.0 sec 1.17 GBytes 1.01 Gbits/sec
[ 6] 0.0-10.0 sec 998 MBytes 837 Mbits/sec
[ 4] 0.0-10.0 sec 1.23 GBytes 1.05 Gbits/sec
[SUM] 0.0-10.0 sec 4.41 GBytes 3.78 Gbits/sec
@wangshuaizs wow, even four threads can only get 3.7Gbps? ....
I agree that even in this case, BytePS should be as good as native PS. But this bandwidth is absolutely abnormal... What about removing the rate limit on the NIC?
@bobzhuyb , I think I reserved too much switch buffer for RDMA traffic, so the TCP performance was bad. After restoring the normal setting, both one thread and 4 threads can achieve 9.34Gbps.
And I re-ran the previous experiment; both BytePS and native PS improved. For BytePS, the total training speed is 70.7 imgs/sec and 131.9 imgs/sec with 1-GPU workers and 2-GPU workers, respectively. In contrast, the training with native PS reaches 83.36 imgs/sec and 165.62 imgs/sec, respectively.
@wangshuaizs Based on the TensorFlow benchmarks, we did a Horovod-to-BytePS transformation. Could you please try this code to run BytePS: https://github.com/ymjiang/benchmarks/commit/79a41ffb08d8cc00828f8c8aad6dfc02b9071f70
And don't forget to set --variable_update=horovod to really use BytePS.
@ymjiang , thank you for your transformation. But how do I run BytePS? I tried the same command I use for the native PS, but replaced --variable_update=parameter_server with --variable_update=horovod. The output is
[2019-07-01 17:10:36.809049: F byteps/common/communicator.cc:65] Check failed: getenv("BYTEPS_LOCAL_RANK") error: env BYTEPS_LOCAL_RANK not set
Aborted (core dumped)
Even after I set this environment variable, some other environment variables are still not set. Would you mind uploading the commands you used?
@wangshuaizs Launch the script with our launcher. The command should be something like this:
python byteps/launcher/launch.py benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py YOUR_ARG
(The horovod ARG may be a bit different from the parameter_server case, please check.)
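The launcher also expects the usual DMLC_* environment for every role. A rough sketch for the worker side, with placeholder addresses, interface, and paths (worker 1 would be the same except DMLC_WORKER_ID=1):
$ DMLC_ROLE=worker DMLC_WORKER_ID=0 DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=2 \
    DMLC_PS_ROOT_URI=SCHEDULER_IP DMLC_PS_ROOT_PORT=9000 DMLC_INTERFACE=eth2 \
    python byteps/launcher/launch.py \
    python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --variable_update=horovod YOUR_OTHER_ARGS
$ # the scheduler and each server run launch.py with the same DMLC_* variables,
$ # but with DMLC_ROLE=scheduler or DMLC_ROLE=server and no training script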
We just want to make sure that BytePS and the "regular PS" both use TensorFlow, to exclude the impact of Keras.
Besides, just to confirm one more thing: in your iperf test, what is the average bandwidth if you set -P 1 and -P 2?
@wangshuaizs just fixed a bug: https://github.com/ymjiang/benchmarks/commit/00e07d21ce1998e1f9d66189c7b25d960baee704
Sorry about my mistakes.
@ymjiang , sorry for my late response. I ran your code, i.e. BytePS, and also Horovod. For BytePS there are still 2 servers and two 1-GPU workers, while for Horovod there are only two 1-GPU workers. Both experiments ran on physical machines, without Docker. The training speed of the former is 73.04 imgs/sec and of the latter 106.24 imgs/sec. In contrast, the training speed of native PS is 83.36 imgs/sec. Here are the detailed outputs of both experiments:
byteps:
$ NVIDIA_VISIBLE_DEVICES='0' DMLC_INTERFACE=eth2 DMLC_ROLE=worker DMLC_PS_ROOT_URI=12.12.11.12 DMLC_PS_ROOT_PORT=9000 DMLC_WORKER_ID=0 DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=2 python /home/shuai/test/byteps/launcher/launch.py python /home/shuai/test/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=1 --num_batches=100 --num_warmup_batches=10 --variable_update=horovod --optimizer=sgd --batch_size=32 --model=resnet50
BytePS launching worker
[00:07:24] src/customer.cc:363: Do not use thread pool for receiving.
[00:07:24] src/./zmq_van.h:285: Start ZMQ recv thread
[00:07:33] src/./zmq_van.h:285: Start ZMQ recv thread
[00:07:33] src/./zmq_van.h:285: Start ZMQ recv thread
[00:07:33] src/./zmq_van.h:285: Start ZMQ recv thread
TensorFlow: 1.10
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 64 global
32 per device
Num batches: 100
Num epochs: 0.00
Devices: ['horovod/gpu:0', 'horovod/gpu:1']
Data format: NCHW
Optimizer: sgd
Variables: horovod
==========
Generating model
W0702 00:07:39.405329 139738734077696 tf_logging.py:125] From /home/shuai/test/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1920: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-07-02 00:07:40.004248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla K40c major: 3 minor: 5 memoryClockRate(GHz): 0.745
pciBusID: 0000:83:00.0
totalMemory: 11.92GiB freeMemory: 11.80GiB
2019-07-02 00:07:40.004300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-07-02 00:07:40.004346: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-02 00:07:40.004355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2019-07-02 00:07:40.004361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2019-07-02 00:07:40.005003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11475 MB memory) -> physical GPU (device: 0, name: Tesla K40c, pci bus id: 0000:83:00.0, compute capability: 3.5)
I0702 00:07:40.771152 139738734077696 tf_logging.py:115] Running local_init_op.
I0702 00:07:41.056993 139738734077696 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
10 images/sec: 36.7 +/- 1.3 (jitter = 3.5) 8.152
20 images/sec: 37.1 +/- 0.8 (jitter = 3.4) 8.058
30 images/sec: 36.6 +/- 0.7 (jitter = 5.6) 8.231
40 images/sec: 36.5 +/- 0.6 (jitter = 4.4) 8.379
50 images/sec: 36.3 +/- 0.5 (jitter = 4.0) 8.180
60 images/sec: 36.3 +/- 0.5 (jitter = 3.6) 8.495
70 images/sec: 36.5 +/- 0.4 (jitter = 4.8) 8.132
80 images/sec: 36.6 +/- 0.4 (jitter = 5.6) 7.952
90 images/sec: 36.5 +/- 0.4 (jitter = 4.7) 8.245
100 images/sec: 36.5 +/- 0.4 (jitter = 4.5) 8.285
----------------------------------------------------------------
total images/sec: 73.04
----------------------------------------------------------------
horovod:
$ mpirun -np 2 -H 12.12.10.13:1,12.12.10.12:1 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x NCCL_IB_DISABLE=1 -x NCCL_SOCKET_IFNAME=eth2 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include eth2 python /home/shuai/test/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=1 --num_batches=100 --num_warmup_batches=10 --variable_update=horovod --optimizer=sgd --batch_size=32 --model=resnet50 --horovod_device=cpu
TensorFlow: 1.10
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 64 global
32 per device
Num batches: 100
Num epochs: 0.00
Devices: ['horovod/gpu:0', 'horovod/gpu:1']
Data format: NCHW
Optimizer: sgd
Variables: horovod
Horovod on: cpu
==========
TensorFlow: 1.10
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 64 global
32 per device
Num batches: 100
Num epochs: 0.00
Devices: ['horovod/gpu:0', 'horovod/gpu:1']
Data format: NCHW
Optimizer: sgd
Variables: horovod
Horovod on: cpu
==========
Generating model
Generating model
W0701 23:55:42.762805 139826748643072 tf_logging.py:125] From /home/shuai/test/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1920: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
W0701 23:55:42.841541 139757391841024 tf_logging.py:125] From /home/shuai/test/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1920: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-07-01 23:55:44.173697: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla K40c major: 3 minor: 5 memoryClockRate(GHz): 0.745
pciBusID: 0000:03:00.0
totalMemory: 11.92GiB freeMemory: 11.85GiB
2019-07-01 23:55:44.173743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-07-01 23:55:44.384144: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla K40c major: 3 minor: 5 memoryClockRate(GHz): 0.745
pciBusID: 0000:83:00.0
totalMemory: 11.92GiB freeMemory: 11.85GiB
2019-07-01 23:55:44.384198: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-07-01 23:55:44.547270: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-01 23:55:44.547319: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2019-07-01 23:55:44.547329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2019-07-01 23:55:44.548071: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11475 MB memory) -> physical GPU (device: 0, name: Tesla K40c, pci bus id: 0000:03:00.0, compute capability: 3.5)
2019-07-01 23:55:44.772633: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-01 23:55:44.772691: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2019-07-01 23:55:44.772700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2019-07-01 23:55:44.773454: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11475 MB memory) -> physical GPU (device: 0, name: Tesla K40c, pci bus id: 0000:83:00.0, compute capability: 3.5)
I0701 23:55:45.281400 139826748643072 tf_logging.py:115] Running local_init_op.
I0701 23:55:45.386076 139757391841024 tf_logging.py:115] Running local_init_op.
I0701 23:55:45.579767 139826748643072 tf_logging.py:115] Done running local_init_op.
I0701 23:55:45.689666 139757391841024 tf_logging.py:115] Done running local_init_op.
Running warm up
Running warm up
Done warm up
Step Img/sec total_loss
Done warm up
Step Img/sec total_loss
10 images/sec: 52.3 +/- 0.7 (jitter = 0.4) 8.502
10 images/sec: 52.3 +/- 0.7 (jitter = 0.1) 8.153
20 images/sec: 52.6 +/- 0.4 (jitter = 0.5) 8.054
20 images/sec: 52.6 +/- 0.4 (jitter = 0.2) 8.057
30 images/sec: 52.9 +/- 0.3 (jitter = 0.3) 8.350
30 images/sec: 52.9 +/- 0.3 (jitter = 0.1) 8.233
40 images/sec: 53.0 +/- 0.2 (jitter = 0.3) 8.282
40 images/sec: 53.0 +/- 0.2 (jitter = 0.2) 8.380
50 images/sec: 53.0 +/- 0.2 (jitter = 0.3) 8.158
50 images/sec: 53.0 +/- 0.2 (jitter = 0.2) 8.181
60 images/sec: 53.1 +/- 0.1 (jitter = 0.3) 8.254
60 images/sec: 53.1 +/- 0.1 (jitter = 0.2) 8.491
70 images/sec: 53.1 +/- 0.1 (jitter = 0.3) 8.572
70 images/sec: 53.1 +/- 0.1 (jitter = 0.2) 8.121
80 images/sec: 53.1 +/- 0.1 (jitter = 0.3) 8.253
80 images/sec: 53.1 +/- 0.1 (jitter = 0.2) 7.952
90 images/sec: 53.1 +/- 0.1 (jitter = 0.3) 8.116
90 images/sec: 53.1 +/- 0.1 (jitter = 0.2) 8.249
100 images/sec: 53.1 +/- 0.1 (jitter = 0.3) 8.064
----------------------------------------------------------------
total images/sec: 106.24
----------------------------------------------------------------
100 images/sec: 53.1 +/- 0.1 (jitter = 0.2) 8.288
----------------------------------------------------------------
total images/sec: 106.24
----------------------------------------------------------------
And the output of iperf is as follows:
$ iperf -c 12.12.10.13 -P 1
------------------------------------------------------------
Client connecting to 12.12.10.13, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 3] local 12.12.10.14 port 34844 connected with 12.12.10.13 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 10.4 GBytes 8.90 Gbits/sec
$ iperf -c 12.12.10.13 -P 2
------------------------------------------------------------
Client connecting to 12.12.10.13, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 3] local 12.12.10.14 port 34848 connected with 12.12.10.13 port 5001
[ 4] local 12.12.10.14 port 34846 connected with 12.12.10.13 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 5.27 GBytes 4.52 Gbits/sec
[ 4] 0.0-10.0 sec 4.59 GBytes 3.94 Gbits/sec
[SUM] 0.0-10.0 sec 9.86 GBytes 8.46 Gbits/sec
$ iperf -c 12.12.10.13 -P 4
------------------------------------------------------------
Client connecting to 12.12.10.13, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 5] local 12.12.10.14 port 34856 connected with 12.12.10.13 port 5001
[ 4] local 12.12.10.14 port 34850 connected with 12.12.10.13 port 5001
[ 3] local 12.12.10.14 port 34852 connected with 12.12.10.13 port 5001
[ 6] local 12.12.10.14 port 34854 connected with 12.12.10.13 port 5001
[ ID] Interval Transfer Bandwidth
[ 5] 0.0-10.0 sec 3.64 GBytes 3.13 Gbits/sec
[ 6] 0.0-10.0 sec 1.83 GBytes 1.57 Gbits/sec
[ 4] 0.0-10.0 sec 3.64 GBytes 3.12 Gbits/sec
[ 3] 0.0-10.0 sec 1.84 GBytes 1.58 Gbits/sec
[SUM] 0.0-10.0 sec 10.9 GBytes 9.40 Gbits/sec
Did you ever try to run the TF benchmarks in your cluster? If you did, what is the performance of BytePS/Horovod/PS? I think it would be better for you to run BytePS/Horovod/PS with the TF benchmark code in your cluster.
@wangshuaizs Thank you for the detailed traces. We are trying to reproduce them now.
@wangshuaizs I can't reproduce your results. Not sure what is going on here. We need more diagnosis.
Here is my setup: 2 VMs, each with one V100 GPU, as workers; 2 CPU-only VMs as servers. iperf -P 1 gets 13.7Gbps, iperf -P 4 gets 21.1Gbps.
pip-installed tensorflow-gpu 1.13.1 and horovod 0.16.4; BytePS master branch. Benchmark code: https://github.com/bobzhuyb/benchmarks/tree/byteps_1.13 (for BytePS) and https://github.com/bobzhuyb/benchmarks/tree/cnn_tf_v1.13_compatible (for Horovod).
I used the same tf_cnn_benchmarks.py arguments as yours (see below). BytePS gets 389 images/sec, Horovod gets 298.29 images/sec. By the way, native PS gets the same speed (389 images/sec) as BytePS, which is kind of expected in this simple setup.
BytePS (the DMLC_* env is configured the same as yours):
python /opt/tiger/byteps/launcher/launch.py python /root/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=1 --num_batches=100 --num_warmup_batches=10 --variable_update=horovod --optimizer=sgd --batch_size=32 --model=resnet50
Done warm up
Step Img/sec total_loss
1 images/sec: 196.7 +/- 0.0 (jitter = 0.0) 8.170
10 images/sec: 188.1 +/- 2.3 (jitter = 5.7) 7.592
20 images/sec: 185.1 +/- 1.8 (jitter = 3.3) 7.694
30 images/sec: 188.4 +/- 2.1 (jitter = 4.4) 7.757
40 images/sec: 191.6 +/- 2.0 (jitter = 10.4) 7.999
50 images/sec: 194.3 +/- 1.8 (jitter = 16.1) 7.516
60 images/sec: 192.3 +/- 2.2 (jitter = 13.7) 7.982
70 images/sec: 193.5 +/- 1.9 (jitter = 10.5) 8.019
80 images/sec: 194.5 +/- 1.7 (jitter = 9.1) 7.931
90 images/sec: 195.9 +/- 1.6 (jitter = 9.4) 7.846
100 images/sec: 194.7 +/- 1.7 (jitter = 8.4) 7.765
----------------------------------------------------------------
total images/sec: 389.24
----------------------------------------------------------------
Horovod (note: your command does not use NCCL, so I don't use NCCL below either):
mpirun --allow-run-as-root -np 2 -H worker-0:1,worker-1:1 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x NCCL_IB_DISABLE=1 -x NCCL_SOCKET_IFNAME=eth0 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include eth0 python /root/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=1 --num_batches=100 --num_warmup_batches=10 --variable_update=horovod --optimizer=sgd --batch_size=32 --model=resnet50 --horovod_device=cpu
Done warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 146.7 +/- 0.0 (jitter = 0.0) 8.170
1 images/sec: 146.3 +/- 0.0 (jitter = 0.0) 7.906
10 images/sec: 149.1 +/- 0.6 (jitter = 1.0) 8.035
10 images/sec: 149.1 +/- 0.7 (jitter = 1.3) 7.591
20 images/sec: 149.4 +/- 0.4 (jitter = 1.4) 7.786
20 images/sec: 149.3 +/- 0.4 (jitter = 1.1) 7.694
30 images/sec: 149.2 +/- 0.4 (jitter = 1.6) 7.583
30 images/sec: 149.2 +/- 0.4 (jitter = 1.3) 7.758
40 images/sec: 149.7 +/- 0.3 (jitter = 1.3) 8.323
40 images/sec: 149.7 +/- 0.3 (jitter = 1.1) 8.004
50 images/sec: 150.0 +/- 0.3 (jitter = 1.7) 8.066
50 images/sec: 150.0 +/- 0.3 (jitter = 1.4) 7.515
60 images/sec: 150.1 +/- 0.3 (jitter = 1.7) 8.024
60 images/sec: 150.1 +/- 0.3 (jitter = 1.3) 7.981
70 images/sec: 150.2 +/- 0.2 (jitter = 1.8) 8.079
70 images/sec: 150.2 +/- 0.2 (jitter = 1.5) 8.011
80 images/sec: 148.8 +/- 0.9 (jitter = 1.5) 7.935
80 images/sec: 148.8 +/- 0.9 (jitter = 1.5) 7.905
90 images/sec: 149.1 +/- 0.8 (jitter = 1.4) 7.852
90 images/sec: 149.1 +/- 0.8 (jitter = 1.6) 7.800
100 images/sec: 149.2 +/- 0.7 (jitter = 1.7) 7.973
----------------------------------------------------------------
total images/sec: 298.29
----------------------------------------------------------------
100 images/sec: 149.2 +/- 0.7 (jitter = 1.4) 7.801
----------------------------------------------------------------
total images/sec: 298.29
----------------------------------------------------------------
@bobzhuyb , thank you for reproducing the experiment. I will dig into the problem in my environment.
By the way, what is the training speed of single-GPU training (i.e., without distributed mode) on the V100?
@wangshuaizs I tried the Horovod command with -np 1; it runs at 301 images/sec.
@wangshuaizs
If you still have performance problems, there are a few things you can try; if any of the following works for you, please let us know. Though the following environment variables start with MXNET, they apply to workers of any framework (TF/MXNet/PyTorch), because the parameter server is based on MXNet.
• For the parameter servers, set export MXNET_OMP_MAX_THREADS=10 if you have 16 CPU cores per server; set export MXNET_OMP_MAX_THREADS=4 if you only have 8 CPU cores.
• Set export MXNET_CPU_WORKER_NTHREADS=32. This may speed up the parameter server.
• Start more parameter server instances. For example, when you have two physical machines to run the servers, you can start 4 (DMLC_NUM_SERVER=4), i.e. two server instances per physical machine. This increases network bandwidth utilization, especially when a single TCP flow cannot saturate your bandwidth. (A combined sketch for one server machine follows this list.)
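A combined sketch for one of two server machines, assuming 16 CPU cores per server and 4 server instances in total; the address, port, and thread counts are illustrative:
$ export MXNET_OMP_MAX_THREADS=10          # ~10 for 16 cores, ~4 for 8 cores
$ export MXNET_CPU_WORKER_NTHREADS=32
$ export DMLC_ROLE=server DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=4
$ export DMLC_PS_ROOT_URI=SCHEDULER_IP DMLC_PS_ROOT_PORT=9000
$ python byteps/launcher/launch.py &       # first server instance on this machine
$ python byteps/launcher/launch.py &       # second server instance on this machine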
Hi,
I ran an experiment according to the guide in https://github.com/bytedance/byteps/blob/master/docs/best-practice.md#multi-machine-distributed-mode. I have 4 machines, each with one ConnectX-3 NIC and two K40c GPUs. I set the NIC speed to 10Gbps and I use TCP only, instead of RoCE. Two machines are workers and the other two are parameter servers; the scheduler runs on server-0. Only the workers use GPUs (1 or 2). I use the Docker images you provide to run the experiment.
For single-GPU training (without distributed mode), the training speed is 54 imgs/sec. But for distributed training, the total training speed is 51.2 imgs/sec and 103.1 imgs/sec with 1-GPU workers and 2-GPU workers, respectively. In contrast, the training with the regular PS architecture reaches 63.08 imgs/sec and 121.51 imgs/sec, respectively. Why is BytePS worse than PS? Is there something wrong with my experiment?
The following are the specific commands:
scheduler:
server-0 and server-1:
worker-0:
worker-1:
And the interconnect topology of the GPUs and the NIC is as follows:
worker-0:
worker-1: