bytedance / byteps

A high performance and generic framework for distributed DNN training

On two 8-GPU nodes, BytePS runs slower than Horovod 0.19.0, so how to tune params? #213

Closed ghost closed 4 years ago

ghost commented 4 years ago

Describe the bug

Testing with two nodes, BytePS runs slower than Horovod, so I think there are some settings I have not configured for best performance. Which environment variables should I tune to get the highest images-per-second throughput?

many thanks~

To Reproduce
Steps to reproduce the behavior:

  1. Hardware: two nodes, each with 8x V100, 100Gb RDMA (RoCE v2) over a Mellanox NIC.
  2. Software: CentOS 7.7, CUDA 10.0, cuDNN 7.6, Python 3, tensorflow_gpu==1.15.2, GCC 7.5, NCCL 2.5.6 (official download).
  3. Sample code: https://github.com/horovod/horovod/blob/master/examples/tensorflow_mnist.py, with the model changed to ResNet-50 and the dataset changed to ImageNet.

  4. BytePS configuration: each node deploys one server and one worker, with:
     DMLC_ENABLE_RDMA=1, BYTEPS_ENABLE_MIXED_MODE=1, ENABLE_RDMA_LOG=0, BYTEPS_ENABLE_IPC=1, BYTEPS_PRINT_RDMA_LOG=0, PS_VERBOSE=0, BYTEPS_SERVER_ENGINE_THREAD=12, DMLC_NUM_WORKER=2, DMLC_NUM_SERVER=2
  5. Horovod configuration: NCCL allreduce and broadcast, with fusion_threshold_mb=16 and cycle_time_ms=10.
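For reference, the settings above could be exported like this before each BytePS process starts (a minimal sketch, not part of the original report; role and scheduler variables such as DMLC_ROLE and DMLC_PS_ROOT_URI still have to be set per process and are omitted here):

```python
import os

# The reporter's BytePS settings as environment variables. They must be set
# in every worker/server process before BytePS is initialized.
byteps_env = {
    "DMLC_ENABLE_RDMA": "1",
    "BYTEPS_ENABLE_MIXED_MODE": "1",
    "ENABLE_RDMA_LOG": "0",
    "BYTEPS_ENABLE_IPC": "1",
    "BYTEPS_PRINT_RDMA_LOG": "0",
    "PS_VERBOSE": "0",
    "BYTEPS_SERVER_ENGINE_THREAD": "12",
    "DMLC_NUM_WORKER": "2",
    "DMLC_NUM_SERVER": "2",
}
os.environ.update(byteps_env)
```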

Expected behavior

The BytePS result:
INFO:tensorflow:loss = 6.908525, step = 0
INFO:tensorflow:loss = 6.409476, step = 10 (26.834 sec)
INFO:tensorflow:loss = 6.409476, step = 20 (2.103 sec)
INFO:tensorflow:loss = 6.409476, step = 30 (2.140 sec)
INFO:tensorflow:loss = 6.7844763, step = 40 (2.119 sec)
INFO:tensorflow:loss = 6.7688513, step = 50 (2.128 sec)

The Horovod 0.19.0 result:
INFO:tensorflow:loss = 6.908415, step = 0
INFO:tensorflow:loss = 6.409476, step = 10 (25.748 sec)
INFO:tensorflow:loss = 6.409476, step = 20 (1.737 sec)
INFO:tensorflow:loss = 6.409476, step = 30 (1.735 sec)
INFO:tensorflow:loss = 6.7844763, step = 40 (1.723 sec)
INFO:tensorflow:loss = 6.7688513, step = 50 (1.729 sec)




ymjiang commented 4 years ago

Do your machines have NVLink? Can you show the output of nvidia-smi topo -m?

Another quick test is to tune BYTEPS_PARTITION_BYTES; the default value is 4096000. Try setting it higher or lower.
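If it helps, here is a hypothetical wrapper (not from this thread) for trying a few values; TRAIN_CMD is a placeholder for however you normally launch the BytePS worker:

```python
import os
import subprocess

# Placeholder for the usual BytePS worker launch command.
TRAIN_CMD = ["python3", "tensorflow_mnist.py"]

# 4096000 is the BytePS default; try both smaller and larger partitions
# and compare the resulting images/sec.
for partition_bytes in (1024000, 2048000, 4096000, 8192000, 16384000):
    env = dict(os.environ, BYTEPS_PARTITION_BYTES=str(partition_bytes))
    print("Running with BYTEPS_PARTITION_BYTES =", partition_bytes)
    subprocess.run(TRAIN_CMD, env=env, check=True)
```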

BTW, Horovod is at its best relative performance when using just two machines, due to hierarchical allreduce. So you should see BytePS gain more over Horovod as you use more machines.

bobzhuyb commented 4 years ago

nvidia-smi topo -m output would be very helpful.

How many NICs do you have? Just one 100G port?

With only two machines and no additional CPU machines, BytePS would be equivalent to NCCL tree-based topology.

ghost commented 4 years ago

The topology is just like DGX-1, with NVLink and only one Mellanox NIC. I will try NCCL ring to see the performance.

many thanks.

bobzhuyb commented 4 years ago
  1. What performance result did you get from the ps-lite benchmark? We are expecting 85~90Gbps.

  2. Did you enable GPUDirect RDMA for Horovod/NCCL?

  3. Does your Mellanox NIC share a PCIe switch with any of your GPUs? If your topology is like DGX-1, then the answer is yes; see Figure 4 in https://devblogs.nvidia.com/dgx-1-fastest-deep-learning-system/. Each NIC connects to one PCIe switch together with two GPUs.

If so, you need to configure BYTEPS_REDUCE_ROOTS. For example, if your NIC shares a PCIe switch with GPUs 0 and 1, we recommend setting BYTEPS_REDUCE_ROOTS="2,3" (GPUs on the same NUMA node but a different PCIe switch) in order to avoid contention on the PCIe switch.
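As an illustration only (assuming the NIC shares its PCIe switch with GPUs 0 and 1, as on DGX-1; adjust the IDs to whatever nvidia-smi topo -m shows):

```python
import os

# Pick reduce roots on the same NUMA node as the NIC but under the other
# PCIe switch, so reduction traffic does not contend with NIC traffic.
# This must be set in each worker's environment before BytePS initializes.
os.environ["BYTEPS_REDUCE_ROOTS"] = "2,3"
```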