NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

The back-to-back (without going through a switch) results on 2 machines / 16 GPUs are quite strange #538

Open wuyujiji opened 3 years ago

wuyujiji commented 3 years ago

Excuse me! Recently I wanted to run nccl-tests back-to-back (without going through a switch, and with GDR disabled) on 2 machines with 16 GPUs, and I ran into a very strange phenomenon. Details as follows:

NCCL version: v2.7.8-1
CUDA version: 10.0 or 11.0
Machine: PCIe-V100

1. The topology of the two machines (both PCIe-V100) is the same, as shown below:

[image: PCIe topology of the two machines]

2. Run nccl-tests using NCCL's default ring-building method

command as follows:

mpirun --allow-run-as-root --hostfile eth1_hostfile \
--prefix /usr/local/ompi -bind-to none \
-map-by slot \
--display-map --tag-output --timestamp-output \
--mca pml ob1 --mca btl_vader_single_copy_mechanism none --mca btl_openib_cpc_include rdmacm --mca btl_openib_rroce_enable 1 --mca btl_tcp_if_exclude lo,docker0 --mca orte_base_help_aggregate 0 --mca btl_openib_receive_queues P,256,256::S,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,131072,1024,1008,64 \
--mca btl tcp,self,vader \
-x NCCL_DEBUG=INFO -x NCCL_IB_HCA=mlx5_1:1 -x NCCL_SOCKET_IFNAME=eth1 -x NCCL_NET_GDR_LEVEL=0 -x NCCL_NET_GDR_READ=0 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_DISABLE=0 -x HOROVOD_MPI_THREADS_DISABLE=1 -x PATH -x PYTHONPATH -x LD_LIBRARY_PATH \
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1
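
(For reference, eth1_hostfile is a standard Open MPI hostfile; a minimal sketch with hypothetical hostnames, assuming 8 MPI slots per machine, would look like:)

# eth1_hostfile -- hypothetical hostnames, 8 ranks per node
node-a slots=8
node-b slots=8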

The channel order reported by NCCL INFO is: [image: NCCL INFO channel output]

The bandwidth is only about 1.8 GB/s, as shown below: [image: all_reduce_perf results]
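
(As a note on reading the numbers, based on the nccl-tests documentation: all_reduce_perf reports both algbw and busbw; for allreduce across N ranks, busbw = algbw * 2*(N-1)/N, so with 16 GPUs the factor is 2*15/16 = 1.875.)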

3. Run nccl-tests using a self-defined ring-building method

To verify whether this is caused by the default ring building, I wrote a custom graph.txt and changed the channel reported by NCCL INFO (the GPU order was 4->5->6->7->3->2->1->0 before and is 3->2->1->0->4->5->6->7 now):

command:

mpirun --allow-run-as-root --hostfile eth1_hostfile \
--prefix /usr/local/ompi -bind-to none \
-map-by slot \
--display-map --tag-output --timestamp-output \
--mca pml ob1 --mca btl_vader_single_copy_mechanism none --mca btl_openib_cpc_include rdmacm --mca btl_openib_rroce_enable 1 --mca btl_tcp_if_exclude lo,docker0 --mca orte_base_help_aggregate 0 --mca btl_openib_receive_queues P,256,256::S,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,131072,1024,1008,64 \
--mca btl tcp,self,vader \
-x NCCL_ALGO=Ring -x NCCL_DEBUG_SUBSYS=INIT,P2P,GRAPH,ENV -x NCCL_GRAPH_FILE=./graph.txt -x NCCL_DEBUG=INFO -x NCCL_IB_HCA=mlx5_1:1 -x NCCL_SOCKET_IFNAME=eth1 -x NCCL_NET_GDR_LEVEL=0 -x NCCL_NET_GDR_READ=0 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_DISABLE=0 -x HOROVOD_MPI_THREADS_DISABLE=1 -x PATH -x PYTHONPATH -x LD_LIBRARY_PATH \
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1

graph.txt:

<graphs version="1">
  <graph id="0" pattern="4" crossnic="0" nchannels="1" speedintra="9" speedinter="9" typeintra="SYS" typeinter="SYS" samechannels="1">
    <channel>
      <net dev="0"/>
      <gpu dev="3"/>
      <gpu dev="2"/>
      <gpu dev="1"/>
      <gpu dev="0"/>
      <gpu dev="4"/>
      <gpu dev="5"/>
      <gpu dev="6"/>
      <gpu dev="7"/>
      <net dev="0"/>
    </channel>
  </graph>
</graphs>

The NCCL channel with the self-defined ring is: [image: NCCL INFO channel output]

However, the result now reaches about 5 GB/s, and I do not know why. [image: all_reduce_perf results]
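
(To double-check which ring NCCL actually built, I look at the channel lines in the NCCL_DEBUG=INFO output; a sketch, assuming the run's output is captured to a file and the usual Channel/Ring lines appear:)

# capture the run's output, then pull out the channel/ring lines printed by NCCL_DEBUG=INFO
mpirun ... (same command as above) ... 2>&1 | tee all_reduce_custom.log
grep -E "Channel|Ring" all_reduce_custom.log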

4. Re-verify the default ring order by building it explicitly with a self-defined graph.txt

I wondered whether the strange difference was caused by the extra parameters (-x NCCL_ALGO=Ring -x NCCL_DEBUG_SUBSYS=INIT,P2P,GRAPH,ENV -x NCCL_GRAPH_FILE=./graph.txt). Therefore, I used the self-defined method to build the default channel, i.e. 4->5->6->7->3->2->1->0. However, the result is the same as before, only about 1.8 GB/s.

graph.txt:

<graphs version="1">
  <graph id="0" pattern="4" crossnic="0" nchannels="1" speedintra="9" speedinter="9" typeintra="SYS" typeinter="SYS" samechannels="1">
    <channel>
      <net dev="0"/>
      <gpu dev="4"/>
      <gpu dev="5"/>
      <gpu dev="6"/>
      <gpu dev="7"/>
      <gpu dev="3"/>
      <gpu dev="2"/>
      <gpu dev="1"/>
      <gpu dev="0"/>
      <net dev="0"/>
    </channel>
  </graph>
</graphs>

The channel is: [image: NCCL INFO channel output]

sjeaugey commented 3 years ago

Why did you set NCCL_NET_GDR_LEVEL=0? That disables GPU Direct RDMA and negatively affects performance.

Other than that, I think we've had reports of rings being suboptimal when the NIC was close to GPUs 4-5-6-7 rather than 0-1-2-3. You could probably set CUDA_VISIBLE_DEVICES=4,5,6,7,0,1,2,3 to solve the problem, but it's not impossible NCCL chooses a suboptimal order otherwise. That said, performance should be close to 7-8 GB/s, not 1-2 GB/s; the low number is probably because GPU Direct RDMA is disabled.
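
For example, a sketch based on the command above, only adding the variable through mpirun's existing -x mechanism:

mpirun <same options as in the original command> \
  -x CUDA_VISIBLE_DEVICES=4,5,6,7,0,1,2,3 -x NCCL_DEBUG=INFO \
  ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1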

wuyujiji commented 3 years ago

Thanks for your quick reply! The result with GDR enabled is about 8 GB/s. The reason for setting NCCL_NET_GDR_LEVEL=0 is that we have scenarios where GDR cannot be turned on. The inter-machine communication is similar to the PS architecture, so we want to use NCCL to measure the performance of the cluster's RDMA network.

When I set CUDA_VISIBLE_DEVICES=4,5,6,7,0,1,2,3, the result is still only 1.7~1.8 GB/s.

wuyujiji commented 3 years ago

Actually, we want to know why the impact is so big when changing the ring from 4->5->6->7->3->2->1->0 to 3->2->1->0->4->5->6->7.

wuyujiji commented 3 years ago

@sjeaugey Looking forward to your answer~

sjeaugey commented 3 years ago

Changing the ring means flows will be in a different direction. NIC->local memory .... distant memory->NIC versus NIC->distant memory ... local memory -> NIC. I guess that may explain the performance difference. Now this is not a case we often optimize for, hence we don't have much experience with this. We assume if there is a PCI switch, users will want to use GDR as it would give much higher performance (as is the case here).

And I'm not sure I get how using NCCL with GDR disabled will measure the performance of the cluster RDMA network; it seems to me the bottleneck would be on the CPU side rather than on the NIC.
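
(If the goal is to measure the raw NIC-to-NIC RDMA bandwidth itself, a sketch of an alternative using ib_write_bw from the perftest suite, assuming it is installed and using mlx5_1 / GID index 3 as in the commands above:)

# server node
ib_write_bw -d mlx5_1 -x 3 --report_gbits -a
# client node (node-a is a hypothetical hostname for the server)
ib_write_bw -d mlx5_1 -x 3 --report_gbits -a node-a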