NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

The results of nccl-tests of different nccl versions are quite different #409

Open Richie-yan opened 4 years ago

Richie-yan commented 4 years ago

Background: On two machines, each with 8 V100 32 GB GPUs, I ran nccl-tests and found that the reported busbw differs depending on the NCCL version used.

I ran the following command:

mpirun -np 16 --hostfile hostfile -bind-to none -map-by slot --display-map --mca pml ob1 --mca btl_vader_single_copy_mechanism none --mca btl_openib_cpc_include rdmacm --mca btl_openib_rroce_enable 1 --mca btl_tcp_if_exclude lo,docker0 --mca orte_base_help_aggregate 0 --mca btl_openib_receive_queues P,256,256::S,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,131072,1024,1008,64 --mca btl openib,self,vader -x NCCL_SOCKET_IFNAME=^lo,docker0 -x NCCL_IB_DISABLE=0 -x LD_LIBRARY_PATH -x NCCL_ALGO=RING -x NCCL_DEBUG_SUBSYS=INIT,P2P,GRAPH -x NCCL_DEBUG=INFO -x NCCL_IB_HCA=mlx5_0:1 -x NCCL_IB_GID_INDEX=3 -x NCCL_NET_GDR_LEVEL=0 -x NCCL_GRAPH_FILE=./graph.txt ./all_reduce_perf -b 8 -e 128M -f 2

The command sets NCCL_ALGO=RING to select the NCCL ring algorithm, and NCCL_GRAPH_FILE is intended to keep the topology consistent between runs. The nccl-tests results are as follows:

[Screenshot 2020-10-21, 3:16:19 PM] [Screenshot 2020-10-21, 3:19:44 PM]

With the same machines and the same topology, why do the two NCCL versions produce different nccl-tests results? I would appreciate any help understanding this.

sjeaugey commented 4 years ago

I don't know what ring is defined in graph.txt, but NCCL 2.5 does not take it into account (while 2.7 does). That feature appeared in 2.6.
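To make the version cutoff above concrete, here is a minimal sketch of a check that could be run before comparing results. It assumes only what is stated in this thread (NCCL_GRAPH_FILE is taken into account starting with 2.6). The helper names `parse_nccl_version` and `honors_graph_file` are hypothetical and not part of NCCL or nccl-tests, and the version string is assumed to be supplied by hand, for example copied from the NCCL_DEBUG=INFO output.

```python
def parse_nccl_version(version):
    """Turn a version string such as '2.7' or '2.6.4-1' into (major, minor, patch).

    Hypothetical helper: any build suffix after '-' is ignored, and a
    missing patch component is treated as 0.
    """
    core = version.strip().split("-")[0]
    parts = core.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    patch = int(parts[2]) if len(parts) > 2 else 0
    return (major, minor, patch)


def honors_graph_file(version):
    """Return True if this NCCL version is expected to use NCCL_GRAPH_FILE.

    Assumption taken from this thread: the feature appeared in 2.6,
    so 2.5 ignores the variable while 2.7 uses it.
    """
    major, minor, _ = parse_nccl_version(version)
    return (major, minor) >= (2, 6)


if __name__ == "__main__":
    for v in ("2.5", "2.7"):
        state = "used" if honors_graph_file(v) else "ignored"
        print(f"NCCL {v}: NCCL_GRAPH_FILE is {state}")
```

If the check returns False for one of the versions being compared, the two runs are not using the same graph, so a difference in busbw does not by itself indicate a regression.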