gim4moon opened this issue 6 days ago
We need more info. What are the GPUs? What is the interconnect? The output of `nvidia-smi` and `nvidia-smi topo -m` from one of the nodes would be nice, as would a dump of the topology detected by NCCL. Can you include the NCCL debug output (from just one of the ranks, please! 😃), especially since you collect it already? It might be worth adding `TUNING` to the list of subsystems to debug...
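
For reference, a minimal sketch of how those diagnostics could be collected on one node (the test binary path is copied from the command later in this thread; the debug-file path is only an illustration):

```
# Hardware and PCIe/NVLink topology from one node
nvidia-smi
nvidia-smi topo -m

# NCCL debug output with TUNING added to the subsystem list.
# NCCL_DEBUG_FILE splits the log per process (%h = hostname, %p = pid),
# which makes it easy to attach the output of a single rank.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH,NET,TUNING
export NCCL_DEBUG_FILE=/tmp/nccl-debug.%h.%p   # illustrative path
/nccl/nccl-tests/build/all_reduce_perf -b 512 -e 8G -f 2 -g 8
```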
The node is a Dell XE9680.
Each node has 8x H100 GPUs.
For InfiniBand, each node has 4x ConnectX-7 VPI cards (mlx5_0:1, mlx5_1:1, mlx5_2:1, mlx5_3:1), plus 2x 200G Ethernet cards in a bonding configuration.
In the topology, GPU-to-GPU links are NV18 (NVLink) and GPU-to-NIC connections are PIX.
I'm sorry, I can't provide the original `nvidia-smi` and topo output!
I appreciate any help you can give.
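
Since the original topo output isn't available, the NIC layout can be re-checked on any node with standard tools; a minimal sketch (device names taken from the list above):

```
# Show each Mellanox HCA's state and whether the port runs IB or Ethernet
ibstat | grep -E "CA '|State:|Link layer"

# Map an IB device to its PCIe address, for comparison with `nvidia-smi topo -m`
ls -l /sys/class/infiniband/mlx5_0/device
```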
Hello,

We are currently supporting nccl-tests for a client company, and we run it with the script below:
```
mpirun -np 300 -N 1 -x NCCL_DEBUG=INFO --hostfile /nccl/hostfile \
    -mca plm_rsh_no_tree_spawn 1 -mca plm_rsh_num_concurrent 512 \
    --bind-to none -mca btl tcp,self -mca coll_hcoll_enable 0 \
    -x NCCL_SOCKET_IFNAME=bond0 \
    -x NCCL_IB_AR_THRESHOLD=0 -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
    -x NCCL_IB_SPLIT_DATA_ON_QPS=0 -x NCCL_IB_QPS_PER_CONNECTION=2 \
    -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
    -x PATH -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
    -x NCCL_NET_GDR_READ=1 -x NCCL_IGNORE_CPU_AFFINITY=1 \
    -x NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH,NET \
    /nccl/nccl-tests/build/all_reduce_perf -b 512 -e 8G -f 2 -g 8
```
The max busbw is only 14 GB/s.
Is there something wrong with the command? Please help.
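
As a first check against the hardware described above (four ConnectX-7 HCAs plus bonded Ethernet), the debug output already being collected can confirm which NICs NCCL actually selected; a minimal sketch, assuming one rank's log was saved to a file:

```
# NCCL's INFO output includes a line like "NET/IB : Using [0]mlx5_0:1/IB ..."
# listing the HCAs it selected; the log filename below is only an assumption.
grep "NET/IB" /tmp/nccl-debug.log | head -n 4
```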