Closed: bltcn closed this issue 9 months ago
It looks like your container isn't configured to use the InfiniBand interfaces. It also doesn't have enough shared memory, which is what causes the bus error.
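For illustration, here is a rough sketch of a Docker invocation that addresses both points. The command is only assembled and printed, not executed. The image name `nccl-tests:local` is a placeholder, `16g` is the value reported to work later in this thread, and the `--device`, `--cap-add`, and `--ulimit` flags are a common way to expose InfiniBand devices to a container, assumed here rather than confirmed for this machine. `--gpus all` assumes the NVIDIA Container Toolkit is installed on the host.

```shell
# Placeholder image name (hypothetical); override via the IMAGE variable.
IMAGE="${IMAGE:-nccl-tests:local}"
# Shared memory size for /dev/shm; Docker's default is 64 MB.
SHM_SIZE="${SHM_SIZE:-16g}"

# --shm-size              enlarges /dev/shm, which NCCL uses between ranks
# --device=/dev/infiniband exposes the host's InfiniBand device nodes (assumed setup)
# --cap-add=IPC_LOCK and --ulimit memlock=-1:-1 allow pinning memory for RDMA (assumed setup)
DOCKER_CMD="docker run --rm -it --gpus all --shm-size=$SHM_SIZE --device=/dev/infiniband --cap-add=IPC_LOCK --ulimit memlock=-1:-1 $IMAGE"

# Print the command so it can be reviewed before running it by hand.
echo "$DOCKER_CMD"
```

If only the bus error matters and InfiniBand is not needed, the `--shm-size` flag alone may be enough.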
Thanks, I will try.
Using --shm-size 16g solved this problem. Thanks.
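To confirm the setting took effect, something like the following could be run inside the container. This is a minimal sketch: the helper name `shm_kib` is made up for this example, and checking for `/dev/infiniband` only shows that the device directory is visible, not that NCCL can open the devices.

```shell
# Hypothetical helper: print the size of /dev/shm in KiB (0 if df fails).
shm_kib() {
  df -Pk /dev/shm 2>/dev/null | awk 'NR==2 {print $2; found=1} END {if (!found) print 0}'
}

SHM_KIB=$(shm_kib)
echo "/dev/shm size: ${SHM_KIB} KiB"

# With --shm-size 16g this should be about 16777216 KiB;
# Docker's default of 64 MB shows up as 65536 KiB.
if [ "$SHM_KIB" -lt 1048576 ]; then
  echo "warning: /dev/shm is smaller than 1 GiB"
fi

# Check whether the InfiniBand device directory is visible in the container.
if [ -d /dev/infiniband ]; then
  echo "/dev/infiniband is present"
else
  echo "/dev/infiniband is missing"
fi
```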
nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    NIC2    NIC3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    SYS     SYS     NODE    NODE    SYS     SYS     0-19,40-59      0               N/A
GPU1    NODE     X      SYS     SYS     NODE    NODE    SYS     SYS     0-19,40-59      0               N/A
GPU2    SYS     SYS      X      NODE    SYS     SYS     NODE    NODE    20-39,60-79     1               N/A
GPU3    SYS     SYS     NODE     X      SYS     SYS     NODE    NODE    20-39,60-79     1               N/A
NIC0    NODE    NODE    SYS     SYS      X      PIX     SYS     SYS
NIC1    NODE    NODE    SYS     SYS     PIX      X      SYS     SYS
NIC2    SYS     SYS     NODE    NODE    SYS     SYS      X      PIX
NIC3    SYS     SYS     NODE    NODE    SYS     SYS     PIX      X
Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
NIC Legend:
  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
NCCL_DEBUG=INFO NCCL_P2P_DIRECT_DISABLE=1 ./build/all_reduce_perf -b 8 -e 128m -f 2 -g 4
nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
Using devices
Rank 0 Group 0 Pid 269 on 0d1b18e60c40 device 0 [0x31] NVIDIA A10
Rank 1 Group 0 Pid 269 on 0d1b18e60c40 device 1 [0x4b] NVIDIA A10
Rank 2 Group 0 Pid 269 on 0d1b18e60c40 device 2 [0x98] NVIDIA A10
Rank 3 Group 0 Pid 269 on 0d1b18e60c40 device 3 [0xb1] NVIDIA A10
0d1b18e60c40:269:269 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.5<0>
0d1b18e60c40:269:269 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
0d1b18e60c40:269:269 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
0d1b18e60c40:269:269 [3] NCCL INFO cudaDriverVersion 12020
NCCL version 2.18.5+cuda12.2
0d1b18e60c40:269:277 [0] misc/ibvwrap.cc:94 NCCL WARN Call to ibv_open_device failed
0d1b18e60c40:269:277 [0] transport/net_ib.cc:193 NCCL WARN NET/IB : Unable to open device mlx5_0
0d1b18e60c40:269:277 [0] misc/ibvwrap.cc:94 NCCL WARN Call to ibv_open_device failed
0d1b18e60c40:269:277 [0] transport/net_ib.cc:193 NCCL WARN NET/IB : Unable to open device mlx5_1
0d1b18e60c40:269:277 [0] misc/ibvwrap.cc:94 NCCL WARN Call to ibv_open_device failed
0d1b18e60c40:269:277 [0] transport/net_ib.cc:193 NCCL WARN NET/IB : Unable to open device mlx5_2
0d1b18e60c40:269:277 [0] misc/ibvwrap.cc:94 NCCL WARN Call to ibv_open_device failed
0d1b18e60c40:269:277 [0] transport/net_ib.cc:193 NCCL WARN NET/IB : Unable to open device mlx5_3
0d1b18e60c40:269:277 [0] NCCL INFO NET/IB : No device found.
0d1b18e60c40:269:277 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.5<0>
0d1b18e60c40:269:277 [0] NCCL INFO Using network Socket
0d1b18e60c40:269:280 [3] NCCL INFO Using network Socket
0d1b18e60c40:269:278 [1] NCCL INFO Using network Socket
0d1b18e60c40:269:279 [2] NCCL INFO Using network Socket
0d1b18e60c40:269:279 [2] NCCL INFO comm 0x56343d7c46e0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 98000 commId 0xc03500808602777f - Init START
0d1b18e60c40:269:278 [1] NCCL INFO comm 0x56343d7c0190 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 4b000 commId 0xc03500808602777f - Init START
0d1b18e60c40:269:277 [0] NCCL INFO comm 0x56343d7b9f20 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 31000 commId 0xc03500808602777f - Init START
0d1b18e60c40:269:280 [3] NCCL INFO comm 0x56343d7c8bc0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId b1000 commId 0xc03500808602777f - Init START
0d1b18e60c40:269:280 [3] NCCL INFO Setting affinity for GPU 3 to ffff,f00000ff,fff00000
0d1b18e60c40:269:280 [3] NCCL INFO NVLS multicast support is not available on dev 3
0d1b18e60c40:269:278 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff
0d1b18e60c40:269:278 [1] NCCL INFO NVLS multicast support is not available on dev 1
0d1b18e60c40:269:279 [2] NCCL INFO Setting affinity for GPU 2 to ffff,f00000ff,fff00000
0d1b18e60c40:269:279 [2] NCCL INFO NVLS multicast support is not available on dev 2
0d1b18e60c40:269:277 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
0d1b18e60c40:269:277 [0] NCCL INFO NVLS multicast support is not available on dev 0
0d1b18e60c40:269:277 [0] NCCL INFO Channel 00/02 : 0 1 2 3
0d1b18e60c40:269:280 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
0d1b18e60c40:269:279 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
0d1b18e60c40:269:280 [3] NCCL INFO P2P Chunksize set to 131072
0d1b18e60c40:269:277 [0] NCCL INFO Channel 01/02 : 0 1 2 3
0d1b18e60c40:269:278 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
0d1b18e60c40:269:277 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
0d1b18e60c40:269:278 [1] NCCL INFO P2P Chunksize set to 131072
0d1b18e60c40:269:277 [0] NCCL INFO P2P Chunksize set to 131072
0d1b18e60c40:269:279 [2] NCCL INFO P2P Chunksize set to 131072
Bus error (core dumped)
What do I need to configure to use NCCL?