NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

bus error #177

Closed bltcn closed 9 months ago

bltcn commented 10 months ago

nvidia-smi topo -m

            GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    NIC2    NIC3    CPU Affinity    NUMA Affinity    GPU NUMA ID
    GPU0    X       NODE    SYS     SYS     NODE    NODE    SYS     SYS     0-19,40-59      0                N/A
    GPU1    NODE    X       SYS     SYS     NODE    NODE    SYS     SYS     0-19,40-59      0                N/A
    GPU2    SYS     SYS     X       NODE    SYS     SYS     NODE    NODE    20-39,60-79     1                N/A
    GPU3    SYS     SYS     NODE    X       SYS     SYS     NODE    NODE    20-39,60-79     1                N/A
    NIC0    NODE    NODE    SYS     SYS     X       PIX     SYS     SYS
    NIC1    NODE    NODE    SYS     SYS     PIX     X       SYS     SYS
    NIC2    SYS     SYS     NODE    NODE    SYS     SYS     X       PIX
    NIC3    SYS     SYS     NODE    NODE    SYS     SYS     PIX     X

Legend:

    X    = Self
    SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
    NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
    PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
    PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
    PIX  = Connection traversing at most a single PCIe bridge
    NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

    NIC0: mlx5_0
    NIC1: mlx5_1
    NIC2: mlx5_2
    NIC3: mlx5_3

NCCL_DEBUG=INFO NCCL_P2P_DIRECT_DISABLE=1 ./build/all_reduce_perf -b 8 -e 128m -f 2 -g 4

# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 269 on 0d1b18e60c40 device 0 [0x31] NVIDIA A10
# Rank 1 Group 0 Pid 269 on 0d1b18e60c40 device 1 [0x4b] NVIDIA A10
# Rank 2 Group 0 Pid 269 on 0d1b18e60c40 device 2 [0x98] NVIDIA A10
# Rank 3 Group 0 Pid 269 on 0d1b18e60c40 device 3 [0xb1] NVIDIA A10

0d1b18e60c40:269:269 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.5<0>
0d1b18e60c40:269:269 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
0d1b18e60c40:269:269 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
0d1b18e60c40:269:269 [3] NCCL INFO cudaDriverVersion 12020
NCCL version 2.18.5+cuda12.2

0d1b18e60c40:269:277 [0] misc/ibvwrap.cc:94 NCCL WARN Call to ibv_open_device failed

0d1b18e60c40:269:277 [0] transport/net_ib.cc:193 NCCL WARN NET/IB : Unable to open device mlx5_0

0d1b18e60c40:269:277 [0] misc/ibvwrap.cc:94 NCCL WARN Call to ibv_open_device failed

0d1b18e60c40:269:277 [0] transport/net_ib.cc:193 NCCL WARN NET/IB : Unable to open device mlx5_1

0d1b18e60c40:269:277 [0] misc/ibvwrap.cc:94 NCCL WARN Call to ibv_open_device failed

0d1b18e60c40:269:277 [0] transport/net_ib.cc:193 NCCL WARN NET/IB : Unable to open device mlx5_2

0d1b18e60c40:269:277 [0] misc/ibvwrap.cc:94 NCCL WARN Call to ibv_open_device failed

0d1b18e60c40:269:277 [0] transport/net_ib.cc:193 NCCL WARN NET/IB : Unable to open device mlx5_3
0d1b18e60c40:269:277 [0] NCCL INFO NET/IB : No device found.
0d1b18e60c40:269:277 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.5<0>
0d1b18e60c40:269:277 [0] NCCL INFO Using network Socket
0d1b18e60c40:269:280 [3] NCCL INFO Using network Socket
0d1b18e60c40:269:278 [1] NCCL INFO Using network Socket
0d1b18e60c40:269:279 [2] NCCL INFO Using network Socket
0d1b18e60c40:269:279 [2] NCCL INFO comm 0x56343d7c46e0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 98000 commId 0xc03500808602777f - Init START
0d1b18e60c40:269:278 [1] NCCL INFO comm 0x56343d7c0190 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 4b000 commId 0xc03500808602777f - Init START
0d1b18e60c40:269:277 [0] NCCL INFO comm 0x56343d7b9f20 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 31000 commId 0xc03500808602777f - Init START
0d1b18e60c40:269:280 [3] NCCL INFO comm 0x56343d7c8bc0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId b1000 commId 0xc03500808602777f - Init START
0d1b18e60c40:269:280 [3] NCCL INFO Setting affinity for GPU 3 to ffff,f00000ff,fff00000
0d1b18e60c40:269:280 [3] NCCL INFO NVLS multicast support is not available on dev 3
0d1b18e60c40:269:278 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff
0d1b18e60c40:269:278 [1] NCCL INFO NVLS multicast support is not available on dev 1
0d1b18e60c40:269:279 [2] NCCL INFO Setting affinity for GPU 2 to ffff,f00000ff,fff00000
0d1b18e60c40:269:279 [2] NCCL INFO NVLS multicast support is not available on dev 2
0d1b18e60c40:269:277 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
0d1b18e60c40:269:277 [0] NCCL INFO NVLS multicast support is not available on dev 0
0d1b18e60c40:269:277 [0] NCCL INFO Channel 00/02 : 0 1 2 3
0d1b18e60c40:269:280 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
0d1b18e60c40:269:279 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
0d1b18e60c40:269:280 [3] NCCL INFO P2P Chunksize set to 131072
0d1b18e60c40:269:277 [0] NCCL INFO Channel 01/02 : 0 1 2 3
0d1b18e60c40:269:278 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
0d1b18e60c40:269:277 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
0d1b18e60c40:269:278 [1] NCCL INFO P2P Chunksize set to 131072
0d1b18e60c40:269:277 [0] NCCL INFO P2P Chunksize set to 131072
0d1b18e60c40:269:279 [2] NCCL INFO P2P Chunksize set to 131072
Bus error (core dumped)

What do I need to configure to use NCCL?

sjeaugey commented 10 months ago

Looks like you didn't configure your container correctly to be able to use the Infiniband interfaces. You also didn't provide enough shared memory (causing the bus error).
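For anyone hitting the same pair of symptoms, both conditions mentioned above can be checked from inside the container before running the benchmark. The following is a minimal Python sketch, not part of nccl-tests. The helper names are made up for illustration, and it assumes that shared memory is mounted at /dev/shm and that InfiniBand device nodes, when passed through to the container, appear under /dev/infiniband.

```python
import os
import shutil


def shm_total_bytes(path="/dev/shm"):
    """Total size in bytes of the filesystem at `path`, or None if it is absent."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).total


def infiniband_device_nodes(dev_dir="/dev/infiniband"):
    """Sorted names of the entries under `dev_dir`, or [] if the directory is absent."""
    if not os.path.isdir(dev_dir):
        return []
    return sorted(os.listdir(dev_dir))


def format_mib(n_bytes):
    """Render a byte count as whole mebibytes, e.g. 67108864 -> '64 MiB'."""
    return f"{n_bytes // 2**20} MiB"


if __name__ == "__main__":
    total = shm_total_bytes()
    if total is None:
        print("/dev/shm: not found")
    else:
        print(f"/dev/shm total: {format_mib(total)}")

    nodes = infiniband_device_nodes()
    if nodes:
        print("InfiniBand device nodes:", ", ".join(nodes))
    else:
        print("No /dev/infiniband entries visible in this container")
```

A very small /dev/shm total (Docker defaults to 64 MB when no --shm-size is given) would be consistent with the bus error, and an empty /dev/infiniband listing would be consistent with the ibv_open_device warnings in the log.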

bltcn commented 10 months ago

Thanks, I will try that.

bltcn commented 9 months ago

Using --shm-size 16g solved this problem. Thanks.
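To make the fix concrete, here is a hedged Python sketch that assembles a docker run command line carrying the --shm-size value reported above. Only --shm-size 16g comes from this thread. The image name is a placeholder, and the optional InfiniBand flags (--device=/dev/infiniband, --cap-add=IPC_LOCK, --ulimit memlock=-1:-1) are a commonly used combination assumed here for illustration, not something confirmed in this issue.

```python
import shlex


def docker_run_command(image, test_cmd, shm_size="16g", expose_infiniband=False):
    """Build an argv list for `docker run` with a larger /dev/shm.

    `image` is a hypothetical image name supplied by the caller.
    `test_cmd` is the command (as a list) to run inside the container.
    """
    argv = ["docker", "run", "--rm", "--gpus", "all", f"--shm-size={shm_size}"]
    if expose_infiniband:
        # Assumption: pass the host's InfiniBand device nodes through and
        # lift the locked-memory limit so verbs can register memory.
        argv += [
            "--device=/dev/infiniband",
            "--cap-add=IPC_LOCK",
            "--ulimit",
            "memlock=-1:-1",
        ]
    argv.append(image)
    argv += list(test_cmd)
    return argv


if __name__ == "__main__":
    cmd = docker_run_command(
        "my-nccl-image",  # placeholder image name, not from the thread
        ["./build/all_reduce_perf", "-b", "8", "-e", "128m", "-f", "2", "-g", "4"],
    )
    # Print a copy-pasteable shell command rather than executing it.
    print(shlex.join(cmd))
```

Printing the command instead of running it keeps the sketch free of side effects. Set expose_infiniband=True only if the host's InfiniBand devices should be visible in the container.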