NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

The variable NCCL_IB_ADDR_RANGE did not work properly after being configured #1332

Open riverzhang opened 6 days ago

riverzhang commented 6 days ago

Some software versions:
nccl test: 2.13.9
openmpi: 4.1.5
rdma ofed: 23.10-1.1.9.0
nvidia-driver: 535.104.12-1
cuda: 11.4.4-1
nccl: 2.21.5-1

Command:
mpirun --allow-run-as-root -bind-to none -map-by ppr:4:node -np 8 \
  -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
  -H xxxxx:4,xxxxx:4 \
  -x NCCL_NVLS_ENABLE=0 \
  -x NCCL_IB_HCA=mlx5_0,mlx5_1 \
  -x NCCL_SOCKET_IFNAME=eth0 \
  -x NCCL_IB_ADDR_RANGE=192.168.64.0/24 \
  -x NCCL_IB_ADDR_FAMILY=AF_INET \
  -x NCCL_IB_ROCE_VERSION_NUM=2 \
  -x NCCL_DEBUG=INFO \
  -x NCCL_IB_TC=160 \
  -mca btl_tcp_if_include eth0 \
  ./build/all_reduce_perf -b 256M -e 4G -f 2 -g 1
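Since the command restricts GID selection to 192.168.64.0/24, it may help to confirm that the IPv4 address on each RoCE interface actually falls inside that range. The sketch below only illustrates a plain CIDR membership test with Python's standard `ipaddress` module; it is not NCCL's own matching code. The helper name `in_addr_range` and the sample addresses are hypothetical.

```python
import ipaddress


def in_addr_range(address, cidr="192.168.64.0/24"):
    """Return True if `address` lies inside the CIDR block `cidr`."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(cidr)


if __name__ == "__main__":
    # Hypothetical addresses; substitute the ones assigned to the RoCE ports.
    for addr in ("192.168.64.17", "192.168.65.1"):
        print(addr, in_addr_range(addr))
```

If none of the addresses bound to mlx5_0 or mlx5_1 fall in the range, the value passed to NCCL_IB_ADDR_RANGE is worth rechecking first.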

Error log:
busybox2-68df6c586-ntvlv:11537:11571 [3] NCCL INFO NCCL_IB_ADDR_FAMILY set by environment to AF_INET
busybox2-68df6c586-ntvlv:11537:11571 [3] NCCL INFO NCCL_IB_ROCE_VERSION_NUM set by environment to 2.
busybox2-68df6c586-ntvlv:11537:11571 [3] NCCL INFO NCCL_IB_ADDR_RANGE set by environment to 192.168.64.0/24
busybox2-68df6c586-ntvlv:11537:11571 [3] NCCL INFO transport/net_ib.cc:282 -> 2
busybox2-68df6c586-ntvlv:11537:11571 [3] NCCL INFO transport/net_ib.cc:305 -> 2
busybox2-68df6c586-ntvlv:11537:11571 [3] NCCL INFO transport/net_ib.cc:1047 -> 2
busybox2-68df6c586-ntvlv:11537:11571 [3] NCCL INFO transport/net.cc:687 -> 2
busybox2-68df6c586-ntvlv:11537:11560 [3] NCCL INFO transport/net.cc:306 -> 2
busybox2-68df6c586-ntvlv:11537:11560 [3] NCCL INFO transport.cc:165 -> 2
busybox2-68df6c586-ntvlv:11537:11560 [3] NCCL INFO init.cc:1263 -> 2
busybox2-68df6c586-ntvlv:11537:11560 [3] NCCL INFO init.cc:1548 -> 2
busybox2-68df6c586-ntvlv:11537:11560 [3] NCCL INFO group.cc:64 -> 2 [Async thread]
busybox2-68df6c586-ntvlv:11537:11537 [3] NCCL INFO group.cc:418 -> 2
busybox2-68df6c586-ntvlv:11537:11537 [3] NCCL INFO group.cc:95 -> 2
busybox2-68df6c586-ntvlv: Test NCCL failure common.cu:961 'unhandled system error (run with NCCL_DEBUG=INFO for details) / '
 .. busybox2-68df6c586-ntvlv pid 11537: Test failure common.cu:844

riverzhang commented 2 days ago

@sjeaugey Hello, I'm looking into this NCCL problem. Similar problems have been reported (e.g. https://github.com/NVIDIA/nccl/issues/890), and I've tried the suggestions there, but none of them worked.

gcongiu commented 2 days ago

@riverzhang that looks like a problem with RoCE version detection. The code retrieves the RoCE version by reading it from /sys/class/infiniband/<device>/ports/<port_num>/gid_attrs/types/<gid_index>. The open (or read) call is failing for the above file and returning ncclSystemError (error code 2). Could you check if the path exists?
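One way to run that check on each node is sketched below. It walks the sysfs layout quoted above and prints the type string for every GID index. It assumes port 1 and the two devices from the NCCL_IB_HCA setting in the original command. The helper name `roce_gid_types` is hypothetical, and the script is a diagnostic aid only, not what NCCL itself executes.

```python
from pathlib import Path


def roce_gid_types(device, port=1, sysfs_root="/sys/class/infiniband"):
    """Map each GID index of one device port to its type string.

    Entries that cannot be read are reported with the OS error text
    instead of raising, so every index shows up in the result.
    """
    types_dir = Path(sysfs_root) / device / "ports" / str(port) / "gid_attrs" / "types"
    if not types_dir.is_dir():
        raise FileNotFoundError(f"{types_dir} does not exist")
    result = {}
    entries = [p for p in types_dir.iterdir() if p.name.isdigit()]
    for entry in sorted(entries, key=lambda p: int(p.name)):
        try:
            result[int(entry.name)] = entry.read_text().strip()
        except OSError as exc:
            # A failing read here is the kind of failure described above.
            result[int(entry.name)] = f"<unreadable: {exc.strerror}>"
    return result


if __name__ == "__main__":
    # Device names taken from NCCL_IB_HCA in the command; port 1 is an assumption.
    for dev in ("mlx5_0", "mlx5_1"):
        try:
            for idx, gid_type in roce_gid_types(dev).items():
                print(f"{dev} port 1 gid {idx}: {gid_type}")
        except FileNotFoundError as exc:
            print(f"{dev}: {exc}")
```

A missing directory, or unreadable entries at the GID index that ends up being selected, would be consistent with the failure described above. When running inside a container, it is also worth checking that /sys/class/infiniband is visible there at all.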

gcongiu commented 2 days ago

Could you apply this patch and rerun your tests? 0001-net_ib-add-warn-debug-for-RoCE-version-detection.patch