Mellanox / nv_peer_memory


nccl test failed when using gdr #78

Open wangshaochuang opened 3 years ago

wangshaochuang commented 3 years ago

I disabled P2P and SHM so that the test goes over the network.
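For anyone reproducing this, here is a minimal sketch of the environment the run appears to use. `NCCL_P2P_LEVEL=LOC` and `NCCL_SHM_DISABLE=1` are taken from the log lines in this post. `NCCL_DEBUG=INFO` is an assumption inferred from the `NCCL INFO` output, and the helper names are hypothetical.

```python
import os
import subprocess


def nccl_network_only_env(base_env=None):
    """Return a copy of the environment with P2P and SHM transports disabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_P2P_LEVEL"] = "LOC"   # value reported in the log of this issue
    env["NCCL_SHM_DISABLE"] = "1"   # value reported in the log of this issue
    env["NCCL_DEBUG"] = "INFO"      # assumption: needed to get the INFO lines
    return env


def run_all_reduce(binary="./all_reduce_perf", gpus=2):
    """Launch the test with the same -g argument as in this issue."""
    cmd = [binary, "-g", str(gpus)]
    return subprocess.run(cmd, env=nccl_network_only_env(), check=False)
```

Calling `run_all_reduce()` from the nccl-tests build directory should then force the two GPUs to talk through NET/IB instead of P2P or shared memory.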

./all_reduce_perf -g 2
nThread 1 nGpus 2 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1

Using devices
Rank 0 Pid 85517 on k69a05298 device 0 [0x52] A100-SXM4-40GB
Rank 1 Pid 85517 on k69a05298 device 1 [0x57] A100-SXM4-40GB
k69a05298:85517:85517 [0] NCCL INFO Bootstrap : Using [0]bond0:100.82.131.167<0> [1]bond1:11.22.33.61<0> [2]bond2:11.22.33.62<0> [3]bond3:11.22.33.63<0> [4]bond4:11.22.33.64<0> [5]bond5:11.22.33.65<0> [6]bond6:11.22.33.66<0> [7]bond7:11.22.33.67<0> [8]bond8:11.22.33.68<0> [9]br-bb9003a7ecb2:192.168.10.1<0>
k69a05298:85517:85517 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
k69a05298:85517:85517 [0] NCCL INFO NET/IB : Using [0]mlx5_bond_8:1/RoCE [1]mlx5_bond_7:1/RoCE [2]mlx5_bond_6:1/RoCE [3]mlx5_bond_5:1/RoCE [4]mlx5_bond_4:1/RoCE [5]mlx5_bond_3:1/RoCE [6]mlx5_bond_2:1/RoCE [7]mlx5_bond_1:1/RoCE [8]mlx5_bond_0:1/RoCE ; OOB bond0:100.82.131.167<0>
k69a05298:85517:85517 [0] NCCL INFO Using network IB
NCCL version 2.7.8+cuda11.0
k69a05298:85517:85801 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
k69a05298:85517:85801 [0] NCCL INFO NCCL_SHM_DISABLE set by environment to 1.
k69a05298:85517:85802 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
k69a05298:85517:85801 [0] NCCL INFO Channel 00/02 : 0 1
k69a05298:85517:85801 [0] NCCL INFO Channel 01/02 : 0 1
k69a05298:85517:85802 [1] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] 0/-1/-1->1->-1|-1->1->0/-1/-1
k69a05298:85517:85802 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffffffff,ffffffff,ffffffff
k69a05298:85517:85801 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
k69a05298:85517:85801 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] -1/-1/-1->0->1|1->0->-1/-1/-1
k69a05298:85517:85801 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff,ffffffff,ffffffff
k69a05298:85517:85802 [1] NCCL INFO Channel 00 : 0[52000] -> 1[57000] [receive] via NET/IB/6/GDRDMA
k69a05298:85517:85801 [0] NCCL INFO Channel 00 : 1[57000] -> 0[52000] [receive] via NET/IB/7/GDRDMA
k69a05298:85517:85802 [1] NCCL INFO Channel 00 : 1[57000] -> 0[52000] [send] via NET/IB/6/GDRDMA
k69a05298:85517:85801 [0] NCCL INFO Channel 00 : 0[52000] -> 1[57000] [send] via NET/IB/7/GDRDMA
k69a05298:85517:85802 [1] NCCL INFO Channel 01 : 0[52000] -> 1[57000] [receive] via NET/IB/6/GDRDMA
k69a05298:85517:85801 [0] NCCL INFO Channel 01 : 1[57000] -> 0[52000] [receive] via NET/IB/7/GDRDMA
k69a05298:85517:85802 [1] NCCL INFO Channel 01 : 1[57000] -> 0[52000] [send] via NET/IB/6/GDRDMA
k69a05298:85517:85801 [0] NCCL INFO Channel 01 : 0[52000] -> 1[57000] [send] via NET/IB/7/GDRDMA
k69a05298:85517:85802 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
k69a05298:85517:85801 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
k69a05298:85517:85801 [0] NCCL INFO comm 0x7f6effc59800 rank 0 nranks 2 cudaDev 0 busId 52000 - Init COMPLETE
k69a05298:85517:85802 [1] NCCL INFO comm 0x7f6ef0000b60 rank 1 nranks 2 cudaDev 1 busId 57000 - Init COMPLETE

                                                 out-of-place                       in-place
   size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
    (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)

k69a05298:85517:85517 [0] NCCL INFO Launch mode Group/CGMD
mlx5: k69a05298.eu95sqa: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 02005104 080037a1 0000e4d2
mlx5: k69a05298.eu95sqa: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000003 00000000 00000000 00000000
00000000 02005104 08003c03 00004ed2

k69a05298:85517:85837 [0] transport/net_ib.cc:818 NCCL WARN NET/IB : Got completion with error 4, opcode 1, len 32622, vendor err 81
k69a05298:85517:85837 [0] NCCL INFO include/net.h:28 -> 2
k69a05298:85517:85837 [0] NCCL INFO transport/net.cc:310 -> 2
k69a05298:85517:85837 [0] NCCL INFO proxy.cc:198 -> 2 [Proxy Thread]

k69a05298:85517:85836 [0] transport/net_ib.cc:818 NCCL WARN NET/IB : Got completion with error 4, opcode 1, len 32622, vendor err 81
k69a05298:85517:85836 [0] NCCL INFO include/net.h:28 -> 2
k69a05298:85517:85836 [0] NCCL INFO transport/net.cc:310 -> 2
k69a05298:85517:85836 [0] NCCL INFO proxy.cc:198 -> 2 [Proxy Thread]

Environment: DGX-2
Hardware: GPU A100, NIC Mellanox CX5
NVIDIA driver version: 450.51.05
CUDA version: 11.0
OFED version: 5.0
NCCL version: 2.7.8
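As a reading aid for the `NCCL WARN NET/IB : Got completion with error 4` lines in the log above, the sketch below decodes the numeric status, assuming NCCL prints the raw `ibv_wc_status` value from libibverbs. Under that assumption, 4 is `IBV_WC_LOC_PROT_ERR` (local protection error), which usually points at a problem accessing the registered buffer rather than at the link itself. The function name is hypothetical, and the opcode is left undecoded on purpose.

```python
import re

# First entries of enum ibv_wc_status from libibverbs (<infiniband/verbs.h>).
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    6: "IBV_WC_MW_BIND_ERR",
    7: "IBV_WC_BAD_RESP_ERR",
    8: "IBV_WC_LOC_ACCESS_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

# Matches the warning format seen in this issue's log.
_COMPLETION_RE = re.compile(
    r"Got completion with error (\d+), opcode (\d+), len (\d+), vendor err (\d+)"
)


def decode_completion(line):
    """Parse one NCCL warning line; return None if it is not a completion error."""
    match = _COMPLETION_RE.search(line)
    if match is None:
        return None
    status, opcode, length, vendor_err = (int(g) for g in match.groups())
    return {
        "status": status,
        "status_name": IBV_WC_STATUS.get(status, "unknown"),
        "opcode": opcode,
        "len": length,
        "vendor_err": vendor_err,
        # Hex form is convenient for matching against the mlx5 hex dump.
        "vendor_err_hex": hex(vendor_err),
    }
```

Vendor error 81 is 0x51, which matches the `51` byte in the `02005104` word of the mlx5 dump, so both messages appear to describe the same failed completion.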

OasisArtisan commented 3 years ago

I'm a user like you, but I had the same problem and solved it by disabling PCIe ACS.

I got my information from this issue, which seems to match your problem: https://github.com/NVIDIA/nccl/issues/214

This post explains how to disable PCIe ACS: https://forums.developer.nvidia.com/t/multi-gpu-peer-to-peer-access-failing-on-tesla-k80/39748/13

I'm not an expert so take my suggestion with a grain of salt.
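Before changing anything, it may help to check whether ACS is actually enabled. The sketch below scans the text output of `lspci -vvv` (run as root so the capabilities are readable) and lists devices whose `ACSCtl` line has any flag set with `+`. The function name is hypothetical, and the parsing assumes the usual `lspci` layout of an unindented device line followed by indented capability lines.

```python
import re

# Device header lines start at column 0 with a bus address such as
# "3a:00.0" or, when PCI domains are shown, "0000:3a:00.0".
_DEVICE_RE = re.compile(
    r"^((?:[0-9a-f]{4,}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s", re.IGNORECASE
)
# Capability detail lines are indented, e.g. "\t\tACSCtl:\tSrcValid+ TransBlk- ..."
_ACSCTL_RE = re.compile(r"^\s+ACSCtl:\s*(.*)$")


def devices_with_acs_enabled(lspci_text):
    """Return [(bus_address, [enabled_flags])] for devices with any ACSCtl flag on."""
    enabled = []
    current_device = None
    for line in lspci_text.splitlines():
        device = _DEVICE_RE.match(line)
        if device:
            current_device = device.group(1)
            continue
        acs = _ACSCTL_RE.match(line)
        if acs and current_device is not None:
            on_flags = [f for f in acs.group(1).split() if f.endswith("+")]
            if on_flags:
                enabled.append((current_device, on_flags))
    return enabled
```

One way to use it is to save the output with `sudo lspci -vvv > lspci.txt` and pass the file contents to `devices_with_acs_enabled`. An empty result suggests ACS is already off on every bridge. Disabling it is covered in the links above and is not shown here.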