NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL crashes during NET initialization #1091

Open yanminjia opened 10 months ago

yanminjia commented 10 months ago

NCCL crashes. Here is the call stack, obtained by loading the core dump file into gdb. It looks like the crash is caused by the NET plugin library (libnccl-net.so).

(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007fefc64b890f in nccl_p2p_ib_init (num_devs=0x7fefc64cca38, ncclIbDevs=, ncclIbIfName=0x7fefc64ef090 "ibs22", ncclIbIfAddr=0x7fefc64ef070, ncclIbAsyncThread=0x7fefc64ef020 <ncclIbAsyncThread>, logFunction=<optimized out>) at p2p_plugin.c:315
#2  0x00007ff2369daf80 in ncclNet_v6_as_v7_init (logfn=) at net.cc:54
#3  0x00007ff2369db85a in netGetState (state=, i=0) at net.cc:322
#4  ncclNetInit (comm=comm@entry=0x557d0e33d530) at net.cc:351
#5  0x00007ff2369ca19c in commAlloc (comm=comm@entry=0x557d0e33d530, parent=parent@entry=0x0, ndev=, rank=) at init.cc:334
#6  0x00007ff2369d8ef8 in ncclCommInitRankFunc (job_=0x557d0e341290) at init.cc:1387
#7  0x00007ff2369c661c in ncclAsyncJobMain (arg=0x557d0e341290) at group.cc:62
#8  0x00007ff2364d7ac3 in start_thread (arg=) at ./nptl/pthread_create.c:442
#9  0x00007ff236569a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

It looks like this problem only happens when the number of NICs installed on the server exceeds 16. When I bring one NIC down, it works. Any ideas would be highly appreciated. Thanks.

xxxx@xxxx:~$ ibdev2netdev
mlx5_0 port 1 ==> ens13f0np0 (Up)
mlx5_1 port 1 ==> ens13f1np1 (Up)
mlx5_10 port 1 ==> ens18f0np0 (Up)
mlx5_11 port 1 ==> ens18f1np1 (Up)
mlx5_12 port 1 ==> ibs22 (Up)
mlx5_13 port 1 ==> ens16f0np0 (Up)
mlx5_14 port 1 ==> ens16f1np1 (Up)
mlx5_15 port 1 ==> ens15f0np0 (Up)
mlx5_16 port 1 ==> ens15f1np1 (Up)
mlx5_2 port 1 ==> ens14f0np0 (Up)
mlx5_3 port 1 ==> ens14f1np1 (Up)
mlx5_4 port 1 ==> ens12f0np0 (Up)
mlx5_5 port 1 ==> ens12f1np1 (Up)
mlx5_6 port 1 ==> ens11f0np0 (Up)
mlx5_7 port 1 ==> ens11f1np1 (Up)
mlx5_8 port 1 ==> ens17f0np0 (Up)
mlx5_9 port 1 ==> ens17f1np1 (Up)
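
For anyone reproducing this, the listing above has 17 entries (mlx5_0 through mlx5_16), one more than 16. A minimal Python sketch for counting them from the ibdev2netdev output might look like the following. The function name parse_ibdev2netdev and the regular expression are illustrative assumptions based only on the line format shown in this issue; they are not part of NCCL or of the ibdev2netdev tool.

```python
import re

# Assumed line format, taken from the listing in this issue:
#   "mlx5_12 port 1 ==> ibs22 (Up)"
LINE_RE = re.compile(r"(\S+) port (\d+) ==> (\S+) \((\w+)\)")


def parse_ibdev2netdev(text):
    """Return a list of (ib_device, port, netdev, state) tuples."""
    return [
        (dev, int(port), netdev, state)
        for dev, port, netdev, state in LINE_RE.findall(text)
    ]


# The output pasted in this issue, one mapping per line.
sample = """\
mlx5_0 port 1 ==> ens13f0np0 (Up)
mlx5_1 port 1 ==> ens13f1np1 (Up)
mlx5_10 port 1 ==> ens18f0np0 (Up)
mlx5_11 port 1 ==> ens18f1np1 (Up)
mlx5_12 port 1 ==> ibs22 (Up)
mlx5_13 port 1 ==> ens16f0np0 (Up)
mlx5_14 port 1 ==> ens16f1np1 (Up)
mlx5_15 port 1 ==> ens15f0np0 (Up)
mlx5_16 port 1 ==> ens15f1np1 (Up)
mlx5_2 port 1 ==> ens14f0np0 (Up)
mlx5_3 port 1 ==> ens14f1np1 (Up)
mlx5_4 port 1 ==> ens12f0np0 (Up)
mlx5_5 port 1 ==> ens12f1np1 (Up)
mlx5_6 port 1 ==> ens11f0np0 (Up)
mlx5_7 port 1 ==> ens11f1np1 (Up)
mlx5_8 port 1 ==> ens17f0np0 (Up)
mlx5_9 port 1 ==> ens17f1np1 (Up)
"""

devices = parse_ibdev2netdev(sample)
up_devices = [d for d in devices if d[3] == "Up"]
print(f"{len(up_devices)} of {len(devices)} devices are Up")
```

In practice the text would come from running ibdev2netdev on the affected server (for example via subprocess.run) instead of a pasted sample.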

sjeaugey commented 10 months ago

What version of NCCL are you using? We increased the maximum to 32 some time ago. Can you check with a newer version of NCCL?

If you don't want to upgrade, as a workaround, you can use NCCL_IB_HCA==mlx5_0,mlx5_1,... to restrict NCCL to the interfaces you really need.
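
A note on the syntax above: the second = is part of the value. In NCCL_IB_HCA, a leading = asks for exact name matching; without it the names are treated as prefixes, so mlx5_1 would also select mlx5_10 through mlx5_16. Below is a minimal Python sketch of setting the variable from a launcher script before NCCL initializes. The helper name exact_hca_filter and the choice of which eight HCAs to keep are hypothetical, for illustration only.

```python
import os


def exact_hca_filter(devices):
    """Build an NCCL_IB_HCA value that matches the given HCA names exactly.

    The leading "=" requests exact matching instead of prefix matching.
    """
    if not devices:
        raise ValueError("at least one HCA name is required")
    return "=" + ",".join(devices)


# Hypothetical selection: keep mlx5_0 .. mlx5_7 and ignore the rest.
wanted = [f"mlx5_{i}" for i in range(8)]

# Must be set before the process creates its first NCCL communicator.
os.environ["NCCL_IB_HCA"] = exact_hca_filter(wanted)
print(os.environ["NCCL_IB_HCA"])
```

The same value can be exported from the shell instead; the only requirement is that it is in the environment of every rank before NCCL initializes.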

yanminjia commented 10 months ago

Many thanks for your prompt response. We are using NCCL 2.19.3. It looks like the problem is caused by libnccl-net.so rather than by the NCCL code itself.