NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Debug results for sendSetup() and recvSetup() #1344

Open ZhiyiHu1999 opened 4 months ago

ZhiyiHu1999 commented 4 months ago

Hello, I used the INFO() debug API in NCCL to trace the ring/tree channels computed for 4 GPUs, each on a separate node. The INFO() calls I am looking at are in sendSetup() and recvSetup() in transport/net.cc. For the ring channels, the rings are built as expected. For the tree channels, however, the connections come out as the picture shows (each arrow points from sender to receiver, according to the INFO debug output I gathered): there is no send from rank 1 to rank 2 or from rank 2 to rank 3 in tree channel 0, and no send from rank 0 to rank 1 or from rank 1 to rank 2 in tree channel 1. For a double binary tree, I would expect every rank to be able to send to and receive from the ranks it is connected to. Are my debug results normal or abnormal? Thanks a lot!

Here is the detailed debug information I obtained:

nid03084:31364:31364 [0] NCCL INFO Channel 00/0 : 2[2000] -> 3[2000] [receive] via NET/Socket/0
nid03084:31364:31364 [0] NCCL INFO Channel 01/0 : 2[2000] -> 3[2000] [receive] via NET/Socket/0
nid03084:31364:31364 [0] NCCL INFO Channel 00/0 : 3[2000] -> 0[2000] [send] via NET/Socket/0
nid03084:31364:31364 [0] NCCL INFO Channel 01/0 : 3[2000] -> 0[2000] [send] via NET/Socket/0
nid03084:31364:31364 [0] NCCL INFO Connected all rings
nid03084:31364:31364 [0] NCCL INFO Channel 01/0 : 1[2000] -> 3[2000] [receive] via NET/Socket/0
nid03084:31364:31364 [0] NCCL INFO Channel 01/0 : 3[2000] -> 1[2000] [send] via NET/Socket/0
nid03084:31364:31364 [0] NCCL INFO Channel 00/0 : 3[2000] -> 2[2000] [send] via NET/Socket/0
nid03084:31364:31364 [0] NCCL INFO Connected all trees
[MPI Rank 3] Success

nid03082:14538:14538 [0] NCCL INFO Channel 00/0 : 0[2000] -> 1[2000] [receive] via NET/Socket/0
nid03082:14538:14538 [0] NCCL INFO Channel 01/0 : 0[2000] -> 1[2000] [receive] via NET/Socket/0
nid03082:14538:14538 [0] NCCL INFO Channel 00/0 : 1[2000] -> 2[2000] [send] via NET/Socket/0
nid03082:14538:14538 [0] NCCL INFO Channel 01/0 : 1[2000] -> 2[2000] [send] via NET/Socket/0
nid03082:14538:14538 [0] NCCL INFO Connected all rings
nid03082:14538:14538 [0] NCCL INFO Channel 01/0 : 3[2000] -> 1[2000] [receive] via NET/Socket/0
nid03082:14538:14538 [0] NCCL INFO Channel 01/0 : 1[2000] -> 3[2000] [send] via NET/Socket/0
nid03082:14538:14538 [0] NCCL INFO Channel 00/0 : 2[2000] -> 1[2000] [receive] via NET/Socket/0
nid03082:14538:14538 [0] NCCL INFO Channel 01/0 : 2[2000] -> 1[2000] [receive] via NET/Socket/0
nid03082:14538:14538 [0] NCCL INFO Channel 01/0 : 1[2000] -> 0[2000] [send] via NET/Socket/0
nid03082:14538:14538 [0] NCCL INFO Connected all trees
[MPI Rank 1] Success

nid03081:9116:9116 [0] NCCL INFO Channel 00/0 : 3[2000] -> 0[2000] [receive] via NET/Socket/0
nid03081:9116:9116 [0] NCCL INFO Channel 01/0 : 3[2000] -> 0[2000] [receive] via NET/Socket/0
nid03081:9116:9116 [0] NCCL INFO Channel 00/0 : 0[2000] -> 1[2000] [send] via NET/Socket/0
nid03081:9116:9116 [0] NCCL INFO Channel 01/0 : 0[2000] -> 1[2000] [send] via NET/Socket/0
nid03081:9116:9116 [0] NCCL INFO Connected all rings
nid03081:9116:9116 [0] NCCL INFO Channel 00/0 : 2[2000] -> 0[2000] [receive] via NET/Socket/0
nid03081:9116:9116 [0] NCCL INFO Channel 00/0 : 0[2000] -> 2[2000] [send] via NET/Socket/0
nid03081:9116:9116 [0] NCCL INFO Channel 01/0 : 1[2000] -> 0[2000] [receive] via NET/Socket/0
nid03081:9116:9116 [0] NCCL INFO Connected all trees
[MPI Rank 0] Success

nid03083:12651:12651 [0] NCCL INFO Channel 00/0 : 1[2000] -> 2[2000] [receive] via NET/Socket/0
nid03083:12651:12651 [0] NCCL INFO Channel 01/0 : 1[2000] -> 2[2000] [receive] via NET/Socket/0
nid03083:12651:12651 [0] NCCL INFO Channel 00/0 : 2[2000] -> 3[2000] [send] via NET/Socket/0
nid03083:12651:12651 [0] NCCL INFO Channel 01/0 : 2[2000] -> 3[2000] [send] via NET/Socket/0
nid03083:12651:12651 [0] NCCL INFO Connected all rings
nid03083:12651:12651 [0] NCCL INFO Channel 00/0 : 0[2000] -> 2[2000] [receive] via NET/Socket/0
nid03083:12651:12651 [0] NCCL INFO Channel 00/0 : 2[2000] -> 0[2000] [send] via NET/Socket/0
nid03083:12651:12651 [0] NCCL INFO Channel 00/0 : 3[2000] -> 2[2000] [receive] via NET/Socket/0
nid03083:12651:12651 [0] NCCL INFO Channel 00/0 : 2[2000] -> 1[2000] [send] via NET/Socket/0
nid03083:12651:12651 [0] NCCL INFO Channel 01/0 : 2[2000] -> 1[2000] [send] via NET/Socket/0
nid03083:12651:12651 [0] NCCL INFO Connected all trees
[MPI Rank 2] Success
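
To cross-check the picture against the logs, the sender-to-receiver edges can be tabulated per channel with a small script. This is only a sketch: the function name `collect_send_edges` is hypothetical, the regular expressions are written against the line format shown above, and it assumes that lines a process logs before its "Connected all rings" message belong to ring setup and lines after it belong to tree setup.

```python
import re
from collections import defaultdict

# "<host>:<pid>:<tid> [<dev>] NCCL INFO <message>", as in the logs above.
PREFIX_RE = re.compile(r"^(\S+) \[\d+\] NCCL INFO (.*)$")
# "Channel 00/0 : 2[2000] -> 3[2000] [send] via NET/Socket/0"
CHANNEL_RE = re.compile(
    r"^Channel (\d+)/\d+ : (\d+)\[\w+\] -> (\d+)\[\w+\] \[(send|receive)\]"
)

def collect_send_edges(lines):
    """Return {(phase, channel): {(src, dst), ...}} built from [send] lines only.

    Assumption: for each process, lines before "Connected all rings" are
    ring setup and lines after it are tree setup.
    """
    phase = defaultdict(lambda: "ring")  # current phase per host:pid:tid
    edges = defaultdict(set)
    for line in lines:
        m = PREFIX_RE.match(line.strip())
        if not m:
            continue  # e.g. "[MPI Rank 2] Success"
        proc, msg = m.groups()
        if msg.startswith("Connected all rings"):
            phase[proc] = "tree"
            continue
        c = CHANNEL_RE.match(msg)
        if c and c.group(4) == "send":
            channel, src, dst = int(c.group(1)), int(c.group(2)), int(c.group(3))
            edges[(phase[proc], channel)].add((src, dst))
    return edges

# A few lines copied from the rank 2 log above.
sample = [
    "nid03083:12651:12651 [0] NCCL INFO Channel 00/0 : 2[2000] -> 3[2000] [send] via NET/Socket/0",
    "nid03083:12651:12651 [0] NCCL INFO Connected all rings",
    "nid03083:12651:12651 [0] NCCL INFO Channel 00/0 : 2[2000] -> 0[2000] [send] via NET/Socket/0",
    "nid03083:12651:12651 [0] NCCL INFO Channel 00/0 : 2[2000] -> 1[2000] [send] via NET/Socket/0",
]
edges = collect_send_edges(sample)
for key in sorted(edges):
    print(key, sorted(edges[key]))
```

In the four logs above, the sends that do not show up after "Connected all rings" (1 -> 2 and 2 -> 3 on channel 0, 0 -> 1 and 1 -> 2 on channel 1) are all logged as sends on the same channel before that message. The logs alone do not show whether that is why they are not logged again.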