Hello, I used the INFO() debug API in NCCL to trace the ring/tree channels computed for 4 GPUs, each on a separate node. The INFO() calls I am looking at are in sendSetup() and recvSetup() in transport/net.cc. For the ring channels, the rings are built as expected. For the tree channels, however, the result is as shown in the following picture (each arrow points from sender to receiver, based on the INFO debug output I gathered): there is no send from rank 1 to rank 2 or from rank 2 to rank 3 in tree channel 0, and no send from rank 0 to rank 1 or from rank 1 to rank 2 in tree channel 1. For a double binary tree, I would expect every rank to be able to send to and receive from each rank it is connected to. Are my debug results normal or abnormal? Thanks a lot!
Here is the detailed debug information I obtained:
nid03084:31364:31364 [0] NCCL INFO Channel 00/0 : 2[2000] -> 3[2000] [receive] via NET/Socket/0
nid03084:31364:31364 [0] NCCL INFO Channel 01/0 : 2[2000] -> 3[2000] [receive] via NET/Socket/0
nid03084:31364:31364 [0] NCCL INFO Channel 00/0 : 3[2000] -> 0[2000] [send] via NET/Socket/0
nid03084:31364:31364 [0] NCCL INFO Channel 01/0 : 3[2000] -> 0[2000] [send] via NET/Socket/0
nid03084:31364:31364 [0] NCCL INFO Connected all rings
nid03084:31364:31364 [0] NCCL INFO Channel 01/0 : 1[2000] -> 3[2000] [receive] via NET/Socket/0
nid03084:31364:31364 [0] NCCL INFO Channel 01/0 : 3[2000] -> 1[2000] [send] via NET/Socket/0
nid03084:31364:31364 [0] NCCL INFO Channel 00/0 : 3[2000] -> 2[2000] [send] via NET/Socket/0
nid03084:31364:31364 [0] NCCL INFO Connected all trees
[MPI Rank 3] Success
nid03082:14538:14538 [0] NCCL INFO Channel 00/0 : 0[2000] -> 1[2000] [receive] via NET/Socket/0
nid03082:14538:14538 [0] NCCL INFO Channel 01/0 : 0[2000] -> 1[2000] [receive] via NET/Socket/0
nid03082:14538:14538 [0] NCCL INFO Channel 00/0 : 1[2000] -> 2[2000] [send] via NET/Socket/0
nid03082:14538:14538 [0] NCCL INFO Channel 01/0 : 1[2000] -> 2[2000] [send] via NET/Socket/0
nid03082:14538:14538 [0] NCCL INFO Connected all rings
nid03082:14538:14538 [0] NCCL INFO Channel 01/0 : 3[2000] -> 1[2000] [receive] via NET/Socket/0
nid03082:14538:14538 [0] NCCL INFO Channel 01/0 : 1[2000] -> 3[2000] [send] via NET/Socket/0
nid03082:14538:14538 [0] NCCL INFO Channel 00/0 : 2[2000] -> 1[2000] [receive] via NET/Socket/0
nid03082:14538:14538 [0] NCCL INFO Channel 01/0 : 2[2000] -> 1[2000] [receive] via NET/Socket/0
nid03082:14538:14538 [0] NCCL INFO Channel 01/0 : 1[2000] -> 0[2000] [send] via NET/Socket/0
nid03082:14538:14538 [0] NCCL INFO Connected all trees
[MPI Rank 1] Success
nid03081:9116:9116 [0] NCCL INFO Channel 00/0 : 3[2000] -> 0[2000] [receive] via NET/Socket/0
nid03081:9116:9116 [0] NCCL INFO Channel 01/0 : 3[2000] -> 0[2000] [receive] via NET/Socket/0
nid03081:9116:9116 [0] NCCL INFO Channel 00/0 : 0[2000] -> 1[2000] [send] via NET/Socket/0
nid03081:9116:9116 [0] NCCL INFO Channel 01/0 : 0[2000] -> 1[2000] [send] via NET/Socket/0
nid03081:9116:9116 [0] NCCL INFO Connected all rings
nid03081:9116:9116 [0] NCCL INFO Channel 00/0 : 2[2000] -> 0[2000] [receive] via NET/Socket/0
nid03081:9116:9116 [0] NCCL INFO Channel 00/0 : 0[2000] -> 2[2000] [send] via NET/Socket/0
nid03081:9116:9116 [0] NCCL INFO Channel 01/0 : 1[2000] -> 0[2000] [receive] via NET/Socket/0
nid03081:9116:9116 [0] NCCL INFO Connected all trees
[MPI Rank 0] Success
nid03083:12651:12651 [0] NCCL INFO Channel 00/0 : 1[2000] -> 2[2000] [receive] via NET/Socket/0
nid03083:12651:12651 [0] NCCL INFO Channel 01/0 : 1[2000] -> 2[2000] [receive] via NET/Socket/0
nid03083:12651:12651 [0] NCCL INFO Channel 00/0 : 2[2000] -> 3[2000] [send] via NET/Socket/0
nid03083:12651:12651 [0] NCCL INFO Channel 01/0 : 2[2000] -> 3[2000] [send] via NET/Socket/0
nid03083:12651:12651 [0] NCCL INFO Connected all rings
nid03083:12651:12651 [0] NCCL INFO Channel 00/0 : 0[2000] -> 2[2000] [receive] via NET/Socket/0
nid03083:12651:12651 [0] NCCL INFO Channel 00/0 : 2[2000] -> 0[2000] [send] via NET/Socket/0
nid03083:12651:12651 [0] NCCL INFO Channel 00/0 : 3[2000] -> 2[2000] [receive] via NET/Socket/0
nid03083:12651:12651 [0] NCCL INFO Channel 00/0 : 2[2000] -> 1[2000] [send] via NET/Socket/0
nid03083:12651:12651 [0] NCCL INFO Channel 01/0 : 2[2000] -> 1[2000] [send] via NET/Socket/0
nid03083:12651:12651 [0] NCCL INFO Connected all trees
[MPI Rank 2] Success
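To make the arrows in the picture easier to cross-check, the edges can be tabulated from the log above with a small script. This is only a sketch under some assumptions: it relies on the exact line format shown here, and it treats every Channel line printed on a host after that host's "Connected all rings" line as belonging to the tree setup. The names `LINE_RE` and `collect_edges` are made up for illustration and are not part of NCCL. It only reports what was logged, so a connection that was not printed during the tree phase will simply be missing from the output.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   nid03084:31364:31364 [0] NCCL INFO Channel 00/0 : 2[2000] -> 3[2000] [receive] via NET/Socket/0
#   nid03084:31364:31364 [0] NCCL INFO Connected all rings
LINE_RE = re.compile(
    r"^(?P<host>\S+?):\d+:\d+ \[\d+\] NCCL INFO "
    r"(?:Channel (?P<chan>\d+)/\d+ : (?P<src>\d+)\[\w+\] -> (?P<dst>\d+)\[\w+\] "
    r"\[(?P<kind>send|receive)\]"
    r"|(?P<marker>Connected all (?:rings|trees)))"
)


def collect_edges(lines):
    """Return {"ring": {channel: {(src, dst), ...}}, "tree": {...}}.

    Assumption: on each host, Channel lines before "Connected all rings"
    belong to the ring setup and Channel lines after it to the tree setup.
    """
    phase_of_host = defaultdict(lambda: "ring")
    edges = {"ring": defaultdict(set), "tree": defaultdict(set)}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # e.g. "[MPI Rank 3] Success"
        host = m.group("host")
        if m.group("marker"):
            if m.group("marker").endswith("rings"):
                phase_of_host[host] = "tree"
            continue
        # Send and receive lines for the same link collapse into one edge.
        edge = (int(m.group("src")), int(m.group("dst")))
        edges[phase_of_host[host]][int(m.group("chan"))].add(edge)
    return edges
```

For example, passing the lines of the saved log to `collect_edges` and printing `sorted(edges["tree"][0])` lists the sender/receiver pairs that were logged for channel 0 during the tree phase.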