NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

AllReduce Root Always 0 #1238

Open gjit-juniper opened 5 months ago

gjit-juniper commented 5 months ago

Hello, I have been running some communication benchmarks (e.g., NCCL-tests) to test NCCL. While generating NCCL logs for these runs, I observed that the "root" field always prints out "0", as shown in the log snippet below. I have noticed the same behavior while running other benchmarks such as PARAM. Is this expected behavior? If so, what is the explanation for it?

ip-172-31-54-74:30125:30125 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fc9c4000000 recvbuff 0x7fc9bc000000 count 33554432 datatype 7 op 0 root 0 comm 0x5631f0816b40 [nranks=1] stream 0x5631f0021140
ip-172-31-54-74:30125:30125 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fc9c4000000 recvbuff 0x7fc9bc000000 count 33554432 datatype 7 op 0 root 0 comm 0x5631f0816b40 [nranks=1] stream 0x5631f0021140
ip-172-31-54-74:30125:30125 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fc9c4000000 recvbuff 0x7fc9bc000000 count 33554432 datatype 7 op 0 root 0 comm 0x5631f0816b40 [nranks=1] stream 0x5631f0021140
ip-172-31-54-74:30125:30125 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fc9c4000000 recvbuff 0x7fc9bc000000 count 33554432 datatype 7 op 0 root 0 comm 0x5631f0816b40 [nranks=1] stream 0x5631f0021140
ip-172-31-54-74:30125:30125 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fc9c4000000 recvbuff 0x7fc9bc000000 count 33554432 datatype 7 op 0 root 0 comm 0x5631f0816b40 [nranks=1] stream 0x5631f0021140
ip-172-31-54-74:30125:30125 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fc9c4000000 recvbuff 0x7fc9bc000000 count 33554432 datatype 7 op 0 root 0 comm 0x5631f0816b40 [nranks=1] stream 0x5631f0021140
ip-172-31-54-74:30125:30125 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fc9c4000000 recvbuff 0x7fc9bc000000 count 33554432 datatype 7 op 0 root 0 comm 0x5631f0816b40 [nranks=1] stream 0x5631f0021140

Additionally, I have tried to log the communication between ranks for the collectives by printing out the global and local rank of the peer, but even then both come out as 0, as shown in the log snippet below. Is there a way to obtain the information of ranks on which the current collective (e.g., AllReduce) is being performed?

ip-172-31-76-215:3725:3725 [2] NCCL INFO AllReduce: opCount 69 sendbuff 0x7f2ae00000c0 recvbuff 0x7f2ad80000c0 count 8 datatype 7 op 0 root 0 comm 0x5607541d36a0 [nranks=8] stream 0x560753f5c3b0 2[2] -> 0[0]
ip-172-31-76-215:3725:3725 [2] NCCL INFO AllReduce: opCount 6a sendbuff 0x7f2ae00000e0 recvbuff 0x7f2ad80000e0 count 8 datatype 7 op 0 root 0 comm 0x5607541d36a0 [nranks=8] stream 0x560753f5c3b0 2[2] -> 0[0]
ip-172-31-76-215:3725:3725 [2] NCCL INFO AllReduce: opCount 6b sendbuff 0x7f2ae0000100 recvbuff 0x7f2ad8000100 count 8 datatype 7 op 0 root 0 comm 0x5607541d36a0 [nranks=8] stream 0x560753f5c3b0 2[2] -> 0[0]
ip-172-31-76-215:3725:3725 [2] NCCL INFO AllReduce: opCount 6c sendbuff 0x7f2ae0000120 recvbuff 0x7f2ad8000120 count 8 datatype 7 op 0 root 0 comm 0x5607541d36a0 [nranks=8] stream 0x560753f5c3b0 2[2] -> 0[0]
ip-172-31-76-215:3725:3725 [2] NCCL INFO AllReduce: opCount 6d sendbuff 0x7f2ae0000140 recvbuff 0x7f2ad8000140 count 8 datatype 7 op 0 root 0 comm 0x5607541d36a0 [nranks=8] stream 0x560753f5c3b0 2[2] -> 0[0]
ip-172-31-76-215:3725:3725 [2] NCCL INFO AllReduce: opCount 6e sendbuff 0x7f2ae0000160 recvbuff 0x7f2ad8000160 count 8 datatype 7 op 0 root 0 comm 0x5607541d36a0 [nranks=8] stream 0x560753f5c3b0 2[2] -> 0[0]
ip-172-31-76-215:3725:3725 [2] NCCL INFO AllReduce: opCount 6f sendbuff 0x7f2ae0000180 recvbuff 0x7f2ad8000180 count 8 datatype 7 op 0 root 0 comm 0x5607541d36a0 [nranks=8] stream 0x560753f5c3b0 2[2] -> 0[0]
ip-172-31-76-215:3725:3725 [2] NCCL INFO AllReduce: opCount 70 sendbuff 0x7f2ae00001a0 recvbuff 0x7f2ad80001a0 count 8 datatype 7 op 0 root 0 comm 0x5607541d36a0 [nranks=8] stream 0x560753f5c3b0 2[2] -> 0[0]

sjeaugey commented 5 months ago

I observed that the "root" field always prints out "0"

That's because allreduce doesn't have a "root" argument, so the prints that dump operations will usually set it to 0, which here stands for "N/A".
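To illustrate why a root is meaningless for allreduce, here is a minimal pure-Python sketch (not NCCL code) contrasting the two semantics. The function names `all_reduce_sum` and `reduce_sum` are hypothetical and exist only for this illustration: in an allreduce every rank ends up with the reduced result, while in a rooted reduce only the root rank does.

```python
from typing import List, Optional


def all_reduce_sum(inputs: List[List[float]]) -> List[List[float]]:
    """Every rank contributes a buffer and every rank receives the sum.

    There is no root parameter: no rank is special.
    """
    total = [sum(vals) for vals in zip(*inputs)]
    return [list(total) for _ in inputs]


def reduce_sum(inputs: List[List[float]], root: int) -> List[Optional[List[float]]]:
    """Every rank contributes a buffer, but only `root` receives the sum."""
    total = [sum(vals) for vals in zip(*inputs)]
    return [list(total) if rank == root else None for rank in range(len(inputs))]


if __name__ == "__main__":
    # One input buffer per rank (three simulated ranks).
    buffers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(all_reduce_sum(buffers))      # same result on all three ranks
    print(reduce_sum(buffers, root=1))  # result only on rank 1
```

Since the allreduce result lands on every rank, a "root 0" in an AllReduce log line carries no information about which rank is involved.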

Is there a way to obtain the information of ranks on which the current collective (e.g., AllReduce) is being performed?

All ranks from the communicator need to call into allreduce. If what you want to know is who is printing the line, you can use the beginning of the line to identify the process. E.g.:

ip-172-31-76-215:3725:3725 [2]
   node name    :pid :tid  [GPU NVML index]
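As a rough sketch of how that prefix could be used, the following Python snippet pulls the node name, pid, tid, and GPU index out of a log line shaped like the ones in this thread. The function name `parse_nccl_line`, the regular expressions, and the `ROOTED` set are assumptions made for this illustration, not part of NCCL; the field layout is inferred only from the log lines quoted above, with `opCount` read as hexadecimal because the samples run 69, 6a, 6b, and so on.

```python
import re
from typing import Dict, Optional

# Prefix layout taken from the thread: "node name:pid:tid [GPU NVML index]".
LINE_RE = re.compile(
    r"^(?P<host>[^\s:]+):(?P<pid>\d+):(?P<tid>\d+) \[(?P<gpu>\d+)\] "
    r"NCCL INFO (?P<coll>\w+): (?P<rest>.*)$"
)

# Assumption: collectives whose log name is in this set take a root argument.
ROOTED = {"Reduce", "Broadcast"}


def parse_nccl_line(line: str) -> Optional[Dict[str, object]]:
    """Parse one collective log line; return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rest = m.group("rest")
    op_count = re.search(r"\bopCount ([0-9a-fA-F]+)", rest)
    nranks = re.search(r"\[nranks=(\d+)\]", rest)
    root = re.search(r"\broot (\d+)", rest)
    coll = m.group("coll")
    return {
        "host": m.group("host"),
        "pid": int(m.group("pid")),
        "tid": int(m.group("tid")),
        "gpu_index": int(m.group("gpu")),
        "collective": coll,
        "op_count": int(op_count.group(1), 16) if op_count else None,
        "nranks": int(nranks.group(1)) if nranks else None,
        # For unrooted collectives such as AllReduce the printed root is "N/A".
        "root": int(root.group(1)) if root and coll in ROOTED else None,
    }


if __name__ == "__main__":
    sample = (
        "ip-172-31-76-215:3725:3725 [2] NCCL INFO AllReduce: opCount 6a "
        "sendbuff 0x7f2ae00000e0 recvbuff 0x7f2ad80000e0 count 8 datatype 7 "
        "op 0 root 0 comm 0x5607541d36a0 [nranks=8] stream 0x560753f5c3b0"
    )
    print(parse_nccl_line(sample))
```

Grouping parsed lines by `(host, pid)` would then show which process printed each collective call.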

gjit-juniper commented 5 months ago

Thanks for your reply! I'm trying to find out which ranks communicate with each other during the AllReduce operation. If all the ranks call into AllReduce, as you mentioned, does that mean all the ranks (i.e., nRanks) communicate with the GPU at the given NVML index? Or do the ranks that communicate with that GPU depend on the topology NCCL uses?