Open gjit-juniper opened 5 months ago
I observed that the "root" field always prints out "0"
That's because allreduce doesn't have a "root" argument, so the prints that dump operations will usually set it to 0, which stands for "N/A".
Is there a way to obtain the information of ranks on which the current collective (e.g. AllReduce) is being performed?
All ranks from the communicator need to call into allreduce. If what you want to know is who is printing the line, you can use the beginning of the line to identify the process. E.g.:
ip-172-31-76-215:3725:3725 [2]
node name :pid :tid [GPU NVML index]
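To identify which process printed a given line, the prefix described above can be split mechanically. The following is a minimal sketch, assuming the prefix always has the `node name:pid:tid [GPU NVML index]` layout shown in this reply; the names `LogPrefix` and `parse_prefix` are hypothetical helpers for illustration and are not part of NCCL, and real log lines may vary between NCCL versions.

```python
import re
from typing import NamedTuple, Optional


class LogPrefix(NamedTuple):
    """Fields of the log-line prefix (hypothetical helper type, not an NCCL API)."""
    host: str
    pid: int
    tid: int
    gpu_index: int


# Assumed layout, taken from the reply above: "<node name>:<pid>:<tid> [<GPU NVML index>]"
_PREFIX_RE = re.compile(r"^(?P<host>\S+?):(?P<pid>\d+):(?P<tid>\d+) \[(?P<gpu>\d+)\]")


def parse_prefix(line: str) -> Optional[LogPrefix]:
    """Return the parsed prefix, or None if the line does not match the assumed layout."""
    m = _PREFIX_RE.match(line)
    if m is None:
        return None
    return LogPrefix(m["host"], int(m["pid"]), int(m["tid"]), int(m["gpu"]))


if __name__ == "__main__":
    # The prefix quoted in this thread.
    print(parse_prefix("ip-172-31-76-215:3725:3725 [2]"))
```

Grouping log lines by the parsed `(host, pid)` pair would then separate the output of each process; this only tells you who printed a line, not which peers that process communicated with.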
Thanks for your reply! I'd like to check which ranks communicate with each other during the AllReduce operation. If all the ranks call into AllReduce as you mentioned, does that mean all the ranks (i.e. nRanks) communicate with the GPU at that NVML index? Or do the ranks that communicate with the given GPU NVML index depend on the topology NCCL uses?
Hello, I have been running some communication benchmarks (e.g. NCCL-tests) to test NCCL. When generating NCCL logs for these runs, I observed that the "root" field always prints out "0" as shown in the log snippet below. I have noticed the same behavior while running other benchmarks such as PARAM. Can anyone confirm whether this is expected behavior? If so, what would be the explanation for it?
Additionally, I have tried to log the communication between ranks for the collectives by printing out the global and local rank of the peer, but even then I get 0 for both, as shown below in the log snippet. Is there a way to obtain the information of ranks on which the current collective (e.g. AllReduce) is being performed?