NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL Collective Log Query #1470

Open gjit-juniper opened 1 month ago

gjit-juniper commented 1 month ago

Hello,

I have been going through the logging functionality in NCCL and wanted to know if there is a way to determine the global ranks of the devices involved in a collective operation. Currently, the logs show the total number of ranks (i.e. nRanks) involved in the operation (e.g. AllReduce, Broadcast, etc.). Is there a way to also get the global ranks of all the GPUs in the communicator?

Thanks, Jit.

sjeaugey commented 1 month ago

That would be complicated, given that NCCL has no notion of "COMM WORLD" (which is also what allows NCCL to shrink/grow and work on fault tolerance).

One easy solution though is to save the output of each rank into a different file.
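A minimal sketch of the per-rank file approach, assuming one process per GPU and a launcher that passes the environment through to the workers. It relies on the `NCCL_DEBUG`, `NCCL_DEBUG_SUBSYS` and `NCCL_DEBUG_FILE` environment variables; the helper name `nccl_logging_env` and the `nccl.%h.%p.log` file name are illustrative choices, not anything prescribed by NCCL.

```python
import os


def nccl_logging_env(log_dir, base_env=None):
    """Return a copy of the environment with per-process NCCL log files enabled.

    `nccl_logging_env` is a hypothetical helper written for this example.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"
    # Restrict output to collective calls to keep the files small.
    env["NCCL_DEBUG_SUBSYS"] = "COLL"
    # NCCL expands %h to the hostname and %p to the process ID,
    # so every process writes to its own file.
    env["NCCL_DEBUG_FILE"] = os.path.join(log_dir, "nccl.%h.%p.log")
    return env
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching the training command. With one process per rank, each file then holds the output of a single rank.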

gjit-juniper commented 1 month ago

Thanks for the response @sjeaugey! Is there a way to log the next rank in the given topology without dumping the topology graph? For example, if the log line says an AllReduce op is carried out (and the log line is generated by rank X), is there a way to log the next rank that X sends data to?

Currently the log line says the following:

hostname:4173:4655 [3] NCCL INFO AllReduce: opCount 706 sendbuff 0x7fbe13a00000 recvbuff 0x7fbe13a00000 count 1 datatype 1 op 0 root 0 comm 0xe574d50 [nranks=8] stream 0xe578da0

which, to my understanding, says that the AllReduce is currently being carried out on rank 3. I would like to know if we can also log the next rank to which the reduced data is being sent from rank 3.
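
For post-processing, a log line like the one quoted above could be split into fields with a short parser. This is a sketch based only on that single sample line; the layout may differ between NCCL versions, and the function name `parse_collective_line` and the key names are hypothetical. The bracketed number in the prefix is stored as `device` here, on the assumption that it is the local CUDA device index, which need not equal the rank within the communicator.

```python
import re

# Prefix layout assumed from the sample line:
# "<host>:<pid>:<tid> [<n>] NCCL INFO <Collective>: <key> <value> ..."
LINE_RE = re.compile(
    r"^(?P<host>[^:\s]+):(?P<pid>\d+):(?P<tid>\d+) \[(?P<device>\d+)\] "
    r"NCCL INFO (?P<collective>\w+): (?P<fields>.*)$"
)
NRANKS_RE = re.compile(r"\[nranks=(\d+)\]")


def parse_collective_line(line):
    """Parse one collective log line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = {
        "host": match["host"],
        "pid": int(match["pid"]),
        "tid": int(match["tid"]),
        "device": int(match["device"]),
        "collective": match["collective"],
    }
    fields = match["fields"]
    nranks = NRANKS_RE.search(fields)
    if nranks:
        record["nranks"] = int(nranks.group(1))
        fields = NRANKS_RE.sub("", fields)
    # The remaining tokens alternate between key and value. Values stay
    # strings because some of them (pointers, for example) are hexadecimal.
    tokens = fields.split()
    for key, value in zip(tokens[0::2], tokens[1::2]):
        record[key] = value
    return record
```

Applied to the sample line, this yields `collective` = `AllReduce`, `device` = 3 and `nranks` = 8. It does not recover the peer rank, since that information is not present in the line.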