NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Scheduling to Packet Sending Connection from ncclEnqueueCheck to finally sending the packet in misc/socket.cc::socketProgressOpt #1370

Open y344shi opened 4 months ago

y344shi commented 4 months ago

Hi NCCL team,

My research team and I have been studying the NVIDIA NCCL library for network performance benchmarking, focusing on how communication tasks are scheduled on the CPU and how they are executed on the GPU.

So far we have identified three pieces: the queueing of collective tasks in ncclEnqueueCheck; the final step of sending data in misc/socket.cc::socketProgressOpt, which runs on a packet-by-packet basis while ncclDevKernel is executing; and the various send functions implemented by the net transport. We have not been able to connect the mechanisms in between. Could anyone explain, or point us to documentation or code comments that cover, the following two aspects?

  1. Where the work appended by ncclEnqueueCheck is dequeued during collective task scheduling.

We noticed that communication operations are scheduled by ncclEnqueueCheck and then carried out asynchronously at a later time. Starting from taskAppend on the scheduling side, it would be great if someone could point out where these tasks are dispatched for execution.

  2. The callers above the endpoint misc/socket.cc::socketProgressOpt <- ncclNetSocketTest <- sendProxyProgress

We also found the socket endpoint where data is sent out in chunks, but we are having difficulty following the call chain further up the stack to find the main dispatcher of these send events, if there is one.
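To make both questions concrete, here is a toy model of the flow as we currently understand it. It is not NCCL code and it makes no claim about how NCCL is actually structured: `ToyRequest`, `toyProgress`, `ToyScheduler`, and `runToyModel` are hypothetical names invented for this sketch. It assumes that work is first appended to a pending queue, dispatched later in a batch, and then driven to completion by a loop that repeatedly calls a non-blocking progress function sending at most one chunk per call.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical stand-in for one queued send; not an NCCL type.
struct ToyRequest {
    std::size_t totalBytes;
    std::size_t sentBytes;
    explicit ToyRequest(std::size_t n) : totalBytes(n), sentBytes(0) {}
};

// Toy analogue of a non-blocking progress call: sends at most
// chunkBytes per invocation and reports whether the request is done.
bool toyProgress(ToyRequest& req, std::size_t chunkBytes) {
    assert(req.sentBytes <= req.totalBytes);
    std::size_t remaining = req.totalBytes - req.sentBytes;
    req.sentBytes += remaining < chunkBytes ? remaining : chunkBytes;
    return req.sentBytes == req.totalBytes;
}

class ToyScheduler {
public:
    // Scheduling side: only records the work, nothing is sent yet.
    void enqueue(std::size_t bytes) { pending_.push_back(ToyRequest(bytes)); }

    // Dispatch point (the part question 1 asks about): pending work
    // is handed over to the progress loop.
    void dispatch() {
        while (!pending_.empty()) {
            active_.push_back(pending_.front());
            pending_.pop_front();
        }
    }

    // One pass of the progress loop (the part question 2 asks about):
    // polls every active request once and retires the finished ones.
    std::size_t progressOnce(std::size_t chunkBytes) {
        std::size_t completed = 0;
        for (auto it = active_.begin(); it != active_.end();) {
            if (toyProgress(*it, chunkBytes)) {
                it = active_.erase(it);
                ++completed;
            } else {
                ++it;
            }
        }
        return completed;
    }

    bool idle() const { return pending_.empty() && active_.empty(); }

private:
    std::deque<ToyRequest> pending_;
    std::vector<ToyRequest> active_;
};

// Enqueues all sends, dispatches them, and returns how many
// progress passes were needed to drain everything.
std::size_t runToyModel(const std::vector<std::size_t>& sizes,
                        std::size_t chunkBytes) {
    if (chunkBytes == 0) return 0;  // a zero chunk would never make progress
    ToyScheduler sched;
    for (std::size_t n : sizes) sched.enqueue(n);
    sched.dispatch();
    std::size_t passes = 0;
    while (!sched.idle()) {
        sched.progressOnce(chunkBytes);
        ++passes;
    }
    return passes;
}
```

For example, calling `runToyModel({10, 25}, 10)` enqueues two sends and polls them in 10-byte chunks until both are complete. What we are trying to locate in NCCL are the real counterparts, if any, of `dispatch()` and of the loop that calls `progressOnce()`.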

Any help is appreciated.

Best regards,