Question about hostStream, deviceStream and userStream

Hi, I have a question about how NCCL uses multiple streams. NCCL APIs such as ncclAllReduce let the user pass a CUDA stream as an argument, and the ncclLaunchPrepare function (defined in src/enqueue.cc) contains the following comment:

// Semantically we want these dependencies for the kernels launched:
// 1. Launch host task on hostStream.
// 2. Launch kernel, depends on all of {deviceStream, hostStream, userStream[i]...}
// 3. {deviceStream, userStream[i]...} depend on kernel.
// We achieve this by:
// 1. userStream[0] waits on deviceStream
// 2. deviceStream waits on each of userStream[1...]
// 3. host task launch on hostStream
// 4. userStream[0] waits on hostStream
// 5. kernel launch on userStream[0]
// 6. deviceStream waits on userStream[0]
// 7. userStream[1...] each waits on deviceStream
// The two-level fan-in fan-out is because ncclStrongStreamWaitStream() requires
// at least one of the two streams to be strong-stream.
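To check my own reading of the comment, I wrote a small toy model of it. This is only a sketch, not NCCL or CUDA code: `StreamModel`, `runLaunchPrepareModel`, and the `prior:` labels are names I made up, and it assumes that a "wait" behaves like recording an event on the awaited stream, so it only covers work already enqueued there. Under that assumption the sketch applies step 2 before step 1 so that the kernel also picks up the extra user streams; that ordering is a property of my model, not a statement about the order of calls in NCCL's source.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model of stream ordering (hypothetical names, not NCCL/CUDA APIs).
// frontier[s] is the set of work items that anything enqueued next on
// stream s is guaranteed to run after.
struct StreamModel {
  std::map<std::string, std::set<std::string>> frontier;
  std::map<std::string, std::set<std::string>> opDeps;  // op -> items it runs after

  // Give each stream some earlier work so dependencies are observable.
  void addStream(const std::string& s) { frontier[s] = {"prior:" + s}; }

  // "waiter waits on awaited": snapshot what is enqueued on `awaited` now
  // (like an event record) and order `waiter` after that snapshot.
  void wait(const std::string& waiter, const std::string& awaited) {
    const std::set<std::string> snapshot = frontier[awaited];
    frontier[waiter].insert(snapshot.begin(), snapshot.end());
  }

  // Enqueue an op on stream s: it runs after everything in s's frontier.
  void launch(const std::string& s, const std::string& op) {
    opDeps[op] = frontier[s];
    frontier[s].insert(op);
  }
};

// Replays the seven steps from the comment. Assumes numUserStreams >= 1.
StreamModel runLaunchPrepareModel(int numUserStreams) {
  StreamModel m;
  m.addStream("hostStream");
  m.addStream("deviceStream");
  std::vector<std::string> user;
  for (int i = 0; i < numUserStreams; ++i) {
    user.push_back("userStream[" + std::to_string(i) + "]");
    m.addStream(user.back());
  }

  // Fan-in. Step 2 (deviceStream waits on userStream[1...]) is applied
  // before step 1 (userStream[0] waits on deviceStream); see the note above.
  for (std::size_t i = 1; i < user.size(); ++i) m.wait("deviceStream", user[i]);
  m.wait(user[0], "deviceStream");

  m.launch("hostStream", "hostTask");  // step 3
  m.wait(user[0], "hostStream");       // step 4
  m.launch(user[0], "kernel");         // step 5

  // Fan-out.
  m.wait("deviceStream", user[0]);     // step 6
  for (std::size_t i = 1; i < user.size(); ++i) m.wait(user[i], "deviceStream");  // step 7
  return m;
}
```

In this model, `runLaunchPrepareModel(3).opDeps["kernel"]` contains the host task and the prior work of every stream, and afterwards the frontiers of deviceStream and of each userStream contain the kernel while hostStream's does not. That matches the three "semantically we want" points, but it does not tell me why the streams are split this way.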

I'm confused about how these streams (hostStream, deviceStream, userStream, and the stream passed to the API) relate to one another, and what each of them is designed for.

Thanks in advance for your reply.

NVIDIA/nccl issue #1476