Hi, I have a question about how NCCL uses multiple CUDA streams. NCCL APIs such as ncclAllReduce let the user pass a CUDA stream as an input argument, and the ncclLaunchPrepare function (defined in src/enqueue.cc) contains the following comment:
// Semantically we want these dependencies for the kernels launched:
// 1. Launch host task on hostStream.
// 2. Launch kernel, depends on all of {deviceStream, hostStream, userStream[i]...}
// 3. {deviceStream, userStream[i]...} depend on kernel.
// We achieve this by:
// 1. userStream[0] waits on deviceStream
// 2. deviceStream waits on each of userStream[1...]
// 3. host task launch on hostStream
// 4. userStream[0] waits on hostStream
// 5. kernel launch on userStream[0]
// 6. deviceStream waits on userStream[0]
// 7. userStream[1...] each waits on deviceStream
// The two-level fan-in fan-out is because ncclStrongStreamWaitStream() requires
// at least one of the two streams to be strong-stream.
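To make my current reading of this comment concrete, here is a toy C++ model of the wait/launch steps. It is not NCCL code and calls no CUDA API: StreamModel, LaunchModelResult, simulateLaunchPrepare, and the "prior:" markers are names I made up for illustration. Each stream is reduced to the set of operations that later work on it is ordered after, and "A waits on B" copies B's current set into A. One assumption to flag: the sketch applies item 2 before item 1, because in this point-in-time model (which is how I understand CUDA event waits to behave) userStream[0] would otherwise not pick up the work from userStream[1...]. I am assuming the comment lists the edges rather than the exact call order.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model (hypothetical, not NCCL code): a stream is represented by the set
// of operations that anything enqueued on it from now on is ordered after.
using Deps = std::set<std::string>;

struct StreamModel {
  std::map<std::string, Deps> tails;  // stream name -> operations ordered before its tail

  // Register a stream with a marker standing for its previously enqueued work.
  void addStream(const std::string& name) { tails[name] = {"prior:" + name}; }

  // "waiter waits on signaler": waiter inherits everything signaler has so far.
  void wait(const std::string& waiter, const std::string& signaler) {
    Deps snapshot = tails.at(signaler);  // copy: the wait sees a point-in-time view
    tails.at(waiter).insert(snapshot.begin(), snapshot.end());
  }

  // Enqueue op on a stream. Returns what op is ordered after, then makes op
  // part of the stream's tail.
  Deps launch(const std::string& stream, const std::string& op) {
    Deps before = tails.at(stream);
    tails.at(stream).insert(op);
    return before;
  }
};

struct LaunchModelResult {
  Deps kernelDeps;                    // everything the kernel is ordered after
  std::map<std::string, Deps> tails;  // final state of every stream
};

// Replays the seven steps from the comment for nUserStreams >= 1 user streams.
LaunchModelResult simulateLaunchPrepare(int nUserStreams) {
  StreamModel m;
  m.addStream("deviceStream");
  m.addStream("hostStream");
  std::vector<std::string> user;
  for (int i = 0; i < nUserStreams; ++i) {
    user.push_back("userStream[" + std::to_string(i) + "]");
    m.addStream(user.back());
  }

  // Item 2 (applied first, see the assumption above): fan-in of the extra
  // user streams into deviceStream.
  for (int i = 1; i < nUserStreams; ++i) m.wait("deviceStream", user[i]);
  // Item 1: userStream[0] waits on deviceStream.
  m.wait(user[0], "deviceStream");
  // Item 3: host task launch on hostStream.
  m.launch("hostStream", "hostTask");
  // Item 4: userStream[0] waits on hostStream.
  m.wait(user[0], "hostStream");
  // Item 5: kernel launch on userStream[0].
  Deps kernelDeps = m.launch(user[0], "kernel");
  // Item 6: deviceStream waits on userStream[0].
  m.wait("deviceStream", user[0]);
  // Item 7: fan-out, each extra user stream waits on deviceStream.
  for (int i = 1; i < nUserStreams; ++i) m.wait(user[i], "deviceStream");

  return {kernelDeps, m.tails};
}
```

In this model, simulateLaunchPrepare(3) gives a kernelDeps set that contains hostTask plus the "prior:" marker of deviceStream and of every user stream, and afterwards deviceStream and every user stream contain "kernel". That matches the three semantic dependencies listed at the top of the comment, with deviceStream acting as the hub between userStream[0] and the other user streams.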
I'm quite confused and would like to understand how these streams (hostStream, deviceStream, userStream, and the API stream) relate to each other and what each one is designed for.
Thanks in advance for your reply.