CUDA&HIP stream asynchronicity

DejvBayer commented 6 months ago

Hi,

this is a snippet showing the launch of a CUDA kernel from the DispatchPlan module.

...

if (app->configuration.num_streams >= 1) {
    result = cuLaunchKernel(axis->VkFFTKernel, ..., app->configuration.stream[app->configuration.streamID], args, 0);
}
else {
    result = cuLaunchKernel(axis->VkFFTKernel, ..., 0, args, 0);
}

// result check

if (app->configuration.num_streams > 1) {
    app->configuration.streamID = app->configuration.streamCounter % app->configuration.num_streams;
    if (app->configuration.streamCounter == 0) {
        cudaError_t res2 = cudaEventRecord(app->configuration.stream_event[app->configuration.streamID], app->configuration.stream[app->configuration.streamID]);
        if (res2 != cudaSuccess) return VKFFT_ERROR_FAILED_TO_EVENT_RECORD;
    }
    app->configuration.streamCounter++;
}

...

I do not understand several things about this code:

Why is each kernel launched into a different stream? I see that in the RunApp module you call VkFFTSync after each kernel launch. I do not think that is necessary unless you want to execute the work in parallel.
Is it correct that only the event at index 0 is ever recorded into a stream, because of the streamCounter == 0 check? It looks like a mistake.
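
To make the second question concrete, here is a minimal host-only sketch that replays the post-launch bookkeeping from the snippet without calling CUDA. It assumes that streamID and streamCounter both start at 0 and that VkFFTSync is not called between launches; StreamState and simulate are hypothetical names used only for illustration, and "recording" an event is modelled by setting a flag.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical host-only model of the stream bookkeeping in the snippet.
// Field names mirror app->configuration; no CUDA call is made.
struct StreamState {
    std::uint64_t num_streams;
    std::uint64_t streamID = 0;       // assumption: starts at 0
    std::uint64_t streamCounter = 0;  // assumption: starts at 0
    std::vector<bool> event_recorded;          // one flag per stream_event[i]
    std::vector<std::uint64_t> launch_stream;  // stream index used by each launch

    explicit StreamState(std::uint64_t n) : num_streams(n), event_recorded(n, false) {}

    void launch() {
        // The kernel goes into the stream selected by the current streamID
        // (or the default stream, modelled as 0, when no streams are set).
        launch_stream.push_back(num_streams >= 1 ? streamID : 0);

        // Bookkeeping after the launch, in the same order as the snippet.
        if (num_streams > 1) {
            streamID = streamCounter % num_streams;
            if (streamCounter == 0) {
                event_recorded[streamID] = true;  // stands in for cudaEventRecord
            }
            streamCounter++;
        }
    }
};

// Replay `launches` consecutive kernel launches with no sync in between.
StreamState simulate(std::uint64_t num_streams, std::uint64_t launches) {
    StreamState state(num_streams);
    for (std::uint64_t i = 0; i < launches; ++i) state.launch();
    return state;
}
```

Under these assumptions, simulate(4, 6) places the launches into streams 0, 0, 1, 2, 3, 0 and marks only event 0 as recorded, immediately after the first launch, so that event does not cover any of the later kernels.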

Next, here is a snippet from the VkFFTSync function.

...

if (app->configuration.num_streams > 1) {
    cudaError_t res = cudaSuccess;
    for (pfUINT s = 0; s < app->configuration.num_streams; s++) {
        res = cudaEventSynchronize(app->configuration.stream_event[s]);
        if (res != cudaSuccess) return VKFFT_ERROR_FAILED_TO_SYNCHRONIZE;
    }
    app->configuration.streamCounter = 0;
}

...

This is where the multiple CUDA streams are synchronized. If I am not mistaken, it synchronizes on events that were never recorded into a stream. It also makes the application synchronous; I guess the cudaStreamWaitEvent function would be more suitable in this case.
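
The difference between the two calls mentioned above can be sketched as follows. This is only an illustration with hypothetical helper names (join_on_host, join_on_device), not VkFFT code. Note that, per the CUDA runtime documentation, cudaEventSynchronize returns cudaSuccess immediately for an event on which cudaEventRecord has never been called, so waiting on such an event does not wait for anything.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: host-blocking join. The calling CPU thread stalls
// until all work enqueued in `producer` before the record has completed.
cudaError_t join_on_host(cudaStream_t producer, cudaEvent_t done) {
    cudaError_t err = cudaEventRecord(done, producer);
    if (err != cudaSuccess) return err;
    return cudaEventSynchronize(done);  // blocks the host
}

// Hypothetical helper: device-side join. Work enqueued into `consumer`
// after this call will not start until `done` has completed, but the
// host returns immediately and can keep enqueueing.
cudaError_t join_on_device(cudaStream_t producer, cudaStream_t consumer,
                           cudaEvent_t done) {
    cudaError_t err = cudaEventRecord(done, producer);
    if (err != cudaSuccess) return err;
    return cudaStreamWaitEvent(consumer, done, 0);  // does not block the host
}
```

In both helpers the event is recorded before anything waits on it, which is the step that appears to be missing for the events at indices other than 0.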

But overall I feel that the whole design of using multiple streams is wrong. What I think would be right is:

When the plan is created, as many events as there are streams should be created.
Then, when the VkFFTAppend function is called, this should happen:
1. An event should be recorded into each stream except the first via cudaEventRecord.
2. The first stream should wait for all of the work in the other streams to finish by calling cudaStreamWaitEvent on each event except the first.
3. All of the work should be launched into the first stream.
4. Once all of the work has been enqueued, the first event should be recorded into the first stream via cudaEventRecord.
5. All of the streams except the first should call cudaStreamWaitEvent on the first event.
The user can then launch more work into the streams.
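
The steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a patch against VkFFT: append_on_first_stream and enqueue_work are hypothetical names, the streams and events are assumed to have been created by the caller beforehand (one event per stream), and enqueue_work stands in for whatever kernel launches the plan performs.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: run a batch of work in streams[0] while keeping it
// ordered with respect to work already queued in every other stream.
template <typename EnqueueFn>
cudaError_t append_on_first_stream(const std::vector<cudaStream_t>& streams,
                                   const std::vector<cudaEvent_t>& events,
                                   EnqueueFn enqueue_work) {
    if (streams.empty() || streams.size() != events.size()) {
        return cudaErrorInvalidValue;
    }
    cudaError_t err = cudaSuccess;

    // Steps 1 and 2: mark the current tail of every other stream and make
    // the first stream wait for it (device-side, host is not blocked).
    for (std::size_t i = 1; i < streams.size(); ++i) {
        err = cudaEventRecord(events[i], streams[i]);
        if (err != cudaSuccess) return err;
        err = cudaStreamWaitEvent(streams[0], events[i], 0);
        if (err != cudaSuccess) return err;
    }

    // Step 3: enqueue all of the work into the first stream.
    err = enqueue_work(streams[0]);
    if (err != cudaSuccess) return err;

    // Step 4: mark the end of that work in the first stream.
    err = cudaEventRecord(events[0], streams[0]);
    if (err != cudaSuccess) return err;

    // Step 5: later work in the other streams waits for the first stream.
    for (std::size_t i = 1; i < streams.size(); ++i) {
        err = cudaStreamWaitEvent(streams[i], events[0], 0);
        if (err != cudaSuccess) return err;
    }
    return cudaSuccess;
}
```

A caller could pass, for example, a lambda such as [&](cudaStream_t s) { return cudaMemsetAsync(ptr, 0, bytes, s); } as enqueue_work, where ptr and bytes are placeholders for a device buffer and its size. Since the events are used only for ordering, creating them with cudaEventCreateWithFlags and cudaEventDisableTiming is a reasonable choice.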

This approach should work fine and even allow the use of CUDA Graphs via stream capture. The same applies to HIP.

Thanks!

David

DTolm commented 6 months ago

Hello,

multiple streams were a test to mimic the Vulkan behavior of shader dispatches to the pipeline, where, unless synchronized, they launch without waiting for completion of the last shader. This is unlike the kernel model of CUDA, where kernels wait for previous kernels. Its usefulness turned out to be very limited: it only matters when a kernel needs multiple dispatches because the grid dimensions exceed device limits (65k for y and z). However, these workloads are typically big and fully utilize the GPU by themselves with low CPU overhead, so using multiple streams was not useful at all. I think you are correct that the synchronization is currently broken in this version; I will need to check your changes in detail when I have more time.
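
As an illustration of the multiple-dispatch case mentioned above, here is a small host-only sketch of splitting one grid dimension into several launches. It is not VkFFT's actual splitting logic; split_dispatch, Chunk and kMaxGridYZ are hypothetical names, and the limit of 65535 is the CUDA maximum for gridDim.y and gridDim.z.

```cpp
#include <cstdint>
#include <vector>

// CUDA limit for gridDim.y and gridDim.z (the "65k" mentioned above).
constexpr std::uint64_t kMaxGridYZ = 65535;

// One launch: covers blocks [offset, offset + count) along the split dimension.
struct Chunk {
    std::uint64_t offset;
    std::uint64_t count;
};

// Hypothetical helper: split `total_blocks` along one grid dimension into
// launches that each stay within `limit` blocks.
std::vector<Chunk> split_dispatch(std::uint64_t total_blocks,
                                  std::uint64_t limit = kMaxGridYZ) {
    std::vector<Chunk> chunks;
    if (limit == 0) return chunks;  // nothing can be launched
    for (std::uint64_t offset = 0; offset < total_blocks; offset += limit) {
        const std::uint64_t remaining = total_blocks - offset;
        chunks.push_back({offset, remaining < limit ? remaining : limit});
    }
    return chunks;
}
```

For example, 200000 blocks along y would need four launches under this scheme, and these are the launches that the multi-stream experiment distributed across streams.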

Best regards, Dmitrii

DejvBayer commented 6 months ago

Sure, the mechanism is described here:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cross-stream-dependencies-and-events

It is just extended to work with an arbitrary number of streams.

David

DTolm / VkFFT

CUDA&HIP stream asynchronicity #163