[RFE] Use cudaLaunchKernel instead of <<<>>>

benbarsdell commented 3 years ago

The main reason for this request is to improve error handling. When using <<<>>>, CUB currently has to call cudaPeekAtLastError after the launch to check for invalid configuration errors. However, this API also returns invalid configuration errors from previous launches. If cudaLaunchKernel is used instead then its return value can be checked directly and it is unaffected by previous invalid configuration errors.

There is also a small performance benefit to using cudaLaunchKernel.

The only downside is that cudaLaunchKernel cannot perform template type deduction or implicit argument conversions. However, type safety can be achieved using a wrapper like this:

template <typename... Args>
cudaError_t cuda_launch_kernel(void (*kernel)(Args...), dim3 grid, dim3 block,
                               size_t smem, cudaStream_t stream,
                               const Args&... args) {
  void* arg_ptrs[] = {(void*)&args...};
  return cudaLaunchKernel((void*)kernel, grid, block, arg_ptrs, smem, stream);
}

jrhemstad commented 3 years ago

Doesn't cudaLaunchKernel have the exact same problem? It can return error codes from previous async calls:

Note that this function may also return error codes from previous, asynchronous launches.

maddyscientist commented 3 years ago

I think the difference is that it only returns sticky error codes from prior launches, but harmless error codes are not returned. E.g, an out of bounds write would flag illegal address and that would return in a subsequent cudaLaunchKernel, but a failed kernel launch due too many resources being used would not.

alliepiper commented 3 years ago

I'm tentatively slating this for 1.14. I'm in the middle of rewriting the kernel dispatch mechanisms Thrust and CUB, and eventually we'll port the triple_chevron_launcher class from Thrust to CUB, which already has similar code to repack the arguments for cudaLaunchDevice CDP launches. We may be able to reuse that logic here.

alliepiper commented 3 years ago

It was just pointed out in NVBug 200715408 that this will also WAR some issues related to templated kernels and shared libraries that result in corrupted cudart state. Bumping up the priority.

NVIDIA / cccl

[RFE] Use cudaLaunchKernel instead of <<<>>> #874