eyalroz / cuda-api-wrappers

Thin, unified, C++-flavored wrappers for the CUDA APIs
BSD 3-Clause "New" or "Revised" License

Adapting code to `cuda::kernel::wrap` API changes #486

Closed: fvanmaele closed this issue 1 year ago

fvanmaele commented 1 year ago

I have some old code which uses the old cuda::kernel::wrap API; that API was changed in commit bc53844ef215dc974c02adca35240ad5d83c68af. The call looks as follows:

            return cuda::kernel::wrap(
                cuda::device::current::detail_::get_id(),
                kernel::reduce<M_ic,
                               block_dim_ic,
                               num_partitions_per_block_ic,
                               num_rhs_batch_ic,
                               pivoting,
                               int,
                               T,
                               TR,
                               typename make_scalar_type<T>::type>);

kernel::reduce above is a void function. See: https://mp-force.ziti.uni-heidelberg.de/fvanmaele/tridigpu/-/blob/master/include/tridigpu/reduction.h#L58-80 and https://mp-force.ziti.uni-heidelberg.de/fvanmaele/tridigpu/-/blob/master/include/tridigpu/reduction.h#L105

How can I adapt this to the new interface? Unfortunately, I am not familiar with this project, and I couldn't find a migration guide for the API either. Changing the call to cuda::kernel::get was unsuccessful.

eyalroz commented 1 year ago

tl;dr: cuda::kernel::get(my_device, my_plain_vanilla_kernel_function) is known to work; see the execution control example.
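
For example, the call in the question would become something like the sketch below. Note that cuda::device::current::get() is assumed here as the (non-detail_) way of obtaining a device_t for the current device; the question itself only uses the detail_ id getter.

    auto my_device = cuda::device::current::get();  // assumed device_t getter
    return cuda::kernel::get(
        my_device,
        kernel::reduce<M_ic,
                       block_dim_ic,
                       num_partitions_per_block_ic,
                       num_rhs_batch_ic,
                       pivoting,
                       int,
                       T,
                       TR,
                       typename make_scalar_type<T>::type>);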


The effect of that big change on the wrap() function is that it now requires a CUDA context. People who are used to the runtime API don't know about contexts, and that's ok, but wrap() is the lowest-level non-detail_ API call I offer, and it must really "not do anything" on its own, so it needs the context passed in. Under the hood, the CUDA runtime API uses something called the "primary context" on each device.

So, your options are:

  1. Get the handle of the primary context on the device. This is not entirely trivial, since the primary context is reference-counted; but writing auto my_pcontext = my_device.primary_context(), keeping that object alive, and using my_pcontext.handle() does the trick (see the sketch after this list).
  2. Use the convenience function template I offer for more runtime-API-oriented people, cuda::kernel::get(const device_t& device, KernelFunctionPtr function_ptr), where you only pass a device and a kernel function. That takes care of the context magic for you. Note: it takes a device_t, not a device id!
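
Here is a minimal sketch of option 1, using only the calls named above. As in the earlier sketch, cuda::device::current::get() is an assumption for obtaining a device_t; also, the exact parameter list of the post-change wrap() is not spelled out in this thread, so check its declaration before passing the handle along.

    auto my_device = cuda::device::current::get();   // assumed device_t getter
    auto my_pcontext = my_device.primary_context();  // reference-counted primary context
    // Keep my_pcontext alive for as long as the wrapped kernel object is used.
    auto context_handle = my_pcontext.handle();
    // context_handle is what the post-change cuda::kernel::wrap() needs in addition
    // to the kernel function; its exact parameter order is not given in this thread,
    // so consult the wrap() declaration before calling it.
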
eyalroz commented 1 year ago

Assuming the question is answered to @fvanmaele's satisfaction... please comment again if that's not the case.