Open ianna opened 3 months ago
duplicate of https://github.com/NVIDIA/numba-cuda/issues/36
Reopening this as it's a much better-written request than mine in #36. My notes from #36:
__array_ufunc__ mechanism seems CPU-centric - will we need to define a new __cuda_array_ufunc__ mechanism for this to be practical? c.f. __cuda_array_interface__ vs. __array_interface__.
Awesome! Thank you both for getting this started.
An off the cuff two cents: but wouldn't __cuda_array_interface__ + __array_ufunc__ imply __cuda_array_ufunc__, so a new interface may not be entirely necessary?
Dispatch to other ufuncs is already handled quite well by awkward, i.e. an awkward array with cuda backend can be straightforwardly passed to np.abs (or similar) and it correctly dispatches to cp.abs.
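For readers unfamiliar with NEP-13, the dispatch described above can be sketched roughly as follows. This is only an illustration, not Awkward's actual implementation: the `Tagged` class and its `backend` attribute are hypothetical names, and the sketch always computes with NumPy where a real library would look up a backend-specific ufunc (for example the CuPy counterpart) at the marked step.

```python
import numpy as np


class Tagged:
    """Hypothetical array wrapper used only for illustration."""

    def __init__(self, data, backend="cpu"):
        self.data = np.asarray(data)
        self.backend = backend  # assumed label, e.g. "cpu" or "cuda"

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Only handle plain calls such as np.abs(x); decline reduce, accumulate, etc.
        if method != "__call__":
            return NotImplemented
        # Unwrap any Tagged inputs to plain ndarrays.
        raw = [x.data if isinstance(x, Tagged) else x for x in inputs]
        # A real library would choose a backend-specific ufunc here,
        # keyed on self.backend and ufunc.__name__. This sketch only has NumPy.
        result = ufunc(*raw, **kwargs)
        return Tagged(result, backend=self.backend)


x = Tagged([-1.0, 2.0, -3.0], backend="cuda")
y = np.abs(x)  # NumPy hands control to Tagged.__array_ufunc__
```

The point of the sketch is that NumPy ufuncs consult `__array_ufunc__` on their arguments before doing any work, which is what lets an array library redirect the call.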
Just to bump this - has there been any further thought in this direction?
@lgray - thanks for the bump - I haven't been able to look into this further yet as I don't have enough of a grasp of the concepts to sketch out an implementation plan further without spending time doing some research... I think you might be a bit more ahead of me in your thinking about this - do you have some thoughts about what the implementation should / could look like?
@gmarkall I don't really have recommendations on low level implementation, but I do know how we would like things to operate from a high level.
Essentially we'd like our data scientists (high energy particle physics experiment scientists and PhD students) to be able to design analyses on their laptops for CPU and redeploy them with a few configuration changes on GPU using awkward array.
We can detect when data is on GPU vs. CPU and switch between kernels automatically and easily with awkward, so that's largely a matter of user interface.
What we need from numba-cuda is for it to interact seamlessly with awkward arrays that are on-device as well as it already does with host-side arrays and regular numba.
Training users to write effective cuda kernels with numba is a different matter entirely that we will not touch here. I'm just considering pretty simple ufuncs that you get through @vectorize.
So really, on the backend we just need it to be able to identify ufuncs and then to be able to distinguish when those ufuncs accept device side arrays. So I think some scaffolding is missing, and not much else, essentially to smoothen the experience on the user side? I'm not quite aware of the entry points to change things to give it a shot, off the top of my head.
@gmarkall I was talking to @jpivarski and @ianna last week and I hadn't realized that cupy itself didn't implement a nep13-like protocol when calling the cupy version of the ufunc. So this makes it clear why this had problems working in the first place and now I agree that we need something like nep13 so that awkward can detect and override the application of cupy specific ufuncs. Then we can use that with numba and we're in a much better place.
Is this a more accurate understanding of the situation from my side?
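To make the "something like nep13" idea concrete, here is a rough pure-Python sketch of what an override check inside a non-NumPy ufunc object could look like. Everything in it is hypothetical: `DeviceUFunc` stands in for a device ufunc and `Wrapped` for an array library's container, and neither reflects how CuPy or numba-cuda are actually written.

```python
class DeviceUFunc:
    """Hypothetical stand-in for a device-side ufunc object."""

    def __init__(self, name, impl):
        self.__name__ = name
        self._impl = impl  # the function that would launch the kernel

    def __call__(self, *args):
        # NEP-13-style step: let each argument's type override the call.
        saw_override = False
        for arg in args:
            override = getattr(type(arg), "__array_ufunc__", None)
            if override is None:
                continue
            saw_override = True
            result = override(arg, self, "__call__", *args)
            if result is not NotImplemented:
                return result
        if saw_override:
            # Every override declined, so the operation is unsupported.
            raise TypeError(f"operand type(s) do not support {self.__name__}")
        # No argument wants to override: run the normal implementation.
        return self._impl(*args)


class Wrapped:
    """Hypothetical container that unwraps, delegates, and rewraps."""

    def __init__(self, values):
        self.values = list(values)

    def __array_ufunc__(self, ufunc, method, *inputs):
        if method != "__call__":
            return NotImplemented
        raw = [x.values if isinstance(x, Wrapped) else x for x in inputs]
        return Wrapped(ufunc(*raw))


# The lambda is a placeholder for a compiled elementwise kernel.
add_one = DeviceUFunc("add_one", lambda xs: [x + 1 for x in xs])
out = add_one(Wrapped([1, 2, 3]))
```

With a check like this in the ufunc's call path, a container can intercept the call, pass its underlying buffers to the same ufunc, and rebuild its own structure around the result, without the user wrapping anything by hand.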
It sounds like you've mapped out the issue and what we need to do to resolve it a bit further - I still don't have any expertise in this area so I can't comment definitively, but what you said makes sense and it seems to give us more understanding of the situation.
Do we need to have a feature request in CuPy for NEP13 or some NEP13-like support?
Yes, I think we need to ask @leofang. I will open an issue on CuPy github.
Is your feature request related to a problem? Please describe.
Array-like objects that define an __array_ufunc__ method (NEP-13) can be used with ufuncs created by np.vectorize as follows:
We would like to have similar functionality on CUDA to allow the following:
Describe the solution you'd like
Maybe this function is missing __array_ufunc__ handling? https://github.com/NVIDIA/numba-cuda/blob/main/numba_cuda/numba/cuda/deviceufunc.py#L241-L329
Describe alternatives you've considered
If we wrap in a flatten/unflatten we are able to get this to work, which is a bit clunky.
Additional info
Version of Awkward Array is 2.6.6
Code to reproduce:
resulting in the output: