[FEA] Allow libraries that implement `__array_ufunc__` to override CUDAUFunc

ianna commented 3 months ago

Is your feature request related to a problem? Please describe. Array-like objects that define an __array_ufunc__ method (NEP-13) can be used with ufuncs created by np.vectorize as follows:

>>> import numba as nb
>>> @nb.vectorize()
... def _square(x):
...     return x * x
... 
>>> _square        
<numba._DUFunc '_square'>

we would like to have similar functionality on CUDA to allow the following:

>>> @nb.vectorize(
...     target="cuda",
... )
... def _square_cuda(x):
...     return x * x
... 
>>> _square_cuda
<numba.cuda.vectorizers.CUDAUFuncDispatcher object at 0x7259f82fa790>

Describe the solution you'd like Maybe this function is missing __array_ufunc__ handling? https://github.com/NVIDIA/numba-cuda/blob/main/numba_cuda/numba/cuda/deviceufunc.py#L241-L329

Describe alternatives you've considered If we wrap in a flatten/unflatten we are able to get this to work, which is a bit clunky.

import awkward as ak
import cupy as cp
import numba as nb

ak.numba.register_and_check()

@nb.vectorize()
def _square(x):
    return x * x

@nb.vectorize(
    target="cuda",
)
def _square_cuda(x):
    return x * x

def square_cuda_wrapped(x):
    counts = x.layout.offsets.data[1:] - x.layout.offsets.data[:-1]
    return ak.unflatten(cp.array(_square_cuda(ak.flatten(x))), counts)

counts = cp.random.poisson(lam=3, size=5000000)
flat_values = cp.random.normal(size=int(counts.sum()))

values = ak.unflatten(flat_values, counts)

values2_cpu = _square(ak.to_backend(values, "cpu"))

print(values2_cpu)

values2 = square_cuda_wrapped(values)

print(values2)

Additional info Version of Awkward Array is 2.6.6 Code to reproduce:

import awkward as ak
import cupy as cp
import numba as nb

ak.numba.register_and_check()

@nb.vectorize()
def _square(x):
    return x * x

@nb.vectorize(
    target="cuda",
)
def _square_cuda(x):
    return x * x

counts = cp.random.poisson(lam=3, size=50)
flat_values = cp.random.normal(size=int(counts.sum()))

values = ak.unflatten(flat_values, counts)

values2_cpu = _square(ak.to_backend(values, "cpu"))

print(values2_cpu)

values2 = _square_cuda(values)

print(values2)

resulting in the output:

[[0.045, 1.83], [1.55, 0.224, 0.621, ..., 1.24, 3.87], ..., [0.153, 0.0017]]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[12], line 33
     29 values2_cpu = _square(ak.to_backend(values, "cpu"))
     31 print(values2_cpu)
---> 33 values2 = _square_cuda(values)
     35 print(values2)

File ~/.conda/envs/coffea-gpu/lib/python3.11/site-packages/numba/cuda/vectorizers.py:28, in CUDAUFuncDispatcher.__call__(self, *args, **kws)
     17 def __call__(self, *args, **kws):
     18     """
     19     *args: numpy arrays or DeviceArrayBase (created by cuda.to_device).
     20            Cannot mix the two types in one call.
   (...)
     26                   the input arguments.
     27     """
---> 28     return CUDAUFuncMechanism.call(self.functions, args, kws)

File ~/.conda/envs/coffea-gpu/lib/python3.11/site-packages/numba/np/ufunc/deviceufunc.py:254, in UFuncMechanism.call(cls, typemap, args, kws)
    252 # Begin call resolution
    253 cr = cls(typemap, args)
--> 254 args = cr.get_arguments()
    255 resty, func = cr.get_function()
    257 outshape = args[0].shape

File ~/.conda/envs/coffea-gpu/lib/python3.11/site-packages/numba/np/ufunc/deviceufunc.py:202, in UFuncMechanism.get_arguments(self)
    198 def get_arguments(self):
    199     """Prepare and return the arguments for the ufunc.
    200     Does not call to_device().
    201     """
--> 202     self._fill_arrays()
    203     self._fill_argtypes()
    204     self._resolve_signature()

File ~/.conda/envs/coffea-gpu/lib/python3.11/site-packages/numba/np/ufunc/deviceufunc.py:100, in UFuncMechanism._fill_arrays(self)
     98     self.scalarpos.append(i)
     99 else:
--> 100     self.arrays[i] = np.asarray(arg)

File [~/coffea-gpu/awkward/src/awkward/highlevel.py:1434](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/highlevel.py#line=1433), in Array.__array__(self, dtype)
   1429 with ak._errors.OperationErrorContext(
   1430     "numpy.asarray", (self,), {"dtype": dtype}
   1431 ):
   1432     from awkward._connect.numpy import convert_to_array
-> 1434     return convert_to_array(self._layout, dtype=dtype)

File [~/coffea-gpu/awkward/src/awkward/_connect/numpy.py:481](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/_connect/numpy.py#line=480), in convert_to_array(layout, dtype)
    480 def convert_to_array(layout, dtype=None):
--> 481     out = ak.operations.to_numpy(layout, allow_missing=False)
    482     if dtype is None:
    483         return out

File [~/coffea-gpu/awkward/src/awkward/_dispatch.py:64](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/_dispatch.py#line=63), in named_high_level_function.<locals>.dispatch(*args, **kwargs)
     62 # Failed to find a custom overload, so resume the original function
     63 try:
---> 64     next(gen_or_result)
     65 except StopIteration as err:
     66     return err.value

File [~/coffea-gpu/awkward/src/awkward/operations/ak_to_numpy.py:48](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/operations/ak_to_numpy.py#line=47), in to_numpy(array, allow_missing)
     45 yield (array,)
     47 # Implementation
---> 48 return _impl(array, allow_missing)

File [~/coffea-gpu/awkward/src/awkward/operations/ak_to_numpy.py:60](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/operations/ak_to_numpy.py#line=59), in _impl(array, allow_missing)
     57 backend = NumpyBackend.instance()
     58 numpy_layout = layout.to_backend(backend)
---> 60 return numpy_layout.to_backend_array(allow_missing=allow_missing)

File [~/coffea-gpu/awkward/src/awkward/contents/content.py:1020](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/contents/content.py#line=1019), in Content.to_backend_array(self, allow_missing, backend)
   1018 else:
   1019     backend = regularize_backend(backend)
-> 1020 return self._to_backend_array(allow_missing, backend)

File [~/coffea-gpu/awkward/src/awkward/contents/listoffsetarray.py:2072](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/contents/listoffsetarray.py#line=2071), in ListOffsetArray._to_backend_array(self, allow_missing, backend)
   2070     return buffer.view(np.dtype(("S", max_count)))
   2071 else:
-> 2072     return self.to_RegularArray()._to_backend_array(allow_missing, backend)

File [~/coffea-gpu/awkward/src/awkward/contents/listoffsetarray.py:283](https://analytics-hub.fnal.gov/user/lagray/lab/tree/coffea-gpu/coffea-gpu/awkward/src/awkward/contents/listoffsetarray.py#line=282), in ListOffsetArray.to_RegularArray(self)
    278 _size = Index64.empty(1, self._backend.index_nplike)
    279 assert (
    280     _size.nplike is self._backend.index_nplike
    281     and self._offsets.nplike is self._backend.index_nplike
    282 )
--> 283 self._backend.maybe_kernel_error(
    284     self._backend[
    285         "awkward_ListOffsetArray_toRegularArray",
    286         _size.dtype.type,
    287         self._offsets.dtype.type,
    288     ](
    289         _size.data,
    290         self._offsets.data,
    291         self._offsets.length,
    292     )
    293 )
    294 size = self._backend.index_nplike.index_as_shape_item(_size[0])
    295 length = self._offsets.length - 1

File [awkward/src/awkward/_backends/backend.py:67](awkward/src/awkward/_backends/backend.py#line=66), in Backend.maybe_kernel_error(self, error)
     65     return
     66 else:
---> 67     raise ValueError(self.format_kernel_error(error))

ValueError: cannot convert to RegularArray because subarray lengths are not regular (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-35/awkward-cpp/src/cpu-kernels/awkward_ListOffsetArray_toRegularArray.cpp#L22)

This error occurred while calling

    numpy.asarray(
        <Array [[0.2120250193289826, ...], ...] type='50 * var * float64'>
        dtype = None
    )

ianna commented 3 months ago

duplicate of https://github.com/NVIDIA/numba-cuda/issues/36

gmarkall commented 3 months ago

Reopening this as it's a much better-written request than mine in #36. My notes from #36:

Discussion originally from Allow libraries that implement __array_ufunc__ to override CUDAUFuncDispatcher on the Numba Discourse.
There was a PR implementing this for the CPU target: https://github.com/numba/numba/pull/8995
A related issue on the Awkward issue tracker: https://github.com/scikit-hep/awkward/issues/3179
This is to support using Coffea on CUDA.
The CUDA Ufunc mechanism doesn't yet support dynamic Ufuncs, which seem to be a part of the implementation of the PR for the CPU target.
The __array_ufunc__ mechanism seems CPU-centric - will we need to define a new __cuda_array_ufunc__ mechanism for this to be practical? c.f. __cuda_array_interface__ vs. __array_interface__.

lgray commented 3 months ago

Awesome! Thank you both for getting this started.

lgray commented 3 months ago

An off the cuff two cents: but wouldn't __cuda_array_interface__ + __array_ufunc__ imply __cuda_array_ufunc__, so a new interface may not be entirely necessary?

Dispatch to other ufuncs is already handled quite well by awkward. i.e. an awkward array with cuda backend can be straightforwardly passed to np.abs (or similar) and it correctly dispatches to cp.abs.

lgray commented 3 months ago

Just to bump this - has there been any further thought in this direction?

gmarkall commented 3 months ago

@lgray - thanks for the bump - I haven't been able to look into this further yet as I don't have enough of a grasp of the concepts to sketch out an implementation plan further without spending time doing some research... I think you might be a bit more ahead of me in your thinking about this - do you have some thoughts about what the implementation should / could look like?

lgray commented 3 months ago

@gmarkall I don't really have recommendations on low level implementation, but I do know how we would like things to operate from a high level.

Essentially we'd like our data scientists (high energy particle physics experiment scientists and PhD students) to we able to design analyses on their laptops for CPU and redeploy it with a few configuration changes on GPU using awkward array.

We can detect when data is on GPU vs. CPU and switch between kernels automatically and easily with awkward, so that's largely a matter of user interface.

What we need from numba-cuda is for it to interact seamlessly with awkward arrays that are on-device as well as it already does with host-side arrays and regular numba.

Training users to write effective cuda kernels with numba is a different matter entirely that we will not touch here. I'm just considering pretty simple ufuncs that you get through @vectorize.

So really, on the backend we just need it to be able to identify ufuncs and then to be able to distinguish when those ufuncs accept device side arrays. So I think some scaffolding is missing, and not much else, essentially to smoothen the experience on the user side? I'm not quite aware of the entry points to change things to give it a shot, off the top of my head.

lgray commented 2 months ago

@gmarkall I was talking to @jpivarski and @ianna last week and I hadn't realized that cupy itself didn't implement a nep13-like protocol when calling the cupy version of the ufunc. So this makes it clear why this had problems working in the first place and now I agree that we need something like nep13 so that awkward can detect and override the application of cupy specific ufuncs. Then we can use that with numba and we're in a much better place.

Is this a more accurate understand of the situation from my side?

gmarkall commented 2 months ago

It sounds like you've mapped out the issue and what we need to do to resolve it a bit further - I still don't have any expertise in this area so I can't comment definitively, but what you said makes sense and it seems to give us more understanding of the situation.

Do we need to have a feature request in CuPy for NEP13 or some NEP13-like support?

ianna commented 2 months ago

Yes, I think we need to ask @leofang. I will open an issue on CuPy github.

NVIDIA / numba-cuda

[FEA] Allow libraries that implement `__array_ufunc__` to override CUDAUFunc #37