I am not able to dig deep into this right now, but there are a couple of additional potential hurdles for the CUDA target:
- The CUDA ufunc mechanism doesn't yet support dynamic ufuncs, which the CPU-target implementation in the PR appears to rely on.
- The `__array_ufunc__` mechanism seems CPU-centric - will we need to define a new `__cuda_array_ufunc__` mechanism for this to be practical? (cf. `__cuda_array_interface__` vs. `__array_interface__`.)
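To illustrate the first hurdle: on the CPU target, `@vectorize` with no signature list produces a dynamic ufunc (`DUFunc`) that compiles lazily for each new input type, whereas the CUDA target currently requires an explicit signature list up front. A minimal sketch:

```python
from numba import vectorize

# CPU target: no signatures given, so this creates a dynamic ufunc
# (DUFunc) that compiles on first call for each new input dtype.
@vectorize
def add(x, y):
    return x + y

add(1, 2)      # triggers lazy compilation for int64
add(1.5, 2.5)  # triggers a second compilation for float64

# CUDA target: signatures must be listed explicitly, and there is no
# dynamic-ufunc equivalent yet, e.g.:
# @vectorize(['float32(float32, float32)'], target='cuda')
# def add_gpu(x, y):
#     return x + y
```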
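For context on the second hurdle, this is the CPU-side protocol the PR builds on: when an operand defines `__array_ufunc__`, NumPy hands the ufunc call to it instead of executing the ufunc itself. A minimal sketch with a hypothetical wrapper class (`LoggedArray` is illustrative, not from the PR):

```python
import numpy as np

class LoggedArray:
    """Hypothetical array wrapper that intercepts ufunc calls."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Unwrap any LoggedArray operands, run the underlying ufunc,
        # and re-wrap the result so the wrapper type propagates.
        raw = [x.data if isinstance(x, LoggedArray) else x for x in inputs]
        result = getattr(ufunc, method)(*raw, **kwargs)
        return LoggedArray(result)

a = LoggedArray([1.0, 2.0])
out = np.add(a, 3.0)  # NumPy dispatches through a.__array_ufunc__
print(out.data)       # [4. 5.]
```

A device-side analogue would presumably need a similar hook that CUDA ufunc dispatch consults, which is what the `__cuda_array_ufunc__` question above is getting at.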
Details / notes:
From "Allow libraries that implement `__array_ufunc__` to override CUDAUFuncDispatcher" on the Numba Discourse.
There was a PR implementing this for the CPU target: https://github.com/numba/numba/pull/8995
A related issue on the Awkward issue tracker: https://github.com/scikit-hep/awkward/issues/3179
This is to support using Coffea on CUDA.
cc @ianna @lgray