Is this a duplicate?
Area
cuda.parallel (Python)
Is your feature request related to a problem? Please describe.
Today, CuPy uses Thrust/CUB algorithms to implement much of its functionality. It does this by precompiling Thrust algorithms for a fixed set of types. That approach is undesirable for a few reasons: it increases binary size, and it limits which algorithms can be exposed (segmented sort, for example) due to the combinatorial type explosion.
cuda.parallel can and should be able to replace any existing use of pre-instantiated Thrust/CUB algorithms and provide a few benefits:

- Reduced binary size (by moving to JIT compilation)
- Custom type support
- Custom operator support
- Additional algorithm support (because JIT avoids the type-combination problem)

Describe the solution you'd like
To start, we'd like to build an inventory of which Thrust/CUB algorithms CuPy uses today, and where.
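A rough first pass at that inventory could be scripted. The following is a minimal sketch that scans a source tree for qualified Thrust/CUB identifiers; the checkout path and the set of file extensions are assumptions, not part of this proposal:

```python
import re
from collections import Counter
from pathlib import Path

# Hypothetical location of a CuPy checkout; adjust to your environment.
CUPY_ROOT = Path("cupy")

# Match qualified Thrust/CUB identifiers, e.g. thrust::reduce, cub::DeviceReduce.
PATTERN = re.compile(r"\b(?:thrust|cub)::\w+")

counts = Counter()
for ext in ("*.cu", "*.cuh", "*.h", "*.hpp", "*.pyx", "*.pxd"):
    for path in CUPY_ROOT.rglob(ext):
        text = path.read_text(errors="ignore")
        counts.update(match.group(0) for match in PATTERN.finditer(text))

# Most-used Thrust/CUB entry points first.
for name, count in counts.most_common():
    print(f"{count:6d}  {name}")
```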
From there, we should investigate how we can use cuda.parallel.reduce_into to replace existing uses of cub::DeviceReduce/thrust::reduce.
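For a sense of what that replacement could look like, here is a minimal sketch of a sum reduction with cuda.parallel. It follows the experimental Python API's CUB-style two-phase convention (query temporary storage size, then launch); the exact module path and reduce_into signature are experimental and may change:

```python
import cupy as cp
import numpy as np

# Experimental API; this module path is an assumption and may move between releases.
from cuda.parallel.experimental import algorithms

def add(a, b):
    return a + b

dtype = np.int32
h_init = np.array([0], dtype=dtype)               # initial value of the reduction
d_input = cp.array([1, 2, 3, 4, 5], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)               # holds the single reduced value

# Build (JIT-compile) a reducer specialized for these types and this operator.
reduce_into = algorithms.reduce_into(d_input, d_output, add, h_init)

# CUB-style two-phase call: first query temporary storage requirements...
temp_storage_size = reduce_into(None, d_input, d_output, len(d_input), h_init)
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# ...then run the reduction.
reduce_into(d_temp_storage, d_input, d_output, len(d_input), h_init)

assert d_output.get()[0] == 15
```

Because the operator and types are compiled just in time, the same mechanism should extend to custom operators and custom dtypes, which is exactly the benefit listed above.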
Describe alternatives you've considered
No response
Additional context
No response