NVIDIA / cccl

CUDA Core Compute Libraries
https://nvidia.github.io/cccl/

[EPIC] Investigate refactoring CuPy to use cuda.parallel #2958

Open jrhemstad opened 4 hours ago

Is this a duplicate?

Area

cuda.parallel (Python)

Is your feature request related to a problem? Please describe.

Today, CuPy uses Thrust/CUB algorithms to implement much of its functionality. This works by precompiling Thrust algorithms for a fixed set of types, which is undesirable for a few reasons: it increases binary size, and it limits exposure of some algorithms (like segmented sort) because of the combinatorial explosion of type combinations.
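
To make the combinatorial type explosion concrete, here is a rough back-of-the-envelope sketch. The dtype lists are assumptions chosen for illustration only and are not CuPy's actual build matrix; the point is just that every additional templated type axis multiplies the number of precompiled instantiations.

```python
from itertools import product

# Hypothetical dtype lists, for illustration only (not CuPy's real build matrix).
KEY_DTYPES = [
    "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "float16", "float32", "float64",
]
VALUE_DTYPES = KEY_DTYPES
OFFSET_DTYPES = ["int32", "int64"]


def count_instantiations(*type_axes):
    """Count the distinct type combinations a precompiled build would need."""
    return sum(1 for _ in product(*type_axes))


# An algorithm templated on a single type needs one instantiation per dtype.
keys_only = count_instantiations(KEY_DTYPES)

# A key/value algorithm over segments is templated on several types at once
# (assumed here: key type, value type, offset type), so the counts multiply.
segmented_pairs = count_instantiations(KEY_DTYPES, VALUE_DTYPES, OFFSET_DTYPES)

print(f"single-type algorithm:     {keys_only} instantiations")
print(f"key/value/offset algorithm: {segmented_pairs} instantiations")
```

With these assumed lists, the single-axis case needs 11 instantiations while the three-axis case needs 11 * 11 * 2 = 242, before counting any variation in comparison or reduction operators.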

cuda.parallel can and should be able to replace any existing use of pre-instantiated Thrust/CUB algorithms and provide a few benefits:

Describe the solution you'd like

To start, we'd like to have an inventory of which Thrust/CUB algorithms CuPy is using today and where.
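
One possible way to bootstrap that inventory is a plain text scan of a CuPy checkout for `thrust::` and `cub::` symbols. The sketch below is a starting point under stated assumptions: the `inventory` helper, the regular expression, and the list of file suffixes are all hypothetical choices, and a text scan will miss uses hidden behind macros, aliases, or `using` declarations.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches qualified names such as thrust::reduce or cub::DeviceReduce::Sum.
SYMBOL_RE = re.compile(r"\b(?:thrust|cub)::\w+(?:::\w+)*")

# Assumed set of source file types worth scanning; adjust for the real tree.
SOURCE_SUFFIXES = {".cu", ".cuh", ".h", ".hpp", ".cpp", ".pyx", ".pxd", ".py"}


def inventory(root):
    """Map each thrust::/cub:: symbol to a list of (relative path, line number)."""
    found = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in SOURCE_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for symbol in SYMBOL_RE.findall(line):
                location = (path.relative_to(root).as_posix(), lineno)
                found[symbol].append(location)
    return dict(found)
```

Calling `inventory("path/to/cupy")` returns a dictionary keyed by symbol, so sorting its keys gives the list of algorithms in use and each value gives the places that would need to change.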

From there, we should investigate how we can use cuda.parallel.reduce_into to replace existing uses of cub::DeviceReduce/thrust::reduce.

Describe alternatives you've considered

No response

Additional context

No response