Is this a duplicate?
Area
cuda.parallel (Python)
Is your feature request related to a problem? Please describe.
Today, CuPy uses Thrust/CUB algorithms to implement much of its functionality. It does this by precompiling Thrust algorithms for a fixed set of types. That approach is undesirable for a few reasons: it increases binary size, and it limits which algorithms can be exposed (segmented sort, for example) due to the combinatorial type explosion.
cuda.parallel can and should be able to replace any existing use of pre-instantiated Thrust/CUB algorithms and provide a few benefits:

- Reduced binary size (by moving to JIT compilation)
- Custom type support
- Custom operator support
- Additional algorithm support (because JIT avoids the type-combination problem)

Describe the solution you'd like
To start, we'd like to build an inventory of which Thrust/CUB algorithms CuPy uses today, and where.
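A rough first pass at that inventory could be scripted. The following is a minimal sketch that scans a source tree for qualified Thrust/CUB identifiers; the checkout path and the set of file extensions are assumptions, not part of this proposal:

```python
import re
from collections import Counter
from pathlib import Path

# Hypothetical location of a CuPy checkout; adjust to your environment.
CUPY_ROOT = Path("cupy")

# Match qualified Thrust/CUB identifiers, e.g. thrust::reduce, cub::DeviceReduce.
PATTERN = re.compile(r"\b(?:thrust|cub)::\w+")

counts = Counter()
for ext in ("*.cu", "*.cuh", "*.h", "*.hpp", "*.pyx", "*.pxd"):
    for path in CUPY_ROOT.rglob(ext):
        text = path.read_text(errors="ignore")
        counts.update(match.group(0) for match in PATTERN.finditer(text))

# Most-used Thrust/CUB entry points first.
for name, count in counts.most_common():
    print(f"{count:6d}  {name}")
```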
From there, we should investigate how we can use cuda.parallel.reduce_into to replace existing uses of cub::DeviceReduce/thrust::reduce.
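For a sense of what that replacement could look like, here is a minimal sketch of a sum reduction with cuda.parallel. It follows the experimental Python API's CUB-style two-phase convention (query temporary storage size, then launch); the exact module path and reduce_into signature are experimental and may change:

```python
import cupy as cp
import numpy as np

# Experimental API; this module path is an assumption and may move between releases.
from cuda.parallel.experimental import algorithms

def add(a, b):
    return a + b

dtype = np.int32
h_init = np.array([0], dtype=dtype)               # initial value of the reduction
d_input = cp.array([1, 2, 3, 4, 5], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)               # holds the single reduced value

# Build (JIT-compile) a reducer specialized for these types and this operator.
reduce_into = algorithms.reduce_into(d_input, d_output, add, h_init)

# CUB-style two-phase call: first query temporary storage requirements...
temp_storage_size = reduce_into(None, d_input, d_output, len(d_input), h_init)
d_temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)

# ...then run the reduction.
reduce_into(d_temp_storage, d_input, d_output, len(d_input), h_init)

assert d_output.get()[0] == 15
```

Because the operator and types are compiled just in time, the same mechanism should extend to custom operators and custom dtypes, which is exactly the benefit listed above.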
Describe alternatives you've considered
No response
Additional context
No response