Below is a list of tasks in prioritized order. We should start with algorithms that already exist in CUB. This will allow delivering CUB optimizations into Thrust sooner.
Document procedure
Replace Thrust Algorithms with CUB
[ ] Refactor thrust::reduce_by_key to use cub::DeviceReduce::ReduceByKey
[ ] Refactor thrust/extrema.h to use cub::DeviceReduce
[ ] Make sure that Thrust calls into CUB algorithms go through the Device* interface, so that we use the index type the CUB algorithm is optimized for. Alternatively, go through the Dispatch* interface, but make sure to use the right offset type
Port Thrust Algorithms into CUB
[ ] Port thrust/thrust/system/cuda/detail/set_operations.h to CUB
A few notes:
thrust::partition_copy and thrust::stable_partition_copy take two distinct output iterators: one for the selected items and one for the rejected items. DevicePartition, however, currently supports only a single output iterator: the selected items are written to the beginning in order, and the rejected items are written to the end in reverse order. Supporting these two Thrust algorithms requires extending AgentSelectIf and implementing overloads for methods such as ScatterTwoPhase that write the rejected items to the output iterators.
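To make the layout difference concrete, here is a minimal host-side sketch of the two behaviors described in this note. It is a plain C++ model, not CUB or Thrust code; the function names `partition_single_output` and `partition_two_outputs` are hypothetical and exist only for illustration.

```cpp
#include <cstddef>
#include <vector>

// Host-side model of the single-output layout described in the note:
// selected items go to the front in input order, rejected items are
// written from the back, so they end up in reverse input order.
// Returns the number of selected items.
template <typename T, typename Pred>
std::size_t partition_single_output(const std::vector<T>& in,
                                    std::vector<T>& out,
                                    Pred pred)
{
    out.resize(in.size());
    std::size_t num_selected = 0;
    std::size_t num_rejected = 0;
    for (const T& x : in) {
        if (pred(x)) {
            out[num_selected++] = x;
        } else {
            out[in.size() - 1 - num_rejected++] = x;
        }
    }
    return num_selected;
}

// Host-side model of what partition_copy needs instead: two distinct
// outputs, each preserving input order.
template <typename T, typename Pred>
void partition_two_outputs(const std::vector<T>& in,
                           std::vector<T>& selected,
                           std::vector<T>& rejected,
                           Pred pred)
{
    selected.clear();
    rejected.clear();
    for (const T& x : in) {
        (pred(x) ? selected : rejected).push_back(x);
    }
}
```

With the input `{1, 2, 3, 4, 5, 6}` and an "is even" predicate, the first function yields `{2, 4, 6, 5, 3, 1}`, while the second yields `{2, 4, 6}` and `{1, 3, 5}`. The gap between those two results is what the rejected-item scatter overloads would have to close.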
Partition and copy don't have a stencil overload in CUB. We might experiment with using the ::Flagged version along with a transform iterator.
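The idea can be sketched on the host as follows. This is a plain C++ model under the assumption that a stencil overload is equivalent to a flagged selection whose flags are computed lazily as `pred(stencil[i])`. The names `select_flagged` and `select_by_stencil` are hypothetical, and the lambda only stands in for a transform iterator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of a flagged selection: keep in[i] wherever flag_at(i)
// is true. FlagAt stands in for a flag iterator; this is not a CUB API.
template <typename T, typename FlagAt>
std::vector<T> select_flagged(const std::vector<T>& in, FlagAt flag_at)
{
    std::vector<T> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (flag_at(i)) {
            out.push_back(in[i]);
        }
    }
    return out;
}

// Stencil overload expressed through the flagged version: the flag for
// item i is computed on demand by applying the predicate to stencil[i],
// which is the role a transform iterator would play on the device.
template <typename T, typename S, typename Pred>
std::vector<T> select_by_stencil(const std::vector<T>& in,
                                 const std::vector<S>& stencil,
                                 Pred pred)
{
    assert(in.size() == stencil.size());
    return select_flagged(in, [&](std::size_t i) {
        return static_cast<bool>(pred(stencil[i]));
    });
}
```

No intermediate flag array is materialized here, which is the main appeal of the transform-iterator approach. Whether this performs as well on the device as a dedicated stencil path is something the experiment would need to show.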
While porting thrust::reduce_by_key, we need to decide on the accumulator type.
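A small host-side sketch shows why the choice is observable. It is a plain C++ model of a segmented sum with the accumulator type as an explicit template parameter. The name `reduce_by_key_sum` is hypothetical, and the sketch says nothing about which type Thrust or CUB actually use.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Host-side model of reduce_by_key with a sum, where the accumulator type
// AccumT is chosen by the caller. Consecutive equal keys form one run.
// Assumes keys.size() == values.size().
template <typename AccumT, typename K, typename V>
std::vector<std::pair<K, AccumT>> reduce_by_key_sum(const std::vector<K>& keys,
                                                    const std::vector<V>& values)
{
    std::vector<std::pair<K, AccumT>> out;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (!out.empty() && out.back().first == keys[i]) {
            // The running sum is narrowed back to AccumT after every step.
            out.back().second =
                static_cast<AccumT>(out.back().second + values[i]);
        } else {
            out.emplace_back(keys[i], static_cast<AccumT>(values[i]));
        }
    }
    return out;
}
```

For keys `{0, 0, 1}` and `std::uint8_t` values `{200, 100, 7}`, accumulating in `std::uint8_t` gives 44 for the first run (300 wrapped modulo 256), while accumulating in `int` gives 300. Any change of accumulator type during the port can therefore change user-visible results.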