NVIDIA / cccl

CUDA Core Compute Libraries

Describe potential solutions for algorithm implementations and their pros/cons #1771

Closed elstehle closed 3 weeks ago

elstehle commented 3 months ago

Situation

CUB kernel templates that implement algorithms have a template parameter, usually named OffsetT, which is used to index into the iterators.

[screenshot: kernel interface]

[screenshot: kernel implementation]

While the kernels, and hence the implementation, are templatized on an OffsetT, the API entry points of many CUB device-scope algorithms are hard-coded to use int as the offset type.

[screenshot: device-scope interface]

This is problematic: algorithms that currently use int as the offset type are limited to processing at most INT_MAX items. As device memory capacity keeps increasing, our users run into this limitation more and more frequently.
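To make the limitation concrete, here is a hypothetical, simplified device-scope entry point whose item count is hard-coded to `int` (this is not an actual CUB signature, just a sketch of the pattern described above):

```cpp
#include <climits>
#include <cstdint>

// Hypothetical, simplified device-scope entry point: the item count is
// hard-coded to int, mirroring the pattern described above.
template <typename InputIt, typename OutputIt>
void DeviceAlgorithm(InputIt in, OutputIt out, int num_items) { /* ... */ }

// 2^31 items -- e.g., a buffer of 2^31 int8_t elements, only 2 GiB --
// already cannot be represented as a positive int, so such an input
// cannot even be passed to the entry point without narrowing:
static_assert((std::int64_t{1} << 31) > INT_MAX,
              "2^31 items do not fit into an int item count");
```

Passing a 64-bit count to such an interface would silently narrow it, which is exactly the failure mode users hit today.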

Current state of offset type handling in CUB

The CUB algorithms currently handle the offset types with which kernel templates are instantiated in a wide range of different ways:

Overview of CUB algorithms and the current state of their offset type handling (as of 15 Aug 2024); please refer to https://github.com/NVIDIA/cccl/issues/50#issuecomment-1956325564 as the source of truth:

legend for offset type: βœ… considered done | 🟑 considered lower priority | 🟠 considered higher priority, as it prevents usage for larger-than-INT_MAX number of items | ⏳ in progress

legend for testing columns: βœ… considered done | 🟑 to be done | 🟠 needs to support wider offset types first

| algorithm | offset type | tests larger-than INT_MAX | tests close to [U]INT_MAX |
| --- | --- | --- | --- |
| device_adjacent_difference.cuh | βœ… `choose_offset_t` | βœ… 2^33, sanity check, iterators | 🟑 |
| device_copy.cuh | 🟑 num_ranges: `uint32_t`<br>🟑 buffer sizes: `iterator_traits<SizeIteratorT>::value_type` | 🟑 | 🟑 |
| device_for.cuh | 🟑 `NumItemsT`: ForEachN, ForEachCopyN, Bulk<br>`difference_type`: ForEach, ForEachCopy | 🟑 | 🟑 |
| device_histogram.cuh | 🟑 dynamic dispatch: `int` for (num_rows * row_stride_bytes) < INT_MAX; `OffsetT` otherwise | 🟑 | 🟑 |
| device_memcpy.cuh | 🟑 num_ranges: `uint32_t`<br>🟑 buffer sizes: `iterator_traits<SizeIteratorT>::value_type` | 🟑 | 🟑 |
| device_merge_sort.cuh | 🟑 `NumItemsT` | βœ… extensive check | βœ… extensive check |
| device_partition.cuh | 🟠 `int` | 🟠 | 🟠 |
| device_radix_sort.cuh | βœ… `choose_offset_t` | βœ… extensive check | βœ… extensive check |
| device_reduce.cuh | βœ… `choose_offset_t`: Reduce, Sum, Min, Max, ReduceByKey, TransformReduce<br>⚠️ (note) `int`: ArgMin, ArgMax | βœ… sanity, 2^{30,31,33} | βœ… sanity, 2^32-1 |
| device_run_length_encode.cuh | 🟠 `int` | 🟠 | 🟠 |
| device_scan.cuh | 🟠 `int` | 🟠 | 🟠 |
| device_segmented_radix_sort.cuh | 🟠 num_items & num_segments: `int` | 🟠 | 🟠 |
| device_segmented_reduce.cuh | 🟑 `common_iterator_value_t({Begin,End}OffsetIteratorT)`: Reduce, Sum, Min, Max<br>⚠️ (note) `int`: ArgMin, ArgMax<br>num_segments: `int` | βœ… sanity, rnd [2^31; 2^33] | 🟑 |
| device_segmented_sort.cuh | 🟠 num_items & num_segments: `int` | 🟠 | 🟠 |
| device_select.cuh | βœ… `choose_offset_t`: UniqueByKey<br>🟠 `int`: Flagged, If, Unique | | |
| device_spmv.cuh | 🟠 `int` | 🟠 | 🟠 |

Summary:

Goals

Primary goals:

Secondary goals:

Considerations:

Options

This section presents the options we are considering for supporting a large number of items in the algorithms that currently do not support 64-bit offset types.

Using the user-provided offset type

This option makes the offset type a template parameter of the Device* interface and passes that template parameter through to the kernel template.
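A minimal host-side sketch of this option (kernel launch details omitted; all names are illustrative, not the actual CUB interface):

```cpp
#include <cstdint>

// Sketch: the kernel template is parameterized on OffsetT, and the
// Device* entry point forwards the caller's offset type instead of
// hard-coding int. Host-side stand-in; a real kernel would distribute
// the indexing across threads.
template <typename InputIt, typename OutputIt, typename OffsetT>
void algorithm_kernel(InputIt in, OutputIt out, OffsetT num_items) {
  for (OffsetT i = 0; i < num_items; ++i) {
    out[i] = in[i];
  }
}

template <typename InputIt, typename OutputIt, typename NumItemsT>
void DeviceAlgorithm(InputIt in, OutputIt out, NumItemsT num_items) {
  // NumItemsT is passed through unchanged, so each distinct user-provided
  // offset type instantiates a distinct kernel template.
  algorithm_kernel<InputIt, OutputIt, NumItemsT>(in, out, num_items);
}
```

The pass-through is what drives the main trade-off discussed below: full generality for the user, at the cost of one kernel instantiation per distinct offset type.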

Pros

Cons

Using i32 & i64 offset types

This option instantiates the existing kernel templates with either int32_t or int64_t offset types, depending on the user-provided offset type. The mapping from the user-provided NumItemsT to the offset type used in the kernel template is as follows:

| User-provided offset type | Offset type used by kernel |
| --- | --- |
| smaller-than-32-bit | `int32_t` |
| `int32_t` | `int32_t` |
| `uint32_t` | `int64_t` |
| `int64_t` | `int64_t` |
| `uint64_t` | `int64_t` |
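The mapping above can be expressed as a small type trait; this is an illustrative sketch, not the actual CUB trait:

```cpp
#include <cstdint>
#include <type_traits>

// Sketch of the i32/i64 mapping: types narrower than 32 bits and int32_t
// itself map to int32_t; everything else (including uint32_t, whose
// maximum exceeds INT32_MAX) is widened to int64_t.
template <typename NumItemsT>
using signed_offset_t = typename std::conditional<
    (sizeof(NumItemsT) < 4) ||
        (sizeof(NumItemsT) == 4 && std::is_signed<NumItemsT>::value),
    std::int32_t,
    std::int64_t>::type;

static_assert(std::is_same<signed_offset_t<std::int16_t>, std::int32_t>::value, "");
static_assert(std::is_same<signed_offset_t<std::uint32_t>, std::int64_t>::value, "");
static_assert(std::is_same<signed_offset_t<std::uint64_t>, std::int64_t>::value, "");
```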

Pros

Cons

Using u32 & u64 offset types

This option instantiates the existing kernel templates with either uint32_t or uint64_t offset types, depending on the user-provided offset type. The mapping from the user-provided NumItemsT to the offset type used in the kernel template is as follows:

| User-provided offset type | Offset type used by kernel |
| --- | --- |
| smaller-than-32-bit | `uint32_t` |
| `int32_t` | `uint32_t` |
| `uint32_t` | `uint32_t` |
| `int64_t` | `uint64_t` |
| `uint64_t` | `uint64_t` |
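This mapping depends only on the width of the user-provided type, which matches the behavior the table above attributes to `choose_offset_t`. An illustrative sketch (not the actual CUB trait):

```cpp
#include <cstdint>
#include <type_traits>

// Sketch of the u32/u64 mapping: any integral type of up to 32 bits maps
// to uint32_t, wider types to uint64_t.
template <typename NumItemsT>
using unsigned_offset_t = typename std::conditional<
    (sizeof(NumItemsT) <= 4), std::uint32_t, std::uint64_t>::type;

static_assert(std::is_same<unsigned_offset_t<std::int32_t>, std::uint32_t>::value, "");
static_assert(std::is_same<unsigned_offset_t<std::int64_t>, std::uint64_t>::value, "");
```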

Pros

Cons

Using bit-packed tile state for algorithms carrying offset type in decoupled look-back

DevicePartition, DeviceSelect, DeviceReduceByKey, and DeviceRunLengthEncode carry the offset type in the tile value of the decoupled look-back, requiring 128-bit wide memory transactions for a tile that comprises a value and a status. A tile can be in one of four statuses, which requires two bits to represent. We can take two bits from the offset (i.e., the tile’s value) and keep the memory transactions at 64 bits.
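A sketch of such a bit-packed tile state, with 2 status bits and 62 value bits packed into a single 64-bit word (layout and names are illustrative, not the actual CUB TileState):

```cpp
#include <cstdint>

// 2 status bits + 62 value bits in one 64-bit word, so the decoupled
// look-back only needs 64-bit (rather than 128-bit) memory transactions.
enum class TileStatus : std::uint64_t {
  invalid = 0, not_ready = 1, partial = 2, inclusive = 3
};

constexpr int kValueBits = 62;
constexpr std::uint64_t kValueMask = (~std::uint64_t{0}) >> 2;

// value must fit in 62 bits, i.e., be at most 2^62 - 1; a real
// implementation would return a CUDA error at run time otherwise.
constexpr std::uint64_t pack(TileStatus status, std::uint64_t value) {
  return (static_cast<std::uint64_t>(status) << kValueBits) | (value & kValueMask);
}

constexpr TileStatus status_of(std::uint64_t packed) {
  return static_cast<TileStatus>(packed >> kValueBits);
}

constexpr std::uint64_t value_of(std::uint64_t packed) {
  return packed & kValueMask;
}
```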

Pros

- This may generally be a more efficient approach than using off-the-shelf TileStates for the decoupled look-back of these algorithms

Cons

- We need to return a CUDA error at run time if the number of items exceeds (2^62)-1

Streaming approach using i32 offsets

We can repeatedly invoke the algorithms on partitions of up to INT_MAX items at a time. In simple terms, each invocation processes one partition, with the input iterators offset by the previous partitions’ accumulated size.
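A host-side sketch of this streaming scheme, using a reduction as a stand-in for a single i32-offset algorithm invocation (all names are illustrative; the partition cap is a parameter so the chunking is testable with small inputs):

```cpp
#include <climits>
#include <cstdint>
#include <numeric>

// Stand-in for one invocation of an algorithm whose offsets are int32_t.
template <typename It>
std::int64_t reduce_partition(It first, std::int32_t num_items) {
  return std::accumulate(first, first + num_items, std::int64_t{0});
}

// Repeatedly invoke the i32-offset algorithm on partitions of at most
// max_partition (INT_MAX in practice) items, offsetting the iterator by
// the accumulated size of the previous partitions.
template <typename It>
std::int64_t streaming_reduce(It first, std::uint64_t num_items,
                              std::uint64_t max_partition = INT_MAX) {
  std::int64_t result = 0;
  for (std::uint64_t processed = 0; processed < num_items;) {
    const std::uint64_t remaining = num_items - processed;
    const std::int32_t partition_size = static_cast<std::int32_t>(
        remaining < max_partition ? remaining : max_partition);
    result += reduce_partition(first + processed, partition_size);
    processed += static_cast<std::uint64_t>(partition_size);
  }
  return result;
}
```

Note that a reduction composes trivially across partitions; algorithms with cross-partition state (e.g., scan, select) would additionally need to carry that state from one invocation to the next.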

Variants

Partitioning point:

Implementation:

| User-provided offset type | Offset type used by kernel |
| --- | --- |
| smaller-than-32-bit | `int32_t` ((a) unchanged code path, (b) using streaming implementation) |
| `int32_t` | `int32_t` ((a) unchanged code path, (b) using streaming implementation) |
| `uint32_t` | `int32_t` |
| `int64_t` | `int32_t` |
| `uint64_t` | `int32_t` |

Pros

Cons