NVIDIA / cccl

CUDA Core Compute Libraries

Describe potential solutions for algorithm implementations and their pros/cons #1771

Closed elstehle closed 3 weeks ago

elstehle commented 3 months ago

Situation

CUB kernel templates that implement algorithms have a template parameter, usually named OffsetT, which is used to index into the iterators.

[screenshot: kernel interface]

[screenshot: kernel implementation]

While the kernels, and hence the implementation, are templatized on an OffsetT, the API entry points of many CUB device-scope algorithms are hard-coded to use int as the offset type.

[screenshot: device-scope interface]

This is problematic: algorithms that currently use int as the offset type are limited to processing at most INT_MAX items. As device memory capacity keeps increasing, our users run into this limitation more and more frequently.
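To make the limitation concrete, here is a hypothetical, simplified device-scope entry point whose item count is hard-coded to `int` (this is not an actual CUB signature, just a sketch of the pattern described above):

```cpp
#include <climits>
#include <cstdint>

// Hypothetical, simplified device-scope entry point: the item count is
// hard-coded to int, mirroring the pattern described above.
template <typename InputIt, typename OutputIt>
void DeviceAlgorithm(InputIt in, OutputIt out, int num_items) { /* ... */ }

// 2^31 items -- e.g., a buffer of 2^31 int8_t elements, only 2 GiB --
// already cannot be represented as a positive int, so such an input
// cannot even be passed to the entry point without narrowing:
static_assert((std::int64_t{1} << 31) > INT_MAX,
              "2^31 items do not fit into an int item count");
```

Passing a 64-bit count to such an interface would silently narrow it, which is exactly the failure mode users hit today.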

Current state of offset type handling in CUB

The CUB algorithms currently handle the offset types with which kernel templates are instantiated in a wide range of different ways:

Overview of CUB algorithms and the current state of their offset type handling (as of 15 Aug 2024); please refer to https://github.com/NVIDIA/cccl/issues/50#issuecomment-1956325564 as the source of truth:

legend for offset type: βœ… considered done | 🟑 considered lower priority | 🟠 considered higher priority, as it prevents usage for larger-than-INT_MAX number of items | ⏳ in progress

legend for testing columns: βœ… considered done | 🟑 to be done | 🟠 needs to support wider offset types first

| algorithm | offset type | tests larger-than INT_MAX | tests close to [U]INT_MAX |
| --- | --- | --- | --- |
| device_adjacent_difference.cuh | βœ… `choose_offset_t` | βœ… 2^33, sanity check, iterators | 🟑 |
| device_copy.cuh | 🟑 num_ranges: `uint32_t`<br>🟑 buffer sizes: `iterator_traits<SizeIteratorT>::value_type` | 🟑 | 🟑 |
| device_for.cuh | 🟑 `NumItemsT`: ForEachN, ForEachCopyN, Bulk<br>`difference_type`: ForEach, ForEachCopy | 🟑 | 🟑 |
| device_histogram.cuh | 🟑 dynamic dispatch: `int` for (num_rows * row_stride_bytes) < INT_MAX; `OffsetT` otherwise | 🟑 | 🟑 |
| device_memcpy.cuh | 🟑 num_ranges: `uint32_t`<br>🟑 buffer sizes: `iterator_traits<SizeIteratorT>::value_type` | 🟑 | 🟑 |
| device_merge_sort.cuh | 🟑 `NumItemsT` | βœ… extensive check | βœ… extensive check |
| device_partition.cuh | 🟠 `int` | 🟠 | 🟠 |
| device_radix_sort.cuh | βœ… `choose_offset_t` | βœ… extensive check | βœ… extensive check |
| device_reduce.cuh | βœ… `choose_offset_t`: Reduce, Sum, Min, Max, ReduceByKey, TransformReduce<br>⚠️ (note) `int`: ArgMin, ArgMax | βœ… sanity, 2^{30,31,33} | βœ… sanity, 2^32-1 |
| device_run_length_encode.cuh | 🟠 `int` | 🟠 | 🟠 |
| device_scan.cuh | 🟠 `int` | 🟠 | 🟠 |
| device_segmented_radix_sort.cuh | 🟠 num_items & num_segments: `int` | 🟠 | 🟠 |
| device_segmented_reduce.cuh | 🟑 `common_iterator_value_t({Begin,End}OffsetIteratorT)`: Reduce, Sum, Min, Max<br>⚠️ (note) `int`: ArgMin, ArgMax<br>num_segments: `int` | βœ… sanity, rnd [2^31; 2^33] | 🟑 |
| device_segmented_sort.cuh | 🟠 num_items & num_segments: `int` | 🟠 | 🟠 |
| device_select.cuh | βœ… `choose_offset_t`: UniqueByKey<br>🟠 `int`: Flagged, If, Unique | | |
| device_spmv.cuh | 🟠 `int` | 🟠 | 🟠 |

Summary:

Goals

Primary goals:

Secondary goals:

Considerations:

Options

This section presents the options we are considering for supporting a large number of items in the algorithms that currently do not support 64-bit offset types.

Using the user-provided offset type

This option makes the offset type a template parameter of the Device* interface and passes that template parameter through to the kernel template.
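A minimal host-side sketch of this option (kernel launch details omitted; all names are illustrative, not the actual CUB interface):

```cpp
#include <cstdint>

// Sketch: the kernel template is parameterized on OffsetT, and the
// Device* entry point forwards the caller's offset type instead of
// hard-coding int. Host-side stand-in; a real kernel would distribute
// the indexing across threads.
template <typename InputIt, typename OutputIt, typename OffsetT>
void algorithm_kernel(InputIt in, OutputIt out, OffsetT num_items) {
  for (OffsetT i = 0; i < num_items; ++i) {
    out[i] = in[i];
  }
}

template <typename InputIt, typename OutputIt, typename NumItemsT>
void DeviceAlgorithm(InputIt in, OutputIt out, NumItemsT num_items) {
  // NumItemsT is passed through unchanged, so each distinct user-provided
  // offset type instantiates a distinct kernel template.
  algorithm_kernel<InputIt, OutputIt, NumItemsT>(in, out, num_items);
}
```

The pass-through is what drives the main trade-off discussed below: full generality for the user, at the cost of one kernel instantiation per distinct offset type.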

Pros

Cons

Using i32 & i64 offset types

This option instantiates the existing kernel templates with either int32_t or int64_t offset types, depending on the user-provided offset type. The mapping from the user-provided NumItemsT to the offset type used in the kernel template is as follows:

| User-provided offset type | Offset type used by kernel |
| --- | --- |
| smaller-than-32-bit | `int32_t` |
| `int32_t` | `int32_t` |
| `uint32_t` | `int64_t` |
| `int64_t` | `int64_t` |
| `uint64_t` | `int64_t` |
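The mapping above can be expressed as a small type trait; this is an illustrative sketch, not the actual CUB trait:

```cpp
#include <cstdint>
#include <type_traits>

// Sketch of the i32/i64 mapping: types narrower than 32 bits and int32_t
// itself map to int32_t; everything else (including uint32_t, whose
// maximum exceeds INT32_MAX) is widened to int64_t.
template <typename NumItemsT>
using signed_offset_t = typename std::conditional<
    (sizeof(NumItemsT) < 4) ||
        (sizeof(NumItemsT) == 4 && std::is_signed<NumItemsT>::value),
    std::int32_t,
    std::int64_t>::type;

static_assert(std::is_same<signed_offset_t<std::int16_t>, std::int32_t>::value, "");
static_assert(std::is_same<signed_offset_t<std::uint32_t>, std::int64_t>::value, "");
static_assert(std::is_same<signed_offset_t<std::uint64_t>, std::int64_t>::value, "");
```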

Pros

Cons

Using u32 & u64 offset types

This option instantiates the existing kernel templates with either uint32_t or uint64_t offset types, depending on the user-provided offset type. The mapping from the user-provided NumItemsT to the offset type used in the kernel template is as follows:

| User-provided offset type | Offset type used by kernel |
| --- | --- |
| smaller-than-32-bit | `uint32_t` |
| `int32_t` | `uint32_t` |
| `uint32_t` | `uint32_t` |
| `int64_t` | `uint64_t` |
| `uint64_t` | `uint64_t` |
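This mapping depends only on the width of the user-provided type, which matches the behavior the table above attributes to `choose_offset_t`. An illustrative sketch (not the actual CUB trait):

```cpp
#include <cstdint>
#include <type_traits>

// Sketch of the u32/u64 mapping: any integral type of up to 32 bits maps
// to uint32_t, wider types to uint64_t.
template <typename NumItemsT>
using unsigned_offset_t = typename std::conditional<
    (sizeof(NumItemsT) <= 4), std::uint32_t, std::uint64_t>::type;

static_assert(std::is_same<unsigned_offset_t<std::int32_t>, std::uint32_t>::value, "");
static_assert(std::is_same<unsigned_offset_t<std::int64_t>, std::uint64_t>::value, "");
```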

Pros

Cons

Using bit-packed tile state for algorithms carrying offset type in decoupled look-back

DevicePartition, DeviceSelect, DeviceReduceByKey, and DeviceRunLengthEncode carry the offset type in the tile value of the decoupled look-back, requiring 128-bit wide memory transactions for a tile that comprises a value and a status. A tile can be in one of four statuses, which requires two bits to represent. We can take two bits from the offset (i.e., the tile’s value) and keep the memory transactions at 64 bits.
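A sketch of such a bit-packed tile state, with 2 status bits and 62 value bits packed into a single 64-bit word (layout and names are illustrative, not the actual CUB TileState):

```cpp
#include <cstdint>

// 2 status bits + 62 value bits in one 64-bit word, so the decoupled
// look-back only needs 64-bit (rather than 128-bit) memory transactions.
enum class TileStatus : std::uint64_t {
  invalid = 0, not_ready = 1, partial = 2, inclusive = 3
};

constexpr int kValueBits = 62;
constexpr std::uint64_t kValueMask = (~std::uint64_t{0}) >> 2;

// value must fit in 62 bits, i.e., be at most 2^62 - 1; a real
// implementation would return a CUDA error at run time otherwise.
constexpr std::uint64_t pack(TileStatus status, std::uint64_t value) {
  return (static_cast<std::uint64_t>(status) << kValueBits) | (value & kValueMask);
}

constexpr TileStatus status_of(std::uint64_t packed) {
  return static_cast<TileStatus>(packed >> kValueBits);
}

constexpr std::uint64_t value_of(std::uint64_t packed) {
  return packed & kValueMask;
}
```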

Pros

- This may generally be a more efficient approach than using off-the-shelf TileStates for the decoupled look-back of these algorithms

Cons

- We need to return a CUDA error at run time if the number of items exceeds (2^62)-1

Streaming approach using i32 offsets

We can repeatedly invoke the algorithms on partitions of up to INT_MAX items at a time. In simple terms, each invocation processes one partition, with the input iterators offset by the previous partitions’ accumulated size.
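A host-side sketch of this streaming scheme, using a reduction as a stand-in for a single i32-offset algorithm invocation (all names are illustrative; the partition cap is a parameter so the chunking is testable with small inputs):

```cpp
#include <climits>
#include <cstdint>
#include <numeric>

// Stand-in for one invocation of an algorithm whose offsets are int32_t.
template <typename It>
std::int64_t reduce_partition(It first, std::int32_t num_items) {
  return std::accumulate(first, first + num_items, std::int64_t{0});
}

// Repeatedly invoke the i32-offset algorithm on partitions of at most
// max_partition (INT_MAX in practice) items, offsetting the iterator by
// the accumulated size of the previous partitions.
template <typename It>
std::int64_t streaming_reduce(It first, std::uint64_t num_items,
                              std::uint64_t max_partition = INT_MAX) {
  std::int64_t result = 0;
  for (std::uint64_t processed = 0; processed < num_items;) {
    const std::uint64_t remaining = num_items - processed;
    const std::int32_t partition_size = static_cast<std::int32_t>(
        remaining < max_partition ? remaining : max_partition);
    result += reduce_partition(first + processed, partition_size);
    processed += static_cast<std::uint64_t>(partition_size);
  }
  return result;
}
```

Note that a reduction composes trivially across partitions; algorithms with cross-partition state (e.g., scan, select) would additionally need to carry that state from one invocation to the next.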

Variants

Partitioning point:

Implementation:

| User-provided offset type | Offset type used by kernel |
| --- | --- |
| smaller-than-32-bit | `int32_t` ((a) unchanged code path, (b) using streaming implementation) |
| `int32_t` | `int32_t` ((a) unchanged code path, (b) using streaming implementation) |
| `uint32_t` | `int32_t` |
| `int64_t` | `int32_t` |
| `uint64_t` | `int32_t` |

Pros

Cons