NVIDIA / cccl

CUDA Core Compute Libraries
https://nvidia.github.io/cccl/

[FEA]: Specialize `Block{Load,Store[,Exchange]}` when `ITEMS_PER_THREAD` is `1` #1127

Open elstehle opened 10 months ago

elstehle commented 10 months ago

Is this a duplicate?

Area

CUB

Is your feature request related to a problem? Please describe.

BlockLoad provides a means to load data into a blocked arrangement. Some BlockLoadAlgorithm strategies load data with a striped, memory-access-friendly pattern (i.e., neighbouring threads access neighbouring items in memory) and then use a BlockExchange to convert the data from a striped (or warp-striped) arrangement into a blocked arrangement.

When ITEMS_PER_THREAD is 1, blocked and striped arrangements are equivalent: in both, each thread holds exactly the one item whose index equals its thread rank. This means that (1) no data exchange amongst threads is actually needed and (2) we can avoid allocating shared memory for TempStorage, as we don't need a scratchpad for data exchange. This applies to BlockLoad, BlockStore, and BlockExchange.
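To make the equivalence concrete, here is a minimal host-side sketch of the two index mappings. The helper names (`blocked_offset`, `striped_offset`, `arrangements_match`) are hypothetical and not part of CUB; they only model which tile offset each thread's register slot corresponds to.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers (not CUB API): offset within the block's tile of the
// item that thread `tid` holds in register slot `item`.

// Blocked: each thread owns a contiguous run of `items_per_thread` items.
constexpr std::size_t blocked_offset(std::size_t tid, std::size_t item,
                                     std::size_t items_per_thread)
{
  return tid * items_per_thread + item;
}

// Striped: consecutive items of one thread are `block_threads` apart.
constexpr std::size_t striped_offset(std::size_t tid, std::size_t item,
                                     std::size_t block_threads)
{
  return item * block_threads + tid;
}

// True if every (thread, item) slot maps to the same offset in both layouts.
constexpr bool arrangements_match(std::size_t block_threads,
                                  std::size_t items_per_thread)
{
  for (std::size_t tid = 0; tid < block_threads; ++tid)
  {
    for (std::size_t item = 0; item < items_per_thread; ++item)
    {
      if (blocked_offset(tid, item, items_per_thread)
          != striped_offset(tid, item, block_threads))
      {
        return false;
      }
    }
  }
  return true;
}

// With one item per thread both mappings reduce to `tid`.
static_assert(arrangements_match(128, 1), "layouts coincide for 1 item/thread");
// With several items per thread (and more than one thread) they differ.
static_assert(!arrangements_match(128, 4), "layouts differ for 4 items/thread");
```

With `item == 0` as the only slot, both formulas collapse to `tid`, which is why no exchange step is needed in that case.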

It's worth noting that BlockExchange provides the ScatterTo{Blocked,Striped} member functions, for which we will still require a TempStorage scratchpad for data exchange. Unfortunately, at the time of class instantiation, we don't know whether these member functions will be used. Hence, I'm afraid we will need to keep allocating the TempStorage scratchpad for BlockExchange, even if ITEMS_PER_THREAD is 1.

Thanks to @gevtushenko for suggesting this.

Describe the solution you'd like

Avoid superfluous data exchange and TempStorage allocations when ITEMS_PER_THREAD is 1.

Describe alternatives you've considered

No response

Additional context

No response

elstehle commented 4 months ago

Once we implement this optimization, we must ensure that AgentSelectIf is aware of the instances in which we avoid loading via shared memory (see https://github.com/NVIDIA/cccl/pull/1782).