Is your feature request related to a problem? Please describe.
BlockLoad provides a means to load data into a blocked arrangement. The following BlockLoadAlgorithm strategies load data with a striped, memory access-friendly pattern (i.e., neighbouring threads access neighbouring items in memory) followed by a BlockExchange to get the data from a striped (or warp-striped) arrangement into a blocked arrangement:
BLOCK_LOAD_TRANSPOSE
BLOCK_LOAD_WARP_TRANSPOSE
BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED
When ITEMS_PER_THREAD is 1, blocked and striped arrangements are equivalent. This means that (1) no data exchange amongst threads is actually needed and (2) we want to avoid allocating shared memory for TempStorage, as we don't need a scratchpad for data exchange. This applies to BlockLoad, BlockStore, and BlockExchange.
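To make the equivalence concrete, here is a minimal host-side sketch of the two index mappings. The helper names (blocked_index, striped_index, arrangements_match) are hypothetical and not part of CUB; they only model which item each thread owns under each arrangement.

```cpp
#include <cassert>
#include <cstddef>

// Blocked arrangement: thread `tid` owns a contiguous run of
// `items_per_thread` items. (Hypothetical helper, not a CUB function.)
inline std::size_t blocked_index(std::size_t tid, std::size_t item,
                                 std::size_t items_per_thread)
{
  return tid * items_per_thread + item;
}

// Striped arrangement: the i-th item of thread `tid` lies `block_threads`
// elements after its (i-1)-th item. (Hypothetical helper, not a CUB function.)
inline std::size_t striped_index(std::size_t tid, std::size_t item,
                                 std::size_t block_threads)
{
  return item * block_threads + tid;
}

// Returns true if both arrangements assign exactly the same items to
// every thread, i.e. no exchange would be needed to convert between them.
inline bool arrangements_match(std::size_t block_threads,
                               std::size_t items_per_thread)
{
  for (std::size_t tid = 0; tid < block_threads; ++tid)
  {
    for (std::size_t item = 0; item < items_per_thread; ++item)
    {
      if (blocked_index(tid, item, items_per_thread)
          != striped_index(tid, item, block_threads))
      {
        return false;
      }
    }
  }
  return true;
}
```

With ITEMS_PER_THREAD equal to 1, the only valid item index is 0, so both mappings reduce to the thread index and arrangements_match(128, 1) holds. With, say, 4 items per thread the mappings diverge, which is the case where the exchange is genuinely needed.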
It's worth noting that BlockExchange also provides the ScatterToBlocked and ScatterToStriped member functions, which still require a TempStorage scratchpad for data exchange. Unfortunately, at the time of class instantiation we cannot know whether these member functions will be used. Hence, I'm afraid BlockExchange will need to keep allocating its TempStorage scratchpad, even if ITEMS_PER_THREAD is 1.
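For BlockLoad and BlockStore, by contrast, whether a scratchpad is needed could be decided at compile time from ITEMS_PER_THREAD alone. The following is only a rough sketch of that idea, not CUB's actual implementation: SketchBlockLoad, ExchangeStorage, and EmptyStorage are hypothetical names, and the real TempStorage layout in CUB differs.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical placeholder used when no scratchpad is required.
struct EmptyStorage
{};

// Hypothetical scratchpad large enough to exchange one tile of items.
template <typename T, int BlockThreads, int ItemsPerThread>
struct ExchangeStorage
{
  T buffer[BlockThreads * ItemsPerThread];
};

// Hypothetical stand-in for a transposing BlockLoad. It only models the
// compile-time choice of the TempStorage type, not the load itself.
template <typename T, int BlockThreads, int ItemsPerThread>
struct SketchBlockLoad
{
  // With one item per thread, blocked and striped arrangements coincide,
  // so no exchange (and no scratchpad) is needed.
  static constexpr bool needs_exchange = ItemsPerThread > 1;

  using TempStorage = typename std::conditional<
    needs_exchange,
    ExchangeStorage<T, BlockThreads, ItemsPerThread>,
    EmptyStorage>::type;
};
```

In this sketch, SketchBlockLoad<int, 128, 1>::TempStorage is an empty type, while SketchBlockLoad<int, 128, 4>::TempStorage holds a full tile. The same trick would not work for BlockExchange, for the reason given above: the scatter member functions may still need the buffer.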
Once we implement this optimization, we must ensure that AgentSelectIf is aware of the instances in which we avoid loading via shared memory (see https://github.com/NVIDIA/cccl/pull/1782).
Is this a duplicate?
Area
CUB
Thanks to @gevtushenko for suggesting this.
Describe the solution you'd like
Avoid superfluous data exchange and TempStorage allocations when ITEMS_PER_THREAD is 1.
Describe alternatives you've considered
No response
Additional context
No response