[FEA]: Specialize `Block{Load,Store[,Exchange]}` when `ITEMS_PER_THREAD` is `1`

Is this a duplicate?

[X] I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct

Area

CUB

Is your feature request related to a problem? Please describe.

BlockLoad loads data into a blocked arrangement. The following BlockLoadAlgorithm strategies first load the data with a striped, memory access-friendly pattern (i.e., neighbouring threads access neighbouring items in memory) and then use a BlockExchange to convert the data from the striped (or warp-striped) arrangement into a blocked arrangement:

BLOCK_LOAD_TRANSPOSE
BLOCK_LOAD_WARP_TRANSPOSE
BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED

When ITEMS_PER_THREAD is 1, the blocked, striped, and warp-striped arrangements are identical: in each of them, thread i holds exactly item i of the tile. This means that (1) no data exchange among threads is needed and (2) we want to avoid allocating shared memory for TempStorage, as no scratchpad is needed for a data exchange. This applies to BlockLoad, BlockStore, and BlockExchange.
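The equivalence can be checked with a small host-side sketch. The helper names below (`blocked_index`, `striped_index`, `warp_striped_index`, `arrangements_match`) are hypothetical and not part of CUB; they only encode the usual index formulas for the three arrangements, assuming the block size is a multiple of the warp size.

```cpp
#include <cassert>

// Tile index of the `item`-th element held by `thread` in a blocked arrangement.
constexpr int blocked_index(int thread, int item, int items_per_thread) {
    return thread * items_per_thread + item;
}

// Tile index of the `item`-th element held by `thread` in a striped arrangement.
constexpr int striped_index(int thread, int item, int block_threads) {
    return item * block_threads + thread;
}

// Tile index in a warp-striped arrangement: each warp owns a contiguous
// segment of the tile and stripes its items across the lanes of that warp.
constexpr int warp_striped_index(int thread, int item, int items_per_thread,
                                 int warp_threads) {
    const int warp = thread / warp_threads;
    const int lane = thread % warp_threads;
    return warp * warp_threads * items_per_thread + item * warp_threads + lane;
}

// Returns true if every (thread, item) pair maps to the same tile index
// in all three arrangements.
constexpr bool arrangements_match(int block_threads, int items_per_thread,
                                  int warp_threads) {
    for (int t = 0; t < block_threads; ++t) {
        for (int i = 0; i < items_per_thread; ++i) {
            const int b = blocked_index(t, i, items_per_thread);
            const int s = striped_index(t, i, block_threads);
            const int w = warp_striped_index(t, i, items_per_thread, warp_threads);
            if (b != s || b != w) {
                return false;
            }
        }
    }
    return true;
}

// Spot check for illustration: one item per thread matches, four do not.
inline void check_single_item_equivalence() {
    assert(arrangements_match(128, 1, 32));
    assert(!arrangements_match(128, 4, 32));
}
```

With one item per thread, all three formulas reduce to the thread index, which is why the exchange step becomes a no-op.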

Note that BlockExchange also provides the ScatterTo{Blocked,Striped} member functions, which still require a TempStorage scratchpad for the data exchange, since items are scattered according to caller-provided ranks. Unfortunately, at the time of class instantiation we don't know whether these member functions will be used. Hence, BlockExchange will need to keep allocating a TempStorage scratchpad, even if ITEMS_PER_THREAD is 1.

Thanks to @gevtushenko for suggesting this.

Describe the solution you'd like

Avoid superfluous data exchange and TempStorage allocations when ITEMS_PER_THREAD is 1.
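One possible shape for the TempStorage part, as a minimal host-side sketch rather than a proposal for the actual CUB implementation: select an empty storage type at compile time when ITEMS_PER_THREAD is 1. All names here (`EmptyStorage`, `ExchangeStorage`, `SketchBlockLoad`, `SingleItemLoad`, `MultiItemLoad`) are hypothetical and only illustrate the idea for BlockLoad and BlockStore; BlockExchange would keep its scratchpad because of the ScatterTo{Blocked,Striped} member functions.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for a TempStorage that holds no data (not a CUB type).
struct EmptyStorage {};

// Hypothetical exchange scratchpad: one slot per item in the tile.
template <typename T, int BLOCK_THREADS, int ITEMS_PER_THREAD>
struct ExchangeStorage {
    T slots[BLOCK_THREADS * ITEMS_PER_THREAD];
};

// Hypothetical sketch of a transposing block load that only asks for a
// scratchpad when an exchange is actually required.
template <typename T, int BLOCK_THREADS, int ITEMS_PER_THREAD>
struct SketchBlockLoad {
    // With a single item per thread, striped and blocked coincide,
    // so no exchange (and no scratchpad) is needed.
    static constexpr bool needs_exchange = (ITEMS_PER_THREAD > 1);

    using TempStorage = std::conditional_t<
        needs_exchange,
        ExchangeStorage<T, BLOCK_THREADS, ITEMS_PER_THREAD>,
        EmptyStorage>;
};

// Example instantiations used for illustration.
using SingleItemLoad = SketchBlockLoad<int, 128, 1>;
using MultiItemLoad  = SketchBlockLoad<int, 128, 4>;

static_assert(!SingleItemLoad::needs_exchange, "no exchange for one item per thread");
static_assert(std::is_empty<SingleItemLoad::TempStorage>::value,
              "single-item storage holds no data");
static_assert(sizeof(MultiItemLoad::TempStorage) == 128 * 4 * sizeof(int),
              "multi-item storage holds the full tile");
```

Selecting the type with `std::conditional_t` keeps the public `TempStorage` alias intact, so existing user code that declares a `TempStorage` object would still compile under this sketch.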

Describe alternatives you've considered

No response

Additional context

No response

NVIDIA / cccl