NVIDIA / cccl

CUDA Core Compute Libraries

[EPIC] Optimize `thrust::transform` for newer architectures #1947

Open bernhardmgruber opened 1 month ago

bernhardmgruber commented 1 month ago

Motivation It is increasingly hard to reach SOL (speed of light) on newer GPU architectures, starting with A100 and H100, especially for simple kernels like thrust::transform(..., thrust::plus{}), which load only a few values and perform little compute. CUB algorithms already counter this by processing several elements per thread, but internal research hints at the necessity to further increase the amount of data in flight.

Use case thrust::transform is an important primitive for many algorithms and also occurs in BabelStream, a highly relevant HPC benchmark often used to produce representative numbers when comparing the performance of hardware architectures. We should therefore dedicate some effort to ensuring that thrust::transform performs well.
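
For reference, BabelStream's "add" kernel maps directly onto thrust::transform; a minimal sketch (vector names and types are illustrative, not taken from the benchmark sources):

```cpp
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/transform.h>

// BabelStream "add" (c[i] = a[i] + b[i]) expressed as a single thrust::transform.
void babelstream_add(const thrust::device_vector<double>& a,
                     const thrust::device_vector<double>& b,
                     thrust::device_vector<double>& c)
{
    thrust::transform(a.begin(), a.end(), b.begin(), c.begin(), thrust::plus<double>{});
}
```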

Approach The main strategy is to have more "bytes in flight" when reading, with the concrete amount depending on the target architecture (a tuning parameter). There are multiple ways to generate more loads. Again, internal research points to using either prefetching or the tensor memory accelerator (TMA, e.g. via memcpy_async) on newer architectures. Excessive unrolling and loading into registers works as well, but has the drawback of consuming a large number of registers on architectures that require many bytes in flight.
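
As a rough illustration of the memcpy_async route (a sketch only, not the eventual CUB implementation; tile size, types and the assumption that the grid exactly covers the input are placeholders):

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

// Sketch: stage a tile of the input into shared memory asynchronously, then
// apply the transformation from the staged tile. Launch with tile_size threads
// per block and ceil(n / tile_size) blocks.
template <class F>
__global__ void transform_tile(const float* in, float* out, int n, F f)
{
    constexpr int tile_size = 256;
    __shared__ float tile[tile_size];

    auto block      = cg::this_thread_block();
    const int base  = blockIdx.x * tile_size;
    const int valid = min(tile_size, n - base);

    // Block-wide asynchronous copy; on Ampere and newer this can lower to
    // asynchronous copy instructions, putting more bytes in flight per block.
    cg::memcpy_async(block, tile, in + base, sizeof(float) * valid);
    cg::wait(block); // wait until the staged tile is ready

    if (threadIdx.x < valid)
        out[base + threadIdx.x] = f(tile[threadIdx.x]);
}
```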

Address stability For the loading strategy we have to consider the address stability of data items as well. Users sometimes rely on the ability to retrieve the index inside an input array from the reference of a loaded element:

```cpp
transform(par, a, a + n, a, [a, b, c](const T& e) {
    const auto i = &e - a;    // &e expected to point into global memory
    return e + b[i] + c[i];
});
```

Such a user-provided function object inhibits any optimization that loads elements from global memory into registers or shared memory before passing them as arguments, thus allowing only prefetching as an optimization. Address-oblivious function objects can benefit from a larger variety of optimizations (like TMA or pipelined loading into registers).
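
For contrast, the same computation can be written in an address-oblivious way by zipping all inputs together so the functor never needs the address of its argument. This is only an illustration of the distinction (reusing a, b, c, n and par from the snippet above and assuming an extended __device__ lambda), not a proposed API:

```cpp
// Address-oblivious variant: b and c are explicit inputs, the argument is a
// tuple of values, and the backend is free to stage elements through registers
// or shared memory before invoking the functor.
auto in = thrust::make_zip_iterator(thrust::make_tuple(a, b, c));
thrust::transform(par, in, in + n, a,
                  [] __device__ (const thrust::tuple<T, T, T>& t) {
                      return thrust::get<0>(t) + thrust::get<1>(t) + thrust::get<2>(t);
                  });
```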

Further concerns Furthermore, the computational intensity and the shared memory/register consumption of the user-provided function object influence the loading strategy. Longer computations seem to require more data in flight. Shared memory is contested between TMA and the user-side computation. Register pressure limits unrolling.

Status quo thrust::transform (CUDA) is currently built on top of cub::DeviceFor::Bulk, which eventually dispatches independently of the data types used or the number of input and output streams. Because cub::DeviceFor::Bulk is index-based, the involved input and output data streams are not visible to it, and no tuning based on this information is possible. The situation is similar with cub::DeviceFor::ForEach et al.
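
For illustration, a simplified sketch of such an index-based formulation (not the actual Thrust code path; names are made up) shows why the data streams are opaque to tuning:

```cpp
#include <cub/device/device_for.cuh>

// The data streams live inside the function object, so the algorithm only ever
// sees indices and cannot specialize on element types or the number of streams.
struct add_op
{
    const float* in1;
    const float* in2;
    float*       out;

    __device__ void operator()(int i) const { out[i] = in1[i] + in2[i]; }
};

cudaError_t launch_add(const float* in1, const float* in2, float* out, int n, cudaStream_t stream)
{
    return cub::DeviceFor::Bulk(n, add_op{in1, in2, out}, stream);
}
```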

Strategy I propose to add a new CUB algorithm cub::DeviceTransform governing transformations of N input streams into a single output stream (maybe M output streams if use cases arise) and to rebase thrust::transform on top of it.
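
As a strawman only, the interface could take the input iterators explicitly, e.g. bundled in a tuple, so that the value types and the number of streams become visible to the tuning machinery. Everything in the following snippet is hypothetical:

```cpp
// Hypothetical usage sketch -- the actual header, overload set and tuning hooks
// of cub::DeviceTransform are exactly what this epic has to work out.
cub::DeviceTransform::Transform(
    cuda::std::make_tuple(in1, in2), // N input streams, visible to tuning
    out,                             // single output stream
    num_items,
    cuda::std::plus<>{},             // user-provided transformation
    stream);
```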

### Future tasks after merging `cub::DeviceTransform`
- [ ] https://github.com/NVIDIA/cccl/issues/2091
- [ ] Split BabelStream benchmarks by number of iterators and tune individually
- [ ] https://github.com/NVIDIA/cccl/issues/2263
- [ ] Large offset support in CUB API
- [ ] Automatic address stability detection
- [ ] Mark all Thrust, CUB and libcu++ functors as address-oblivious, where possible
- [ ] Optimize for mixing contiguous and non-contiguous iterators (e.g. buffers and a counting iterator)
- [ ] Refactor prefetching and TMA loading into new CUB block load algorithms for reuse (requires CUB block load redesign)
- [ ] Allow dispatching on problem size if necessary
- [ ] Ensure BabelStream (upstream) takes advantage of these improvements
- [ ] Try to mark non-aliased input streams with `__restrict__`
- [ ] Beyond: Try prefetching and TMA block loads in other algorithms
- [ ] Port `thrust::transform_if` to CUB as well
- [ ] Port `thrust::transform_if` with stencil to CUB as well
bernhardmgruber commented 1 month ago

It turns out the C++ standard does not guarantee address stability of function arguments passed to user-provided callables in the context of parallel algorithms. See:

bernhardmgruber commented 1 month ago

We discussed address stability again today and concluded the following:

bernhardmgruber commented 1 month ago

Address stability: bringing this up because I just encountered it in the tests on my A6000. If we use a kernel serving parameters from shared memory and the user performs pointer arithmetic with a pointer to global memory, the kernel crashes (Release build) and the following error code is reported at the next cudaDeviceSynchronize:

717 (operation not supported on global/shared address space)

That's at least better than a garbage result and the kernel continuing with wrong data.
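
For the record, the failing combination boils down to something like the following sketch (illustrative only; whether it crashes or silently misbehaves depends on what the compiler emits for the mixed address spaces):

```cpp
// The kernel stages elements into shared memory, but the user code subtracts a
// global pointer from the address of its (now shared-memory) argument. The
// resulting "index" is meaningless and the subsequent access is undefined
// behavior; in the scenario above it was reported as error 717 at the next
// synchronization rather than silently producing garbage.
__global__ void staged_transform(const float* a, const float* b, const float* c,
                                 float* out, int n)
{
    __shared__ float tile[256];
    const int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? a[i] : 0.0f;
    __syncthreads();
    if (i >= n)
        return;

    const float& e  = tile[threadIdx.x]; // reference into shared memory
    const auto  idx = &e - a;            // user expects &e to point into a[]
    out[i] = e + b[idx] + c[idx];        // bogus index
}
```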