NVIDIA / cccl

CUDA Core Compute Libraries
https://nvidia.github.io/cccl/

Support unaligned workloads in reproducible reduction #2120

Open gevtushenko opened 2 months ago

gevtushenko commented 2 months ago

As we kept simplifying the reproducible reduction kernel, we removed the code path that handles contiguous iterators that are not aligned to 16 bytes (the size of a float4). We should investigate the following options:

  1. Restore the runtime check of input pointer alignment: if this change doesn't introduce performance regressions, we should select it (it is the best approach from the compilation time perspective).
  2. Split the reproducible reduction kernel into 16-byte-aligned and 4-byte-aligned versions: this option should be selected only if the first option leads to performance regressions of more than 5%. This approach will lead to kernel duplication and the slowest compilation times.
  3. Get rid of the vectorized input processing.
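
For reference, the runtime alignment check from option 1 could look roughly like the host-side sketch below. The names `is_aligned`, `load_path`, and `select_load_path` are hypothetical and are not part of CCCL. The sketch only shows the pointer check and the path selection, and it assumes `float` input. It does not show how the kernel would keep the summation order identical on both paths, which the reproducibility guarantee would presumably require.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not a CCCL API): returns true if `ptr` is aligned
// to `alignment` bytes. `alignment` must be a non-zero power of two.
inline bool is_aligned(const void* ptr, std::size_t alignment)
{
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}

// Hypothetical tag for the two ways of reading the input.
enum class load_path
{
  vectorized, // 16-byte loads (float4-sized)
  scalar      // 4-byte loads (one float at a time)
};

// Hypothetical dispatch: take the vectorized path only when the input
// pointer is 16-byte aligned, otherwise fall back to scalar loads.
inline load_path select_load_path(const float* input)
{
  return is_aligned(input, 16) ? load_path::vectorized : load_path::scalar;
}
```

With a check like this, a 16-byte-aligned buffer would select the vectorized path, while the same buffer offset by one `float` would select the scalar path. Whether the extra branch costs anything measurable is what the benchmark in option 1 would need to answer.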

This issue can be closed by:

SAtacker commented 1 month ago

We do not need aligned iterators for the simple kernel as far as I know. Should we close this?