While simplifying the reproducible reduction kernel, we removed the code path that handles contiguous iterators that are not 16-byte aligned (`float4`). We should investigate the following options:
- Bring back the runtime check of input pointer alignment: if this doesn't introduce performance regressions, it should be selected, since it is the best approach from the compilation-time perspective (see the sketch after this list).
- Split the reproducible reduction kernel into 16-byte-aligned and 4-byte-aligned versions: this option should be selected only if the first option leads to performance regressions of more than 5%. It duplicates the kernel and yields the slowest compilation times.
- Get rid of the vectorized input processing entirely.
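The sketch below illustrates the first option. It is not the actual reproducible reduction kernel; the kernel name, launch shape, and the final `atomicAdd` combine are placeholders. It only shows the idea of checking the input pointer at runtime and dispatching between a `float4` path and a scalar fallback.

```cpp
// A minimal sketch of option 1, assuming a grid-stride reduction kernel.
// The dispatch is the point: take the vectorized (float4) path only when the
// input pointer happens to be 16-byte aligned, and fall back to plain float
// loads otherwise, so misaligned contiguous iterators keep working.
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>

__device__ __forceinline__ bool is_float4_aligned(const float* p)
{
  return reinterpret_cast<std::uintptr_t>(p) % alignof(float4) == 0;
}

__global__ void sum_kernel(const float* in, std::size_t n, float* out)
{
  const std::size_t tid    = blockIdx.x * blockDim.x + threadIdx.x;
  const std::size_t stride = static_cast<std::size_t>(gridDim.x) * blockDim.x;
  float acc = 0.0f;

  if (is_float4_aligned(in))
  {
    // Vectorized path: one 16-byte load covers four floats.
    const std::size_t n4 = n / 4;
    const float4* in4    = reinterpret_cast<const float4*>(in);
    for (std::size_t i = tid; i < n4; i += stride)
    {
      const float4 v = in4[i];
      acc += v.x + v.y + v.z + v.w;
    }
    // Scalar tail for the remaining n % 4 elements.
    for (std::size_t i = n4 * 4 + tid; i < n; i += stride)
    {
      acc += in[i];
    }
  }
  else
  {
    // Scalar fallback for inputs that are not 16-byte aligned,
    // e.g. a contiguous iterator offset by one element.
    for (std::size_t i = tid; i < n; i += stride)
    {
      acc += in[i];
    }
  }

  // Placeholder combine step; the real kernel accumulates partials reproducibly.
  atomicAdd(out, acc);
}
```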
This issue can be closed by:
- a passing test case of reproducible reduction over a misaligned contiguous range, e.g. `thrust::device_vector<float> vec(n); reduce(vec.begin() + 1, n - 1)` (see the sketch below)
- an NVBench result showing that this change introduces no performance regression
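Below is a rough sketch of such a test. `thrust::reduce` stands in for the reproducible reduction entry point, whose exact name and signature are not assumed here; the relevant part is that the reduced range starts at `vec.begin() + 1`, so the underlying pointer is not 16-byte aligned.

```cpp
// Rough sketch of the closing test for the misaligned-input scenario.
// thrust::reduce is only a stand-in; the real test would call the
// reproducible reduction instead.
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cassert>
#include <cstddef>

int main()
{
  const std::size_t n = 1 << 20;
  thrust::device_vector<float> vec(n, 1.0f);

  // Misaligned contiguous range: begin() + 1 with n - 1 elements.
  const float result   = thrust::reduce(vec.begin() + 1, vec.end(), 0.0f);
  const float expected = static_cast<float>(n - 1);

  assert(result == expected);
  return 0;
}
```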