[FEA]: Non-deterministic DeviceReduce

Is this a duplicate?

[X] I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct

Area

CUB

Is your feature request related to a problem? Please describe.

The std::transform_reduce algorithm does not require deterministic results, but an implementation built on top of CUB is "pseudo-deterministic": its results are run-to-run deterministic on a given device for a given CUB version.

This guarantee prevents optimizing DeviceReduce with algorithms that cannot uphold it.
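
To illustrate why the order of a reduction can matter at all, here is a minimal host-only sketch. It is not CUB code, and the helper names `sum_sequential` and `sum_pairwise` are made up for this example. It relies only on the fact that floating-point addition is not associative, so regrouping the same operands can change the result.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: left-to-right sum, the order a sequential loop uses.
inline float sum_sequential(const std::vector<float>& v) {
    float acc = 0.0f;
    for (float x : v) {
        acc += x;
    }
    return acc;
}

// Hypothetical helper: pairwise (tree-shaped) sum over the range [lo, hi),
// which groups operands the way a simple parallel reduction might.
inline float sum_pairwise(const std::vector<float>& v, std::size_t lo, std::size_t hi) {
    if (hi == lo) {
        return 0.0f;
    }
    if (hi - lo == 1) {
        return v[lo];
    }
    const std::size_t mid = lo + (hi - lo) / 2;
    return sum_pairwise(v, lo, mid) + sum_pairwise(v, mid, hi);
}
```

For the input `{1e8f, 1.0f, -1e8f, 1.0f}`, the two groupings give different sums under IEEE-754 single precision, because adding `1.0f` to a value of magnitude `1e8f` is rounded away. An algorithm that picks its grouping dynamically could therefore return different results from run to run.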

Describe the solution you'd like

Add an option to DeviceReduce that controls whether run-to-run determinism is guaranteed, with determinism enabled by default.
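
As a rough sketch of the shape such an option could take, the host-only example below uses standard C++ in place of CUB. Everything in it is hypothetical: the `sketch` namespace, the `determinism` enum, and the `reduce` signature are assumptions for illustration and do not exist in CUB. The deterministic path uses `std::accumulate`, which applies the operator in a fixed left-to-right order; the relaxed path uses `std::reduce`, which the standard allows to regroup and reorder operands.

```cpp
#include <functional>
#include <numeric>
#include <vector>

namespace sketch {  // hypothetical namespace, not part of CUB

// Hypothetical option; CUB does not define this enum.
enum class determinism { run_to_run, not_guaranteed };

// Hypothetical interface: determinism defaults to run_to_run,
// matching the default proposed in this request.
template <typename T, typename Op = std::plus<>>
T reduce(const std::vector<T>& in, T init, Op op = {},
         determinism d = determinism::run_to_run) {
    if (d == determinism::run_to_run) {
        // Fixed left-to-right order: the same input yields the same result every run.
        return std::accumulate(in.begin(), in.end(), init, op);
    }
    // The implementation is free to regroup and reorder operands here,
    // which leaves room for faster algorithms.
    return std::reduce(in.begin(), in.end(), init, op);
}

}  // namespace sketch
```

Callers who do not pass the option keep today's behavior; opting out would look like `sketch::reduce(v, 0.0f, std::plus<>{}, sketch::determinism::not_guaranteed)`.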

Describe alternatives you've considered

Not using CUB / Thrust.

Additional context

No response
