NVIDIA / cccl

CUDA Core Compute Libraries

Allow custom tuning policies to be passed into device algorithms. #855

Open sh1ng opened 5 years ago

sh1ng commented 5 years ago

When I used an iterator as the input to device reduce, the reduction kernel became limited by the number of registers. The iterator does a few math operations on data in global memory plus some branching. Decreasing the default tuning parameters in dispatch_reduce.cuh gave a slight performance improvement, but it also affected the performance of a plain reduction. I kept those changes because they improve total performance. Do you think it would be wise to add an optional parameter for specifying the execution policy of every device operation? What other tricks can be used to improve performance for a pipeline like: read data from global memory -> deterministic logic on it -> a CUB operation such as reduce?
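For concreteness, here is a minimal sketch (not taken from the issue) of the kind of pipeline described above: the per-element "deterministic logic" is fused into the input iterator so it runs inside the single reduction kernel, which is exactly the situation where iterator register pressure can start to limit occupancy. The `Squared` functor and `float` types are placeholders.

```cpp
#include <cub/cub.cuh>
#include <thrust/iterator/transform_iterator.h>

// Placeholder for the "deterministic logic" applied to each element.
struct Squared
{
  __host__ __device__ float operator()(float x) const { return x * x; }
};

void sum_of_squares(const float* d_in, float* d_out, int num_items)
{
  // Wrap the raw pointer so the transform happens inside the reduce kernel.
  auto it = thrust::make_transform_iterator(d_in, Squared{});

  // Standard two-phase CUB call: query temp storage size, then run.
  void*  d_temp_storage     = nullptr;
  size_t temp_storage_bytes = 0;
  cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, it, d_out, num_items);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, it, d_out, num_items);
  cudaFree(d_temp_storage);
}
```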

alliepiper commented 3 years ago

I agree that it would be very useful to provide custom tuning policies when invoking device algorithms. I think it might already be possible to inject one somewhere in the Device -> Dispatch -> Agent stack, but ideally we should provide a more user-friendly API that we can document and test.
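For reference, the tunables being edited in dispatch_reduce.cuh live in a "policy hub" that the dispatch layer selects by PTX architecture. Below is a rough sketch of what a user-supplied hub could look like. The nested type names (`ReducePolicy`, `SingleTilePolicy`, `SegmentedReducePolicy`) and the `cub::AgentReducePolicy` / `cub::ChainedPolicy` parameters mirror the internal hub in dispatch_reduce.cuh as of recent CUB releases, but this is an assumption about internals, not a supported public API; there is currently no documented entry point for passing such a hub into `cub::DeviceReduce`, which is what this issue asks for.

```cpp
#include <cub/cub.cuh>

// Hypothetical custom tuning for device reduce; a sketch, not a supported API.
struct CustomReducePolicyHub
{
  // A single policy tier covering all architectures, chained to itself.
  struct MaxPolicy : cub::ChainedPolicy<300, MaxPolicy, MaxPolicy>
  {
    // Fewer items per thread than the default, to relieve register pressure
    // when the input iterator itself does non-trivial work.
    using ReducePolicy = cub::AgentReducePolicy<
      128,                                // threads per block
      8,                                  // items per thread
      float,                              // compute type used for scaling
      4,                                  // vector load length
      cub::BLOCK_REDUCE_WARP_REDUCTIONS,  // block-level reduction algorithm
      cub::LOAD_LDG>;                     // cache load modifier

    // The dispatch layer also expects policies for the single-tile and
    // segmented code paths; reusing ReducePolicy keeps the sketch short.
    using SingleTilePolicy      = ReducePolicy;
    using SegmentedReducePolicy = ReducePolicy;
  };
};
```

Recent versions of the internal `cub::DispatchReduce` template accept a policy hub as a trailing template parameter, so something like this can already be injected at the dispatch layer, but that parameter is undocumented and its position and requirements have shifted between releases, which is why a stable, user-facing knob would be preferable.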