NVIDIA / cccl

CUDA Core Compute Libraries

Allow custom tuning policies to be passed into device algorithms. #855

Open sh1ng opened 5 years ago

sh1ng commented 5 years ago

When I used an iterator as the input to device reduce, the reduction kernel became limited by the number of registers. The iterator does a few math operations on data in global memory plus some branching. Decreasing the default tuning parameters in dispatch_reduce.cuh gave a slight performance improvement, but it also affected the performance of a plain reduction. I kept those changes because they improve total performance. Do you think it would be wise to add an optional parameter for specifying the execution policy of every device operation? What other tricks can be used to improve performance for a pipeline like: read data from global memory -> deterministic logic on it -> a CUB operation such as reduce?
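For concreteness, here is a minimal sketch (not taken from the issue) of the kind of pipeline described above: the per-element "deterministic logic" is fused into the input iterator so it runs inside the single reduction kernel, which is exactly the situation where iterator register pressure can start to limit occupancy. The `Squared` functor and `float` types are placeholders.

```cpp
#include <cub/cub.cuh>
#include <thrust/iterator/transform_iterator.h>

// Placeholder for the "deterministic logic" applied to each element.
struct Squared
{
  __host__ __device__ float operator()(float x) const { return x * x; }
};

void sum_of_squares(const float* d_in, float* d_out, int num_items)
{
  // Wrap the raw pointer so the transform happens inside the reduce kernel.
  auto it = thrust::make_transform_iterator(d_in, Squared{});

  // Standard two-phase CUB call: query temp storage size, then run.
  void*  d_temp_storage     = nullptr;
  size_t temp_storage_bytes = 0;
  cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, it, d_out, num_items);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, it, d_out, num_items);
  cudaFree(d_temp_storage);
}
```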

alliepiper commented 3 years ago

I agree that it would be very useful to provide custom tuning policies when invoking device algorithms. I think it might already be possible to inject one somewhere in the Device -> Dispatch -> Agent stack, but ideally we should provide a more user-friendly API that we can document and test.
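For reference, the tunables being edited in dispatch_reduce.cuh live in a "policy hub" that the dispatch layer selects by PTX architecture. Below is a rough sketch of what a user-supplied hub could look like. The nested type names (`ReducePolicy`, `SingleTilePolicy`, `SegmentedReducePolicy`) and the `cub::AgentReducePolicy` / `cub::ChainedPolicy` parameters mirror the internal hub in dispatch_reduce.cuh as of recent CUB releases, but this is an assumption about internals, not a supported public API; there is currently no documented entry point for passing such a hub into `cub::DeviceReduce`, which is what this issue asks for.

```cpp
#include <cub/cub.cuh>

// Hypothetical custom tuning for device reduce; a sketch, not a supported API.
struct CustomReducePolicyHub
{
  // A single policy tier covering all architectures, chained to itself.
  struct MaxPolicy : cub::ChainedPolicy<300, MaxPolicy, MaxPolicy>
  {
    // Fewer items per thread than the default, to relieve register pressure
    // when the input iterator itself does non-trivial work.
    using ReducePolicy = cub::AgentReducePolicy<
      128,                                // threads per block
      8,                                  // items per thread
      float,                              // compute type used for scaling
      4,                                  // vector load length
      cub::BLOCK_REDUCE_WARP_REDUCTIONS,  // block-level reduction algorithm
      cub::LOAD_LDG>;                     // cache load modifier

    // The dispatch layer also expects policies for the single-tile and
    // segmented code paths; reusing ReducePolicy keeps the sketch short.
    using SingleTilePolicy      = ReducePolicy;
    using SegmentedReducePolicy = ReducePolicy;
  };
};
```

Recent versions of the internal `cub::DispatchReduce` template accept a policy hub as a trailing template parameter, so something like this can already be injected at the dispatch layer, but that parameter is undocumented and its position and requirements have shifted between releases, which is why a stable, user-facing knob would be preferable.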