Closed tom91136 closed 3 years ago
Instead of exposing the partitioner as a runtime option, which introduces a layer of abstraction and indirection, maybe it should be exposed as a compile-time option, similar to how we choose between the different CUDA allocation options.
Ready for review again
This PR adds the TBB (or oneTBB) implementation, thus closing #74.
The implementation is fairly straightforward as TBB exposes a very similar API to SYCL but without device/host memory distinctions. We expose the partitioner parameter, common to most parallel algorithms, as the device option:
The partitioner isn't an abstract class, so TBB implements it as overloads for each exported `parallel_*` function. To allow selecting the partitioner as a runtime option, C++14 generic lambdas (`auto` parameters) with forwarding were used to specialise each case with minimal (very likely zero) runtime penalty.

Performance should be comparable to GCC's parallel STL implementation, which calls TBB internally. Benchmark results and scaling data are coming soon; don't merge yet unless those aren't needed.
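As a rough illustration of the runtime partitioner selection described above, here is a minimal, self-contained sketch. It is not the code from this PR: the `demo` namespace contains hypothetical stand-ins for TBB's partitioner types and for an overloaded `parallel_for` (they run serially so the example builds without TBB), and `Partitioner`, `with_partitioner` and `scale` are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for TBB's partitioner types: unrelated structs
// with no common base class, so they cannot be picked via virtual dispatch.
namespace demo {
struct auto_partitioner {};
struct simple_partitioner {};
struct static_partitioner {};

inline const char *name_of(const auto_partitioner &) { return "auto"; }
inline const char *name_of(const simple_partitioner &) { return "simple"; }
inline const char *name_of(const static_partitioner &) { return "static"; }

// One overload per partitioner type, mirroring the overload-based API shape.
// These stubs run serially; a real build would call the TBB function instead.
template <typename F>
void parallel_for(std::size_t begin, std::size_t end, F &&f, const auto_partitioner &) {
  assert(begin <= end);
  for (std::size_t i = begin; i < end; ++i) f(i);
}
template <typename F>
void parallel_for(std::size_t begin, std::size_t end, F &&f, const simple_partitioner &) {
  assert(begin <= end);
  for (std::size_t i = begin; i < end; ++i) f(i);
}
template <typename F>
void parallel_for(std::size_t begin, std::size_t end, F &&f, const static_partitioner &) {
  assert(begin <= end);
  for (std::size_t i = begin; i < end; ++i) f(i);
}
} // namespace demo

// Runtime option (assumed name), e.g. parsed from a command-line flag.
enum class Partitioner { Auto, Simple, Static };

// Maps the runtime value to a concrete partitioner type exactly once,
// then forwards to a generic callable that is instantiated per type.
template <typename F>
decltype(auto) with_partitioner(Partitioner p, F &&f) {
  switch (p) {
    case Partitioner::Auto:   return std::forward<F>(f)(demo::auto_partitioner{});
    case Partitioner::Simple: return std::forward<F>(f)(demo::simple_partitioner{});
    case Partitioner::Static: return std::forward<F>(f)(demo::static_partitioner{});
  }
  throw std::invalid_argument("unknown partitioner");
}

// Example kernel: multiplies every element by k and reports which
// overload was selected. The C++14 generic lambda (auto&& parameter)
// lets the compiler resolve the parallel_for overload statically.
std::string scale(Partitioner p, std::vector<double> &data, double k) {
  return with_partitioner(p, [&](auto &&part) -> std::string {
    demo::parallel_for(std::size_t{0}, data.size(),
                       [&](std::size_t i) { data[i] *= k; }, part);
    return demo::name_of(part);
  });
}
```

The only runtime cost in this sketch is the single `switch` per kernel launch; inside each branch the overload is resolved at compile time, which is the basis for the "very likely zero" penalty claim. A compile-time option, as suggested in the review comment, would remove even that branch at the price of rebuilding to change partitioners.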