Initial TBB implementation

tom91136 commented 3 years ago

This PR adds the TBB (or oneTBB) implementation, thus closing #74.

The implementation is fairly straightforward as TBB exposes a very similar API to SYCL but without device/host memory distinctions. We expose the partitioner parameter, common to most parallel algorithms, as the device option:

> tbb-stream --list
[0] auto partitioner
[1] affinity partitioner
[2] static partitioner
[3] simple partitioner
See https://spec.oneapi.com/versions/latest/elements/oneTBB/source/algorithms.html#partitioners for more details

Partitioners don't share an abstract base class, so TBB accepts them through overloads of each exported parallel_* function. To allow the partitioner to be selected at runtime, C++14 generic lambdas (auto parameters) with forwarding are used to specialise each case with minimal (very likely zero) runtime penalty.
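For illustration, here is a minimal sketch of that dispatch pattern. It is not the code in this PR. The `sketch` namespace holds stand-ins for the TBB partitioner types and for an overloaded `parallel_for`, so the example builds without TBB. The names `Partitioner`, `with_partitioner`, and `scale` are hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Stand-ins (assumption): real code would use tbb::auto_partitioner etc.
namespace sketch {
struct auto_partitioner     { static const char *name() { return "auto"; } };
struct affinity_partitioner { static const char *name() { return "affinity"; } };
struct static_partitioner   { static const char *name() { return "static"; } };
struct simple_partitioner   { static const char *name() { return "simple"; } };

// Serial stand-in for tbb::parallel_for; the partitioner type is part of the signature.
template <typename Body, typename P>
void parallel_for(std::size_t n, Body body, P &&) {
  for (std::size_t i = 0; i < n; ++i) body(i);
}
} // namespace sketch

// Runtime choice, e.g. parsed from the device index (hypothetical enum).
enum class Partitioner { Auto, Affinity, Static, Simple };

// Calls f with a concrete partitioner object; f is a generic lambda, so the
// compiler instantiates it once per partitioner type.
template <typename F>
auto with_partitioner(Partitioner p, F &&f) {
  switch (p) {
    case Partitioner::Auto:     return f(sketch::auto_partitioner{});
    case Partitioner::Affinity: return f(sketch::affinity_partitioner{});
    case Partitioner::Static:   return f(sketch::static_partitioner{});
    case Partitioner::Simple:   return f(sketch::simple_partitioner{});
  }
  throw std::invalid_argument("unknown partitioner");
}

// Example kernel: scales a in place and reports which partitioner type was used.
std::string scale(Partitioner p, std::vector<double> &a, double s) {
  return with_partitioner(p, [&](auto &&part) {
    using P = std::decay_t<decltype(part)>;
    sketch::parallel_for(a.size(), [&](std::size_t i) { a[i] *= s; },
                         std::forward<decltype(part)>(part));
    return std::string(P::name());
  });
}
```

The only runtime cost in this sketch is the single `switch`; everything inside the lambda is resolved statically for each partitioner type.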

Performance should be comparable to GCC's parallel STL implementation, which calls TBB internally. Benchmark and scaling results are coming soon; please don't merge yet unless those aren't needed.

tomdeakin commented 3 years ago

Instead of exposing the partitioner as a runtime option and introducing a layer of abstraction and indirection, maybe it should be exposed as a compile-time option, similar to how we choose between the different CUDA allocation options.
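As a rough sketch of what such a compile-time selection could look like (the macro names `PARTITIONER_AFFINITY`, `PARTITIONER_STATIC`, and `PARTITIONER_SIMPLE` are hypothetical, and the `sketch` types again stand in for the TBB partitioners so the snippet builds without TBB):

```cpp
#include <string>

// Stand-ins (assumption): real code would use the tbb:: partitioner types.
namespace sketch {
struct auto_partitioner     { static const char *name() { return "auto"; } };
struct affinity_partitioner { static const char *name() { return "affinity"; } };
struct static_partitioner   { static const char *name() { return "static"; } };
struct simple_partitioner   { static const char *name() { return "simple"; } };
} // namespace sketch

// Pick one partitioner type at build time, e.g. with -DPARTITIONER_STATIC.
// Falls back to the auto partitioner when no macro is defined.
#if defined(PARTITIONER_AFFINITY)
using partitioner_type = sketch::affinity_partitioner;
#elif defined(PARTITIONER_STATIC)
using partitioner_type = sketch::static_partitioner;
#elif defined(PARTITIONER_SIMPLE)
using partitioner_type = sketch::simple_partitioner;
#else
using partitioner_type = sketch::auto_partitioner;
#endif

// Kernels would pass partitioner_type{} straight to the parallel_* call;
// this helper only reports the choice.
inline std::string selected_partitioner() { return partitioner_type::name(); }
```

With this approach there is no runtime dispatch at all, at the cost of needing one binary per partitioner.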

tom91136 commented 3 years ago

Ready for review again

UoB-HPC / BabelStream

Initial TBB implementation #105