Refactor all kernels into a generic "parallel for" algorithm that supports grid-stride and block-stride loops, selectable with a model flag.
Use Occupancy APIs to portably handle devices of all sizes.
Refactor CUDA memory allocation APIs.
Print more GPU details, in particular the theoretical peak bandwidth in GB/s of the current device, using the NVML library (which is part of the CUDA Toolkit and always available).
Fix 2 bugs:
Print the "order" used to run the benchmarks (e.g. classic vs isolated).
Fix a division-by-zero bug in the solution checking.
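The difference between the two loop shapes can be sketched on the host. This is a minimal illustration in plain C++, not the CUDA code from this PR: `Launch`, `Stride`, `parallel_for`, and `count_visits` are hypothetical names, and the block and thread indices that a GPU would supply via `blockIdx`/`threadIdx` are passed in as ordinary arguments.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical launch shape; on a GPU these would be gridDim.x and blockDim.x.
struct Launch {
  std::size_t blocks;
  std::size_t threads_per_block;
};

// Hypothetical flag selecting the loop shape.
enum class Stride { Grid, Block };

// Calls f(i) for every index i that the given (block, thread) pair owns.
template <typename F>
void parallel_for(Stride s, const Launch& l, std::size_t block,
                  std::size_t thread, std::size_t n, F f) {
  assert(l.blocks > 0 && l.threads_per_block > 0);
  if (s == Stride::Grid) {
    // Grid-stride: start at the global thread id, advance by the total
    // number of threads in the launch.
    const std::size_t stride = l.blocks * l.threads_per_block;
    for (std::size_t i = block * l.threads_per_block + thread; i < n; i += stride) f(i);
  } else {
    // Block-stride: each block owns one contiguous chunk; the threads of
    // that block advance through it by the block size.
    const std::size_t chunk = (n + l.blocks - 1) / l.blocks;
    const std::size_t begin = block * chunk;
    const std::size_t end = std::min(n, begin + chunk);
    for (std::size_t i = begin + thread; i < end; i += l.threads_per_block) f(i);
  }
}

// Emulates the whole launch sequentially and counts how often each index is visited.
inline std::vector<int> count_visits(Stride s, const Launch& l, std::size_t n) {
  std::vector<int> visits(n, 0);
  for (std::size_t b = 0; b < l.blocks; ++b)
    for (std::size_t t = 0; t < l.threads_per_block; ++t)
      parallel_for(s, l, b, t, n, [&](std::size_t i) { ++visits[i]; });
  return visits;
}
```

In both shapes every index in `[0, n)` is visited exactly once, even when `n` is not a multiple of the launch size. Only the mapping from threads to indices differs.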
Cleanup CUDA
Add Serial
By @tom91136. Good to have when comparing with other parallel programming models, mostly for syntax. This also makes us consistent with CloverLeaf, TeaLeaf, and miniBUDE.
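As a syntax baseline, a serial kernel is just a plain loop. The sketch below shows a STREAM-style triad; the function name and the use of `std::vector` are assumptions made for illustration, not the signature used in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial triad: a[i] = b[i] + scalar * c[i] for every element.
template <typename T>
void triad(std::vector<T>& a, const std::vector<T>& b,
           const std::vector<T>& c, T scalar) {
  assert(a.size() == b.size() && a.size() == c.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    a[i] = b[i] + scalar * c[i];
  }
}
```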
Reuse Memory
This PR puts benchmarks in control of allocating the host memory used for verifying the results.
This lets benchmarks that use Unified Memory for the device allocations skip the host-side allocation and pass pointers to the device allocation directly to the benchmark driver.
Closes https://github.com/UoB-HPC/BabelStream/issues/128 .
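The ownership change can be sketched as follows. This is a simplified, single-array illustration in plain C++: the `Benchmark` interface, both classes, and `verify` are hypothetical, and `std::vector` stands in for the device and Unified Memory allocations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical interface: the benchmark, not the driver, decides where the
// host-readable results live.
struct Benchmark {
  virtual ~Benchmark() = default;
  virtual void run() = 0;
  virtual const double* results() = 0;  // host-readable pointer to size() values
  virtual std::size_t size() const = 0;
};

// Discrete-memory style: keeps a separate host buffer and copies into it.
class DiscreteBenchmark : public Benchmark {
  std::vector<double> device_;  // stands in for a device allocation
  std::vector<double> host_;    // host buffer owned by the benchmark
public:
  explicit DiscreteBenchmark(std::size_t n) : device_(n, 0.0), host_(n, 0.0) {}
  void run() override { for (double& x : device_) x = 1.0; }
  const double* results() override {
    host_ = device_;  // stands in for a device-to-host copy
    return host_.data();
  }
  std::size_t size() const override { return device_.size(); }
};

// Unified-memory style: no host buffer, hands out the allocation itself.
class UnifiedBenchmark : public Benchmark {
  std::vector<double> unified_;  // stands in for a Unified Memory allocation
public:
  explicit UnifiedBenchmark(std::size_t n) : unified_(n, 0.0) {}
  void run() override { for (double& x : unified_) x = 1.0; }
  const double* results() override { return unified_.data(); }
  std::size_t size() const override { return unified_.size(); }
};

// Driver side: verifies through whatever pointer the benchmark returns.
inline bool verify(Benchmark& b, double expected) {
  b.run();
  const double* r = b.results();
  assert(r != nullptr);
  for (std::size_t i = 0; i < b.size(); ++i)
    if (r[i] != expected) return false;
  return true;
}
```

The driver code is identical for both variants; only the unified variant avoids the extra allocation and copy.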
Cleanup C++ Standard Parallelism
Merge the 3 implementations into one with different flags for data c++17, data c++23, and indices. Also annotate workarounds with a #define WORKAROUND and print whether the current implementation is not conforming.
Adds support for AdaptiveCpp (CI not added yet; will be done later as part of removing hipSYCL).
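To illustrate the difference between a data-oriented and an index-oriented formulation, here is a minimal C++17 sketch of a triad written both ways, together with one way a `WORKAROUND` macro could drive a conformance report. All function names are hypothetical, and the execution policy (for example `std::execution::par_unseq` as the first algorithm argument) is deliberately left out so the sketch builds without a parallel backend.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Assumed convention: WORKAROUND is defined whenever a non-conforming
// workaround is compiled in.
// #define WORKAROUND

// Reports whether this build relies on a workaround.
inline std::string conformance() {
#ifdef WORKAROUND
  return "non-conforming (workaround enabled)";
#else
  return "conforming";
#endif
}

// Data-oriented: the algorithm iterates over the arrays themselves.
inline void triad_data(std::vector<double>& a, const std::vector<double>& b,
                       const std::vector<double>& c, double s) {
  assert(a.size() == b.size() && a.size() == c.size());
  std::transform(b.begin(), b.end(), c.begin(), a.begin(),
                 [s](double bi, double ci) { return bi + s * ci; });
}

// Index-oriented: the algorithm iterates over indices and the body does the
// array accesses. The indices are materialized here only for simplicity.
inline void triad_indices(std::vector<double>& a, const std::vector<double>& b,
                          const std::vector<double>& c, double s) {
  assert(a.size() == b.size() && a.size() == c.size());
  std::vector<std::size_t> idx(a.size());
  std::iota(idx.begin(), idx.end(), std::size_t{0});
  std::for_each(idx.begin(), idx.end(),
                [&](std::size_t i) { a[i] = b[i] + s * c[i]; });
}
```

A driver could print `conformance()` once at startup so that results obtained with a workaround are clearly marked.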