Cleanup CUDA

Refactor all kernels into a generic "parallel for" algorithm that supports grid-stride and block-stride loops, configurable with a model flag.
Use Occupancy APIs to portably handle devices of all sizes.
Refactor CUDA memory allocation APIs.
Prints more GPU details, in particular the theoretical peak bandwidth (in GB/s) of the current device, obtained via the NVML library (which is part of the CUDA Toolkit and always available).
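
To make the difference between the two loop shapes concrete, here is a minimal host-side sketch of how indices could be assigned to threads. It is not the kernel code from this PR: `Stride`, `parallel_for_thread`, and `visit_counts` are hypothetical names, the `block` and `thread` arguments stand in for `blockIdx.x` and `threadIdx.x`, and the block-stride variant shown (one contiguous chunk per block) is only one plausible interpretation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical selector for the loop shape (a stand-in for the model flag).
enum class Stride { Grid, Block };

// Calls f(i) for every index owned by the simulated (block, thread) pair.
template <typename F>
void parallel_for_thread(Stride mode, std::size_t n,
                         std::size_t grid_dim, std::size_t block_dim,
                         std::size_t block, std::size_t thread, F f) {
  if (mode == Stride::Grid) {
    // Grid-stride: start at the global thread id, step by the total thread count.
    for (std::size_t i = block * block_dim + thread; i < n; i += grid_dim * block_dim) {
      f(i);
    }
  } else {
    // Block-stride (assumed form): each block owns one contiguous chunk,
    // and its threads step through that chunk by block_dim.
    const std::size_t chunk = (n + grid_dim - 1) / grid_dim;
    const std::size_t begin = block * chunk;
    const std::size_t end = (begin + chunk < n) ? begin + chunk : n;
    for (std::size_t i = begin + thread; i < end; i += block_dim) {
      f(i);
    }
  }
}

// Runs every simulated thread and records how often each index is visited.
std::vector<int> visit_counts(Stride mode, std::size_t n,
                              std::size_t grid_dim, std::size_t block_dim) {
  std::vector<int> hits(n, 0);
  for (std::size_t b = 0; b < grid_dim; ++b) {
    for (std::size_t t = 0; t < block_dim; ++t) {
      parallel_for_thread(mode, n, grid_dim, block_dim, b, t,
                          [&hits](std::size_t i) { ++hits[i]; });
    }
  }
  return hits;
}
```

In both shapes every index should be visited exactly once for any grid and block size, which is what lets the launch configuration be chosen freely (for example from an occupancy query) without changing the kernel body.
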
Fixes 2 bugs:
- Prints the "order" used to run the benchmarks (e.g. classic vs isolated)
- Fixes a division by zero bug in the solution checking
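
The PR text does not say how the division by zero was fixed. One common way to guard a relative-error check is sketched below; `checked_error` is a hypothetical helper and the fallback to absolute error is an assumption, not necessarily what the solution checker does.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: relative error of `actual` against `expected`.
// Falls back to the absolute error when `expected` is zero (or subnormal),
// so the division can never produce inf or NaN from a zero denominator.
double checked_error(double expected, double actual) {
  const double diff = std::fabs(actual - expected);
  const double scale = std::fabs(expected);
  if (scale < std::numeric_limits<double>::min()) {
    return diff;
  }
  return diff / scale;
}
```
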

Add Serial

By @tom91136. A good thing to have when comparing with other parallel programming models, mostly for syntax. This also makes us consistent with CloverLeaf, TeaLeaf, and miniBUDE.

Reuse Memory

This PR puts benchmarks in control of allocating the host memory used for verifying the results.

This lets benchmarks that use Unified Memory for the device allocations skip the host-side allocation and simply pass pointers to the device allocation to the benchmark driver.
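
The ownership change can be pictured with a small host-only sketch. All names here (`Benchmark`, `host_view`, `CopyingBenchmark`, `UnifiedBenchmark`, `sum`) are hypothetical and do not reflect the actual BabelStream interface; `std::vector` stands in for both device and managed allocations.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical interface: the benchmark, not the driver, decides where the
// host-visible copy of its data lives.
struct Benchmark {
  virtual ~Benchmark() = default;
  // Returns a pointer from which the driver can read size() doubles.
  virtual const double* host_view() = 0;
  virtual std::size_t size() const = 0;
};

// Discrete-memory style: keeps a separate host buffer and copies into it.
class CopyingBenchmark : public Benchmark {
  std::vector<double> device_;  // stands in for a device allocation
  std::vector<double> host_;    // extra host-side allocation
public:
  explicit CopyingBenchmark(std::size_t n) : device_(n, 1.0), host_(n) {}
  const double* host_view() override {
    host_ = device_;  // simulated device-to-host copy
    return host_.data();
  }
  std::size_t size() const override { return device_.size(); }
};

// Unified-memory style: the single allocation is host-accessible, so the
// benchmark hands out that pointer and never allocates a host buffer.
class UnifiedBenchmark : public Benchmark {
  std::vector<double> unified_;  // stands in for a managed allocation
public:
  explicit UnifiedBenchmark(std::size_t n) : unified_(n, 1.0) {}
  const double* host_view() override { return unified_.data(); }
  std::size_t size() const override { return unified_.size(); }
};

// Driver-side verification step that works with either kind of benchmark.
double sum(Benchmark& b) {
  const double* p = b.host_view();
  double total = 0.0;
  for (std::size_t i = 0; i < b.size(); ++i) total += p[i];
  return total;
}
```

The driver only ever sees a pointer, so it does not need to know whether a copy happened behind it.
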

Closes https://github.com/UoB-HPC/BabelStream/issues/128 .

Cleanup C++ Standard Parallelism

Merge the 3 implementations into one, with flags to select between the data (C++17), data (C++23), and indices variants. Also annotate workarounds with a #define WORKAROUND and print whether the current implementation is non-conforming. Adds support for AdaptiveCpp (CI not added yet; will be done later as part of removing hipSYCL).
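
As a rough illustration of the "data" versus "indices" styles, the sketch below writes a triad kernel both ways with standard algorithms. The function names are hypothetical, the execution policy is deliberately left out to keep the example dependency-free (a parallel build would pass one, such as `std::execution::par_unseq`, as the first argument), and the materialized index vector is a simplification; a real implementation would more likely use a counting iterator or range.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Data style: iterate over the elements themselves.
void triad_data(std::vector<double>& a, const std::vector<double>& b,
                const std::vector<double>& c, double s) {
  // A parallel version would take an execution policy as the first argument.
  std::transform(b.begin(), b.end(), c.begin(), a.begin(),
                 [s](double bi, double ci) { return bi + s * ci; });
}

// Indices style: iterate over 0..n-1 and index into the arrays.
void triad_indices(std::vector<double>& a, const std::vector<double>& b,
                   const std::vector<double>& c, double s) {
  // Simplification: store the indices explicitly instead of using a
  // counting iterator.
  std::vector<std::size_t> idx(a.size());
  std::iota(idx.begin(), idx.end(), std::size_t{0});
  std::for_each(idx.begin(), idx.end(),
                [&, s](std::size_t i) { a[i] = b[i] + s * c[i]; });
}
```
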

UoB-HPC / BabelStream

Cleanup CUDA, Reuse Memory, Add Serial Model, Cleanup Std Parallelism #202
