- Refactors all kernels into a generic "parallel for" algorithm (sketched below):
  - Supports grid-stride and block-stride loops, selectable via the model flag
  - Handles devices of different sizes via the CUDA occupancy APIs
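A minimal sketch of what such a generic parallel-for can look like. The names `parallel_for`, `Stride`, and `launch` are illustrative, not necessarily this PR's actual API, and device lambdas require `nvcc --extended-lambda`:

```cuda
#include <cuda_runtime.h>

enum class Stride { Grid, Block };  // hypothetical stride-policy flag

// Generic "parallel for": each kernel becomes a call to this template
// with a per-element functor f applied to every index in [0, n).
template <typename F>
__global__ void parallel_for(size_t n, Stride stride, F f) {
  if (stride == Stride::Grid) {
    // Grid-stride loop: threads hop by the total thread count, so any
    // grid size covers any problem size.
    for (size_t i = blockIdx.x * size_t(blockDim.x) + threadIdx.x; i < n;
         i += size_t(gridDim.x) * blockDim.x)
      f(i);
  } else {
    // Block-stride loop: each block owns a contiguous chunk; threads
    // within the block hop by blockDim.x.
    size_t chunk = (n + gridDim.x - 1) / gridDim.x;
    size_t begin = size_t(blockIdx.x) * chunk;
    size_t end   = begin + chunk < n ? begin + chunk : n;
    for (size_t i = begin + threadIdx.x; i < end; i += blockDim.x)
      f(i);
  }
}

// The occupancy API sizes the launch for whatever device is current,
// which is how the same code adapts to GPUs of different sizes.
template <typename F>
void launch(size_t n, Stride stride, F f) {
  int minGrid = 0, block = 0;
  cudaOccupancyMaxPotentialBlockSize(&minGrid, &block, parallel_for<F>);
  parallel_for<<<minGrid, block>>>(n, stride, f);
}

// Usage, e.g. a copy kernel over device pointers src/dst:
//   launch(n, Stride::Grid, [=] __device__(size_t i) { dst[i] = src[i]; });
```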
- Refactors the memory allocation APIs
- Prints more GPU details, in particular the theoretical peak bandwidth (GB/s) of the current device, obtained via the NVML library (which ships with the CUDA Toolkit and is therefore always available); see the sketch below
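A hedged sketch of deriving the theoretical peak bandwidth from NVML. The ×2 transfers-per-clock factor is a DDR-style assumption and can differ by memory type, so treat the result as an estimate; error handling is omitted and device index 0 is assumed (link with `-lnvml`):

```cuda
#include <cstdio>
#include <nvml.h>

int main() {
  nvmlInit();
  nvmlDevice_t dev;
  nvmlDeviceGetHandleByIndex(0, &dev);

  unsigned int memClockMHz = 0, busWidthBits = 0;
  nvmlDeviceGetMaxClockInfo(dev, NVML_CLOCK_MEM, &memClockMHz);
  nvmlDeviceGetMemoryBusWidth(dev, &busWidthBits);

  // GB/s = clock [MHz] * 1e6 * (bus width [bits] / 8) * 2 (DDR) / 1e9
  double peakGBs = memClockMHz * 1e6 * (busWidthBits / 8.0) * 2.0 / 1e9;
  printf("theoretical peak bandwidth: %.1f GB/s\n", peakGBs);

  nvmlShutdown();
  return 0;
}
```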
- Fixes 2 bugs:
  - Prints the "order" used to run the benchmarks (e.g. classic vs. isolated)
  - Fixes a division-by-zero bug in the solution checking (guard sketched below)
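The division-by-zero fix presumably guards a relative-error computation in the checker; a sketch of that pattern, with hypothetical names:

```cuda
#include <cmath>

// Hypothetical helper: compare a result against its expected value using
// relative error, falling back to absolute error when the expected value
// is zero (the case that would otherwise divide by zero).
bool close_enough(double got, double expected, double tol = 1e-8) {
  double err = std::fabs(got - expected);
  if (expected != 0.0) err /= std::fabs(expected);
  return err <= tol;
}
```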