- Refactors all kernels into a generic "parallel for" algorithm (sketched below):
  - Supports grid-stride and block-stride loops, selectable via the model flag
  - Handles devices of different sizes via the CUDA occupancy APIs
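A minimal sketch of what such a generic parallel-for can look like. The names `parallel_for`, `Stride`, and `launch` are illustrative, not necessarily this PR's actual API, and device lambdas require `nvcc --extended-lambda`:

```cuda
#include <cuda_runtime.h>

enum class Stride { Grid, Block };  // hypothetical stride-policy flag

// Generic "parallel for": each kernel becomes a call to this template
// with a per-element functor f applied to every index in [0, n).
template <typename F>
__global__ void parallel_for(size_t n, Stride stride, F f) {
  if (stride == Stride::Grid) {
    // Grid-stride loop: threads hop by the total thread count, so any
    // grid size covers any problem size.
    for (size_t i = blockIdx.x * size_t(blockDim.x) + threadIdx.x; i < n;
         i += size_t(gridDim.x) * blockDim.x)
      f(i);
  } else {
    // Block-stride loop: each block owns a contiguous chunk; threads
    // within the block hop by blockDim.x.
    size_t chunk = (n + gridDim.x - 1) / gridDim.x;
    size_t begin = size_t(blockIdx.x) * chunk;
    size_t end   = begin + chunk < n ? begin + chunk : n;
    for (size_t i = begin + threadIdx.x; i < end; i += blockDim.x)
      f(i);
  }
}

// The occupancy API sizes the launch for whatever device is current,
// which is how the same code adapts to GPUs of different sizes.
template <typename F>
void launch(size_t n, Stride stride, F f) {
  int minGrid = 0, block = 0;
  cudaOccupancyMaxPotentialBlockSize(&minGrid, &block, parallel_for<F>);
  parallel_for<<<minGrid, block>>>(n, stride, f);
}

// Usage, e.g. a copy kernel over device pointers src/dst:
//   launch(n, Stride::Grid, [=] __device__(size_t i) { dst[i] = src[i]; });
```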
- Refactors the memory allocation APIs
- Prints more GPU details, in particular the theoretical peak bandwidth (GB/s) of the current device, obtained via the NVML library (which ships with the CUDA Toolkit and is therefore always available); see the sketch below
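A hedged sketch of deriving the theoretical peak bandwidth from NVML. The ×2 transfers-per-clock factor is a DDR-style assumption and can differ by memory type, so treat the result as an estimate; error handling is omitted and device index 0 is assumed (link with `-lnvml`):

```cuda
#include <cstdio>
#include <nvml.h>

int main() {
  nvmlInit();
  nvmlDevice_t dev;
  nvmlDeviceGetHandleByIndex(0, &dev);

  unsigned int memClockMHz = 0, busWidthBits = 0;
  nvmlDeviceGetMaxClockInfo(dev, NVML_CLOCK_MEM, &memClockMHz);
  nvmlDeviceGetMemoryBusWidth(dev, &busWidthBits);

  // GB/s = clock [MHz] * 1e6 * (bus width [bits] / 8) * 2 (DDR) / 1e9
  double peakGBs = memClockMHz * 1e6 * (busWidthBits / 8.0) * 2.0 / 1e9;
  printf("theoretical peak bandwidth: %.1f GB/s\n", peakGBs);

  nvmlShutdown();
  return 0;
}
```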
- Fixes 2 bugs:
  - Prints the "order" used to run the benchmarks (e.g. classic vs. isolated)
  - Fixes a division-by-zero bug in the solution checking (guard sketched below)
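The division-by-zero fix presumably guards a relative-error computation in the checker; a sketch of that pattern, with hypothetical names:

```cuda
#include <cmath>

// Hypothetical helper: compare a result against its expected value using
// relative error, falling back to absolute error when the expected value
// is zero (the case that would otherwise divide by zero).
bool close_enough(double got, double expected, double tol = 1e-8) {
  double err = std::fabs(got - expected);
  if (expected != 0.0) err /= std::fabs(expected);
  return err <= tol;
}
```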