Fix performance degradation of HIP dot

The workload of the dot calculation is not consistent across the different implementations. As arraysize grows, the HIP version takes increasingly longer to complete than its counterparts.

# hip-stream -n 1500 -s $((1<<30)) | grep Dot
Dot         1376603.333 0.01248     0.01266     0.01251
# cuda-stream -n 1500 -s $((1<<30)) | grep Dot
Dot         1444860.830 0.01189     0.01199     0.01193

The HIP version currently derives 'dot_num_blocks' from arraysize. That value is used both as the kernel grid size and as the iteration count of the final reduction in the host code. The CUDA counterpart derives 'dot_num_blocks' from the number of SMs (based on GPU specs). The CUDA approach should give more reliable results because of higher occupancy and a more reasonable host-side reduction overhead.
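
The difference between the two sizing strategies can be sketched on the host side alone. The following is a minimal illustration, not the actual BabelStream code: the constants kThreadsPerBlock, kElementsPerThread and kBlocksPerSM, the function names, and the SM count of 108 mentioned afterwards are all assumptions made for this example.

```cpp
#include <cstddef>

// Hypothetical constants, chosen only for illustration (not from BabelStream).
constexpr std::size_t kThreadsPerBlock = 1024;  // assumed threads per block
constexpr std::size_t kElementsPerThread = 4;   // assumed elements per thread
constexpr std::size_t kBlocksPerSM = 4;         // assumed blocks launched per SM

// Strategy A: grid size derived from the array size.
// The number of blocks grows linearly with the input.
constexpr std::size_t blocks_from_array_size(std::size_t array_size) {
  return array_size / (kThreadsPerBlock * kElementsPerThread);
}

// Strategy B: grid size derived from the device's SM count.
// The number of blocks stays fixed regardless of the input size.
constexpr std::size_t blocks_from_sm_count(std::size_t sm_count) {
  return sm_count * kBlocksPerSM;
}

// Final reduction on the host: one partial sum per block, so the loop
// trip count equals the grid size chosen above.
double reduce_partial_sums(const double* partial_sums, std::size_t num_blocks) {
  double total = 0.0;
  for (std::size_t i = 0; i < num_blocks; ++i) {
    total += partial_sums[i];
  }
  return total;
}
```

Under these assumed constants, an array of 1<<30 elements gives 262144 blocks with strategy A, while a device with 108 SMs gives 432 blocks with strategy B. The host loop in reduce_partial_sums therefore runs several hundred times more iterations in the first case, and that count keeps growing with arraysize.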

UoB-HPC/BabelStream #207