UoB-HPC / BabelStream

STREAM, for lots of devices written in many programming models

Fix performance degradation of HIP dot #207

Open ddmatsu opened 3 months ago

ddmatsu commented 3 months ago

The dot workload is not configured consistently across the different implementations, and the time the HIP version takes to complete dot grows as arraysize increases.

# hip-stream -n 1500 -s $((1<<30)) | grep Dot
Dot         1376603.333 0.01248     0.01266     0.01251
# cuda-stream -n 1500 -s $((1<<30)) | grep Dot
Dot         1444860.830 0.01189     0.01199     0.01193

The HIP version currently derives 'dot_num_blocks' from arraysize. This value is used both as the kernel grid size and as the iteration count for the final reduction in the host code. The CUDA counterpart instead derives 'dot_num_blocks' from the number of SMs (based on the GPU specs). The CUDA approach should give more reliable results because it achieves higher occupancy and keeps the overhead of the host-side reduction more reasonable.
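To make the difference concrete, here is a minimal host-only sketch of the two sizing rules. It is not the BabelStream code: the function names and the constants `kThreadsPerBlock`, `kElementsPerThread`, and `kBlocksPerSM` are assumptions chosen for illustration, and the SM count is passed in as a plain number rather than queried from a device.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative constants (assumptions, not taken from the BabelStream sources).
constexpr std::size_t kThreadsPerBlock = 1024;   // assumed thread-block size
constexpr std::size_t kElementsPerThread = 256;  // assumed work per thread in the size-based rule
constexpr std::size_t kBlocksPerSM = 4;          // assumed multiplier in the device-based rule

// Size-based rule: the grid, and therefore the number of partial sums
// the host must add up, grows linearly with the array size.
inline std::size_t blocks_from_array_size(std::size_t array_size) {
  return array_size / (kThreadsPerBlock * kElementsPerThread);
}

// Device-based rule: the grid depends only on the hardware, so the
// host-side reduction cost stays constant as the array grows.
inline std::size_t blocks_from_sm_count(std::size_t sm_count) {
  assert(sm_count > 0);
  return sm_count * kBlocksPerSM;
}

// Final reduction on the host: one iteration per block-level partial sum.
inline double reduce_partials(const std::vector<double>& partials) {
  double sum = 0.0;
  for (double p : partials) {
    sum += p;
  }
  return sum;
}
```

With these assumed constants, an array of 2^30 elements gives 4096 blocks under the size-based rule, while a hypothetical device with 108 SMs gives 432 blocks under the device-based rule, regardless of the array size. The length of the vector passed to `reduce_partials` follows the same numbers, which is where the host-side overhead diverges.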