ROCm / rocHPL

High Performance Linpack for Next-Generation AMD HPC Accelerators

Low performance of rocHPL compiled with Clang #12

Open Felloty opened 1 month ago

Hello there!

I'm having a performance issue while running the rocHPL benchmark, which was compiled with the ROCm hipcc CXX compiler.

I have built rocHPL twice, with two different compilers: once with the default GNU 7.5.0 C++ compiler, and once with hipcc (Clang 15.0.0) taken from the rocm/5.3.3/hip/bin directory.

For the first build (GNU), I used the stock install.sh script with the --with-rocm option pointing to the rocm/5.3.3 directory, which also provides rocBLAS, and the --with-mpi option pointing to my existing OpenMPI installation.

The second build (hipcc) used a modified install.sh script in which -DCMAKE_CXX_COMPILER=hipcc was added to the cmake_common_options variable.

All other install.sh options were the same as for the first build.

When I run both builds on an AMD Radeon Instinct MI50 (32 GB) with the same HPL.dat configuration, I get different results.

The GFLOPS achieved by the hipcc build is consistently lower than that of the GNU build, by roughly 18% to 30% in my runs.

For example, here are the performance results for several values of N, using the same HPL.dat file otherwise:

N      P  Q  VRAM usage  GFLOPS (gnu)  GFLOPS (hipcc)
45312  1  1  51%         3.687e+03     2.596e+03
54912  1  1  74%         4.143e+03     3.181e+03
62512  1  1  95%         4.345e+03     3.573e+03
63512  1  1  98%         4.387e+03     3.588e+03
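
To put a number on the gap, the relative loss of the hipcc build can be computed directly from the table above. This is only a minimal sketch: the GFLOPS values are copied from the table, while the `results` dictionary and the `slowdown_percent` helper are names made up here for illustration and are not part of rocHPL.

```python
# GFLOPS values copied from the table above: N -> (gnu, hipcc).
# The dictionary layout and helper name are illustrative assumptions.
results = {
    45312: (3.687e3, 2.596e3),
    54912: (4.143e3, 3.181e3),
    62512: (4.345e3, 3.573e3),
    63512: (4.387e3, 3.588e3),
}


def slowdown_percent(gnu_gflops, hipcc_gflops):
    """Return how much lower the hipcc result is than the GNU result, in percent."""
    return (1.0 - hipcc_gflops / gnu_gflops) * 100.0


# Relative loss for each problem size N.
slowdowns = {n: slowdown_percent(gnu, hipcc) for n, (gnu, hipcc) in results.items()}

for n, pct in slowdowns.items():
    print(f"N={n}: hipcc build is {pct:.1f}% lower than the GNU build")
```

The gap appears to shrink as N grows, which may hint that the difference sits in a part of the run whose relative cost decreases with problem size rather than in the rocBLAS GPU kernels themselves; that reading is a guess, not something I have verified.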

What causes this to happen? How can I correctly compile rocHPL using hipcc and avoid performance issues during benchmarking?