RRZE-HPC / likwid

Performance monitoring and benchmarking suite
https://hpc.fau.de/research/tools/likwid/
GNU General Public License v3.0

AMD 7773X peak performance mismatch between likwid and uprof #609

Closed kadircs closed 5 months ago

kadircs commented 6 months ago

likwid-bench reports roughly 19% higher peak flops than uprof (numbers below). Could you please help me find my mistake in running likwid?

uprof SP: 9015 GFlop/s
uprof DP: 4507 GFlop/s
theoretical DP: 4506 GFlop/s
likwid SP: 10702 GFlop/s
likwid DP: 5358 GFlop/s

Theoretical peak: 128 cores * 16 DP flops/cycle * 2.2 GHz = 4505.6 GFlop/s DP
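For readers who want to redo the arithmetic, here is a minimal sketch that recomputes the theoretical peak and the likwid/uprof ratios from the numbers quoted in this post. The breakdown of 16 DP flops per cycle into two FMA pipes, four doubles per 256-bit vector, and two flops per FMA is an assumption about the core and is not stated in the post; the SP figure simply doubles it.

```python
# All inputs are copied from the post; this script measures nothing itself.
CORES = 128                # one thread per physical core
BASE_GHZ = 2.2             # base frequency used in the post
# Assumption: 2 FMA pipes * 4 doubles per 256-bit vector * 2 flops per FMA
DP_FLOPS_PER_CYCLE = 2 * 4 * 2
SP_FLOPS_PER_CYCLE = 2 * DP_FLOPS_PER_CYCLE


def peak_gflops(cores, flops_per_cycle, ghz):
    """Theoretical peak in GFlop/s for a given core count and clock."""
    return cores * flops_per_cycle * ghz


peak_dp = peak_gflops(CORES, DP_FLOPS_PER_CYCLE, BASE_GHZ)
peak_sp = peak_gflops(CORES, SP_FLOPS_PER_CYCLE, BASE_GHZ)

# Reported values in GFlop/s, as listed in the post.
reported = {
    "uprof SP": 9015,
    "uprof DP": 4507,
    "likwid SP": 10702,
    "likwid DP": 5358,
}

ratio_dp = reported["likwid DP"] / reported["uprof DP"]
ratio_sp = reported["likwid SP"] / reported["uprof SP"]

print(f"theoretical DP peak: {peak_dp:.1f} GFlop/s")
print(f"theoretical SP peak: {peak_sp:.1f} GFlop/s")
print(f"likwid/uprof DP: {ratio_dp:.3f}")
print(f"likwid/uprof SP: {ratio_sp:.3f}")
```

Both ratios come out close to 1.19, so the SP and DP results are off by the same factor. That would be consistent with a clock-frequency effect rather than a kernel-specific counting error, although the numbers alone do not prove it.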

L1 cache size is 32 KB. Number of threads is 128.

srun --nodes=1 --cpus-per-task=128 --threads-per-core=1 --partition=7773X -t 1-0:00 --hint=nomultithread likwid-bench -t peakflops_sp_avx_fma -W N:2048kB:128
MFlops/s:               10724915.50

srun --nodes=1 --cpus-per-task=128 --threads-per-core=1 --partition=7773X -t 1-0:00 --hint=nomultithread likwid-bench -t peakflops_sp_avx_fma -W N:1280kB:128
MFlops/s:               10701201.46

srun --nodes=1 --cpus-per-task=128 --threads-per-core=1 --partition=7773X -t 1-0:00 --hint=nomultithread likwid-bench -t peakflops_avx_fma -W N:2048kB:128
MFlops/s:               5358134.62

srun --nodes=1 --cpus-per-task=128 --threads-per-core=1 --partition=7773X -t 1-0:00 --hint=nomultithread likwid-bench -t peakflops_avx_fma -W N:1280kB:128
MFlops/s:               5334774.14

srun --nodes=1 --cpus-per-task=128 --threads-per-core=1 --partition=7773X -t 1-0:00 --hint=nomultithread likwid-bench -t peakflops_avx_fma -W N:5120kB:128
MFlops/s:               5257747.25

srun --nodes=1 --cpus-per-task=128 --threads-per-core=1 --partition=7773X -t 1-0:00 --hint=nomultithread likwid-bench -t peakflops_avx_fma -W N:10240kB:128
MFlops/s:               5267745.97

TomTheBear commented 6 months ago

In your theoretical peak calculation, you use the CPU base frequency of 2.2 GHz. This is reasonable, since turbo frequencies are rarely reached with all HW threads active. But in theory, the chip can boost up to 3.5 GHz. Does likwid-bench report 2.2 GHz or some higher value? You can also wrap it with likwid-perfctr to get the actual clock frequency. Otherwise, I have no immediate idea.
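One way to test the frequency hypothesis without extra tooling is to ask which clock the measured rate would imply. The sketch below uses the N:2048kB DP result from the first post and assumes 16 DP flops per cycle per core, the same figure used in the theoretical peak. It is a back-of-the-envelope estimate and not a measurement of the actual clock.

```python
# Inputs taken from the thread; the per-cycle throughput is an assumption.
MEASURED_MFLOPS = 5358134.62   # likwid-bench peakflops_avx_fma, N:2048kB:128
CORES = 128
DP_FLOPS_PER_CYCLE = 16        # assumed peak throughput per core per cycle
BASE_GHZ = 2.2                 # base clock quoted in the thread
BOOST_GHZ = 3.5                # maximum boost clock quoted in the thread

# Clock each core would need to sustain the measured rate at full throughput.
implied_ghz = MEASURED_MFLOPS * 1e6 / (CORES * DP_FLOPS_PER_CYCLE) / 1e9

print(f"implied clock: {implied_ghz:.3f} GHz")
print(f"above base clock: {implied_ghz > BASE_GHZ}")
print(f"within boost range: {implied_ghz <= BOOST_GHZ}")
```

The implied clock is about 2.6 GHz, which lies between the 2.2 GHz base and the 3.5 GHz boost limit, so an all-core boost above base frequency would be enough to explain the likwid-bench number.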

kadircs commented 6 months ago

likwid-bench reports 2.2 GHz as seen below:

srun --nodes=1 --cpus-per-task=128 --threads-per-core=1 --partition=7773X -t 1-0:00 --hint=nomultithread likwid-bench -t peakflops_avx_fma -W N:2048kB:128
Cycles:                 3317739722
CPU Clock:              2200037484
Cycle Clock:            2200037484
Time:                   1.508038e+00 sec
Iterations:             134217728
Iterations per thread:  1048576
Inner loop executions:  500
Size (Byte):            2048000
Size per thread:        16000
Number of Flops:        8053063680000
MFlops/s:               5340093.99
Data volume (Byte):     2147483648000
MByte/s:                1424025.06
Cycles per update:      0.012360
Cycles per cacheline:   0.098876
Loads per update:       1
Stores per update:      0
Load bytes per element: 8
Store bytes per elem.:  0
Instructions:           1275068416032
UOPs:                   1207959552000
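The fields in this output can be cross-checked against each other. The sketch below is an illustration built only from the values printed above: it recomputes the runtime and MFlops/s, then derives flops per reported cycle per thread. The 16 DP flops per cycle used for the implied clock is an assumption carried over from the theoretical peak calculation.

```python
# Values copied from the likwid-bench output above.
cycles = 3317739722
cpu_clock_hz = 2200037484
time_s = 1.508038
flops = 8053063680000
threads = 128
DP_FLOPS_PER_CYCLE = 16   # assumed hardware peak per core per cycle

# Cycles divided by the reported clock reproduces the reported runtime.
derived_time_s = cycles / cpu_clock_hz

# MFlops/s follows from the flop count and the runtime.
derived_mflops = flops / time_s / 1e6

# Flops per reported cycle per thread, compared with the assumed peak.
flops_per_cycle = flops / cycles / threads

# Clock needed to deliver that rate at 16 flops per real core cycle.
implied_ghz = flops_per_cycle / DP_FLOPS_PER_CYCLE * cpu_clock_hz / 1e9

print(f"derived time:      {derived_time_s:.6f} s")
print(f"derived MFlops/s:  {derived_mflops:.2f}")
print(f"flops/cycle/thread: {flops_per_cycle:.2f}")
print(f"implied clock:     {implied_ghz:.2f} GHz")
```

Cycles divided by CPU Clock matches Time, which suggests that the cycle count is derived from a fixed 2.2 GHz timer and does not track the cores' actual clock. Against those cycles each thread completes about 19 flops per cycle, more than the assumed hardware limit of 16, so the cores were probably running at roughly 2.6 GHz during the benchmark.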