RRZE-HPC / likwid

Performance monitoring and benchmarking suite
https://hpc.fau.de/research/tools/likwid/
GNU General Public License v3.0

[Build] Roofline model on an AMD EPYC 9654 CPU in a CRAY EX system #604

Closed kadircs closed 5 months ago

kadircs commented 6 months ago

I am trying to obtain a roofline model using likwid on a CRAY EX system. I did a user-space installation without root privileges in the PrgEnv-cray environment and tried both of the following access modes:

ACCESSMODE=accessdaemon
ACCESSMODE=perf_event

In both cases I get the error "Setup of event ACTUAL_CPU_CLOCK on CPU 0 failed: Permission denied", as seen below:

$ salloc
salloc: Granted job allocation 803
$> export OMP_PROC_BIND=close; export OMP_STACKSIZE=64M; export PATH=$HOME/likwid-install/bin:$PATH;export LD_LIBRARY_PATH=$HOME/likwid-install/lib:$LD_LIBRARY_PATH;
$> srun likwid-perfctr -C 0 -g L2 hostname
--------------------------------------------------------------------------------
CPU name:       AMD EPYC 9654 96-Core Processor
CPU type:       AMD K19 (Zen4) architecture
CPU clock:      2.40 GHz
ERROR - [./src/includes/perfmon_perfevent.h:perfmon_setupCountersThread_perfevent:1435] Permission denied.
Setup of event ACTUAL_CPU_CLOCK on CPU 0 failed: Permission denied
ERROR - [./src/includes/perfmon_perfevent.h:perfmon_setupCountersThread_perfevent:1435] Permission denied.
Setup of event MAX_CPU_CLOCK on CPU 0 failed: Permission denied
--------------------------------------------------------------------------------
nid00001
--------------------------------------------------------------------------------
Group 1: L2
+-------------------------------+---------+------------+
|             Event             | Counter | HWThread 0 |
+-------------------------------+---------+------------+
|        ACTUAL_CPU_CLOCK       |  FIXC1  |          0 |
|         MAX_CPU_CLOCK         |  FIXC2  |          0 |
|      RETIRED_INSTRUCTIONS     |   PMC0  |    1583469 |
|      CPU_CLOCKS_UNHALTED      |   PMC1  |    2220881 |
| REQUESTS_TO_L2_GRP1_ALL_NO_PF |   PMC2  |      65509 |
|        L2_PF_HIT_IN_L2        |   PMC3  |      21509 |
+-------------------------------+---------+------------+

+-------------------------------+------------+
|             Metric            | HWThread 0 |
+-------------------------------+------------+
|      Runtime (RDTSC) [s]      |     0.0058 |
|      Runtime unhalted [s]     |          0 |
|          Clock [MHz]          |      -     |
|              CPI              |     1.4025 |
|    L2 bandwidth [MBytes/s]    |   725.6401 |
|    L2 data volume [GBytes]    |     0.0042 |
| Prefetch bandwidth [MBytes/s] |   238.2542 |
| Prefetch data volume [GBytes] |     0.0014 |
+-------------------------------+------------+

Would you please help?

TomTheBear commented 6 months ago

The fixed counters on AMD sometimes do not work. This is not limited to Cray EX systems; it is a more general issue. I have not found a way to check whether the fixed counters are usable, and I do not recommend using them because, in my experience, they are not very accurate.

The remaining counts for the general-purpose counters (PMC*) work and should give you the information you want.

kadircs commented 6 months ago

I just want to get the rooflines. Is there a tutorial to obtain the rooflines using the PMC* counters?

TomTheBear commented 6 months ago

https://github.com/RRZE-HPC/likwid/wiki/Tutorial%3A-Empirical-Roofline-Model

kadircs commented 6 months ago

I am trying to get my application's L1, L2, and LLC data traffic. The tutorial you shared showcases only DRAM traffic.

TomTheBear commented 6 months ago

Measuring L1 data traffic is difficult or impossible: almost no platform provides suitable events. L2 and LLC traffic should work.

The tutorial showcases only DRAM traffic, that's right, but the other layers are comparable. You need the bandwidth_limit_of_level_X for the roofline and the operational intensity (FP_rate / measured_bandwidth_for_level_X) for the application dot. You can derive the bandwidth limit from the data sheet or measure it with likwid-bench. For private caches, you should use -W N:<numThreads * half_size_of_level_X_for_single_thread>:<numThreads>.
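As a rough illustration of the arithmetic described above, the following Python sketch computes the operational intensity and the roofline bound for one memory level, and builds the workgroup string from the size rule. All function names and all numbers are hypothetical placeholders, not LIKWID APIs or measured values, and the `kB` size suffix in the workgroup string is an assumption about what likwid-bench accepts.

```python
def operational_intensity(fp_rate, measured_bandwidth):
    """Operational intensity = FP_rate / measured_bandwidth_for_level_X.

    With fp_rate in MFLOP/s and bandwidth in MBytes/s the result is FLOP/byte.
    """
    if measured_bandwidth <= 0:
        raise ValueError("measured bandwidth must be positive")
    return fp_rate / measured_bandwidth


def roofline_bound(peak_fp_rate, bandwidth_limit, intensity):
    """Attainable performance: the lower of the compute roof and the
    bandwidth roof (bandwidth_limit_of_level_X * intensity)."""
    return min(peak_fp_rate, bandwidth_limit * intensity)


def workgroup(num_threads, level_size_kb_per_thread, domain="N"):
    """Workgroup string using numThreads * half the per-thread cache size.

    The 'kB' suffix is an assumption; check the likwid-bench help output.
    """
    total_kb = num_threads * (level_size_kb_per_thread // 2)
    return f"{domain}:{total_kb}kB:{num_threads}"


# Hypothetical example values (not measurements from this thread).
intensity = operational_intensity(fp_rate=2000.0, measured_bandwidth=8000.0)
bound = roofline_bound(peak_fp_rate=30000.0, bandwidth_limit=40000.0,
                       intensity=intensity)
print(intensity, bound)
print(workgroup(num_threads=4, level_size_kb_per_thread=1024))
```

The application dot sits at (intensity, measured FP rate); the roofline for level X is the bound evaluated over a range of intensities.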

iustinouatu commented 5 months ago

@kadircs, how did you manage to install it for ACCESSMODE=accessdaemon without root privileges inside that HPC-machine?

I am getting this error when I run make install:

$ make install
===> INSTALL access daemon to /likwid_folder/likwid_install_dir/sbin/likwid-accessD
install: cannot change ownership of '/likwid_folder/likwid_install_dir/sbin/likwid-accessD': Operation not permitted
make: *** [Makefile:394: install_daemon] Error 1

Thank you a lot!

PS : @TomTheBear, if you have any comments on this, I would greatly appreciate them.

kadircs commented 5 months ago

@iustinouatu you need to specify an installation directory that you have access to.

georgebisbas commented 4 months ago

Hi! Sorry to be reviving this! I am having the same issue, so asking just to clarify. I do not have root privileges anywhere in the HPC cluster.

I understand that @kadircs refers to a folder where you have root rights?

Please correct me if I am wrong.

Best, George

TomTheBear commented 4 months ago

As far as I know, @kadircs got an administrator to install LIKWID with ACCESSMODE=accessdaemon on some selected nodes. He reported back through other channels, which led to the fixes in https://github.com/RRZE-HPC/likwid/pull/618

There are only two ways to get memory traffic:

Of course, there are more complicated setups but all require interaction with the sysadmins.

georgebisbas commented 4 months ago

I see then, thanks.

Just to be clear. I am not focusing specifically on AMD hardware. So, my understanding is that some sort of admin rights on a node is definitely needed, right?

TomTheBear commented 4 months ago

> Just to be clear. I am not focusing specifically on AMD hardware. So, my understanding is that some sort of admin rights on a node is definitely needed, right?

No, not in all cases. If you choose ACCESSMODE=perf_event, you can install as a regular user. What you can measure then depends on how restrictively your system is configured (/proc/sys/kernel/perf_event_paranoid; the lower the value, the more you can measure). For the Roofline Model, you need memory traffic measurements, which require a value of 0 or -1 (not recommended by me).
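To see where a given node stands, the current setting can be read directly from the file named above. The Python sketch below is only a convenience check; the helper names are hypothetical, and the threshold simply encodes the "0 or -1" requirement stated in this comment.

```python
from pathlib import Path

# Kernel setting mentioned above; lower values permit more measurements.
PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def read_paranoid(path=PARANOID_PATH):
    """Return the paranoid level as an int, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def allows_memory_traffic(level):
    """True if the level is 0 or lower, as required for memory traffic
    measurements according to the comment above."""
    return level is not None and level <= 0


if __name__ == "__main__":
    level = read_paranoid()
    if level is None:
        print("could not read", PARANOID_PATH)
    else:
        print("perf_event_paranoid =", level)
        print("memory traffic measurable:", allows_memory_traffic(level))
```

If the value is too high, only the system administrators (or a job submission option they provide) can lower it.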

Some computing centers provide special job submission options that allow measurements by reducing the paranoid value. I documented here how we do it in our center. I know other centers have something similar.

georgebisbas commented 4 months ago

Many thanks for this helpful response! I will have a look at our computing center as soon as possible! Thanks again!