RRZE-HPC / likwid

Performance monitoring and benchmarking suite
https://hpc.fau.de/research/tools/likwid/
GNU General Public License v3.0

[BUG] likwid-perfctr segfaults on Zen4 for MEM1/MEM2 group #559

Closed ntippman closed 10 months ago

ntippman commented 11 months ago

Describe the bug

likwid-perfctr segfaults when measuring the MEM1 or MEM2 group on Zen4 (tested on an AMD EPYC 9654). The crash occurs both with the likwid-perfctr command and through the LIKWID C-API.

To Reproduce

To Reproduce with a LIKWID command

[ntippman@gmz11 likwid]$ likwid-perfctr -C 2 -g MEM1 -S 200ms -V 3
--------------------------------------------------------------------------------
CPU name:   AMD EPYC 9654 96-Core Processor                
CPU type:   AMD K19 (Zen4) architecture
CPU clock:  2.40 GHz
CPU family: 25
CPU model:  17
CPU short:  zen4
CPU stepping:   1
CPU features:   FP MMX SSE SSE2 HTT MMX RDTSCP MONITOR SSSE FMA SSE4.1 SSE4.2 AES AVX RDRAND AVX2 AVX512 RDSEED SSE3 
CPU arch:   x86_64
--------------------------------------------------------------------------------
[likwid-pin] Main PID -> hwthread 2 - OK
DEBUG - [HPMinit:98] Adjusting functions for x86 architecture in daemon mode
DEBUG - [access_x86_rdpmc_init:161] Test for RDPMC for PMC counters returned 1
DEBUG - [access_x86_rdpmc_init:205] Test for RDPMC for L3 counters returned 1
DEBUG - [access_x86_rdpmc_init:221] Test for RDPMC for DataFabric counters returned 1
DEBUG - [access_client_startDaemon:157] Starting daemon /usr/local/sbin/likwid-accessD
DEBUG - [access_client_startDaemon:197] Waiting for socket file /tmp/likwid-38622
DEBUG - [access_client_startDaemon:205] Socket file /tmp/likwid-38622 exists
DEBUG - [access_client_startDaemon:235] Successfully opened socket /tmp/likwid-38622 to daemon for CPU 2
DEBUG - [HPMaddThread:143] Adding CPU 2 to access module
Executing: 
DEBUG - [perfmon_addEventSet:2246] Currently 1 groups of 2 active
DEBUG - [perfgroup_readGroup:871] Reading group MEM1 from /usr/local/share/likwid/perfgroups/zen4/MEM1.txt
DEBUG - [perfmon_addEventSet:2425] Added event ACTUAL_CPU_CLOCK for counter FIXC1 to group 0
DEBUG - [perfmon_addEventSet:2425] Added event MAX_CPU_CLOCK for counter FIXC2 to group 0
DEBUG - [perfmon_addEventSet:2425] Added event RETIRED_INSTRUCTIONS for counter PMC0 to group 0
DEBUG - [perfmon_addEventSet:2425] Added event CPU_CLOCKS_UNHALTED for counter PMC1 to group 0
DEBUG - [checkAccess:231] WARNING: Counter DFC0 does not exist
DEBUG - [perfmon_addEventSet:2356] Cannot access counter register DFC0
DEBUG - [checkAccess:231] WARNING: Counter DFC1 does not exist
DEBUG - [perfmon_addEventSet:2356] Cannot access counter register DFC1
DEBUG - [checkAccess:231] WARNING: Counter DFC2 does not exist
DEBUG - [perfmon_addEventSet:2356] Cannot access counter register DFC2
Segmentation fault (core dumped)

With the C-API, the segfault occurs in the call to perfmon_addEventSet for MEM1 or MEM2:

DEBUG - [perfmon_addEventSet:2205] Allocating new group structure for group.
DEBUG - [perfmon_addEventSet:2207] Currently 10 groups of 11 active
DEBUG - [perfgroup_readGroup:873] Reading group MEM1 from /usr/local/share/cb/collectd/share/likwid/perfgroups/zen4/MEM1.txt
DEBUG - [access_x86_msr_read:207] Read MSR counter 0xC00000E8 with RDMSR instruction on CPU 0
DEBUG - [perfmon_addEventSet:2386] Added event ACTUAL_CPU_CLOCK for counter FIXC1 to group 9
DEBUG - [access_x86_msr_read:207] Read MSR counter 0xC00000E7 with RDMSR instruction on CPU 0
DEBUG - [perfmon_addEventSet:2386] Added event MAX_CPU_CLOCK for counter FIXC2 to group 9
DEBUG - [access_x86_msr_read:207] Read MSR counter 0xC0010200 with RDMSR instruction on CPU 0
DEBUG - [access_x86_msr_write:254] Write MSR counter 0xC0010200 with WRMSR instruction on CPU 0 data 0x0
DEBUG - [perfmon_addEventSet:2386] Added event RETIRED_INSTRUCTIONS for counter PMC0 to group 9
DEBUG - [access_x86_msr_read:207] Read MSR counter 0xC0010202 with RDMSR instruction on CPU 0
DEBUG - [access_x86_msr_write:254] Write MSR counter 0xC0010202 with WRMSR instruction on CPU 0 data 0x0
DEBUG - [perfmon_addEventSet:2386] Added event CPU_CLOCKS_UNHALTED for counter PMC1 to group 9
DEBUG - [checkAccess:231] WARNING: Counter DFC0 does not exist
DEBUG - [perfmon_addEventSet:2317] Cannot access counter register DFC0
DEBUG - [checkAccess:231] WARNING: Counter DFC1 does not exist
DEBUG - [perfmon_addEventSet:2317] Cannot access counter register DFC1
DEBUG - [checkAccess:231] WARNING: Counter DFC2 does not exist
DEBUG - [perfmon_addEventSet:2317] Cannot access counter register DFC2
systemd-coredump[45229]: Process 45151 (cb-collectd) of user 0 dumped core.
systemd[1]: cb-collectd.service: Main process exited, code=dumped, status=11/SEGV
systemd[1]: cb-collectd.service: Failed with result 'core-dump'.

TomTheBear commented 10 months ago

Please compile LIKWID with DEBUG=true in config.mk (make distclean && make) and run:

$ gdb $LIKWID_BINDIR/likwid-lua
> r likwid-perfctr -C 0 -g MEM1 hostname
<wait for segfault>
> bt

From the output, it seems the DataFabric counters are not available on the system, or I made a major mistake when adding them, as they work in neither accessdaemon nor direct mode. Can you check what perf shows? There should be amd_df device(s) in /sys/devices/.
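
For anyone who wants to script the /sys/devices/ check, here is a minimal sketch in C. The helper name count_entries_with_prefix is hypothetical and not part of LIKWID; it only counts directory entries whose names start with a given prefix and does not query perf or LIKWID.

```c
#include <dirent.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a LIKWID function):
 * count entries in `dir` whose names start with `prefix`.
 * Returns -1 if the directory cannot be opened. */
int count_entries_with_prefix(const char *dir, const char *prefix)
{
    DIR *dp = opendir(dir);
    if (dp == NULL) {
        return -1;
    }

    size_t plen = strlen(prefix);
    int count = 0;
    struct dirent *entry;

    /* readdir() returns NULL once all entries have been visited. */
    while ((entry = readdir(dp)) != NULL) {
        if (strncmp(entry->d_name, prefix, plen) == 0) {
            count++;
        }
    }

    closedir(dp);
    return count;
}
```

If count_entries_with_prefix("/sys/devices", "amd_df") returns 0, no amd_df device is present there. That would be consistent with the "Counter DFC0 does not exist" warnings in the logs above.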

TomTheBear commented 10 months ago

My test system does not provide the DataFabric counters either, but the segfault itself was caused by a strcmp in the general code.
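
The thread does not say which strcmp call was involved or why it crashed. One common way strcmp segfaults is when it receives a NULL pointer, for example a name that was never set because a counter was skipped. That this happened here is an assumption. The sketch below shows a generic NULL-safe comparison; safe_streq is a hypothetical helper, not LIKWID code and not the actual fix.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from LIKWID): returns 1 only if both
 * strings are non-NULL and equal, 0 otherwise. Calling strcmp with a
 * NULL argument is undefined behavior and typically crashes. */
int safe_streq(const char *a, const char *b)
{
    if (a == NULL || b == NULL) {
        return 0;
    }
    return strcmp(a, b) == 0;
}
```

A guard like this treats a missing name as "no match" instead of dereferencing it.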