RRZE-HPC / likwid

Performance monitoring and benchmarking suite
https://hpc.fau.de/research/tools/likwid/
GNU General Public License v3.0

[BUG] likwid-perfctr segfaults on ICX when measuring MEM in direct mode #571

Closed ntippman closed 10 months ago

ntippman commented 10 months ago

Describe the bug

likwid-perfctr segfaults when trying to measure the MEM group on a 2x Intel 8360Y system in direct mode. The accessdaemon mode works fine.

To Reproduce

To Reproduce with a LIKWID command (please supply the output of the command with -V 3 added):

[ntippman@itp09 likwid]$ sudo LIKWID_FORCE=1 /usr/local/bin/likwid-perfctr -g MEM -S 1000ms -V 3

...

DEBUG - [perfmon_addEventSet:2301] Currently 1 groups of 2 active
DEBUG - [perfgroup_readGroup:873] Reading group MEM from /usr/local/share/likwid/perfgroups/ICX/MEM.txt
DEBUG - [perfmon_addEventSet:2362] Eventstring INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,TOPDOWN_SLOTS:FIXC3,CAS_COUNT_RD:MBOX0C0,CAS_COUNT_WR:MBOX0C1,CAS_COUNT_RD:MBOX1C0,CAS_COUNT_WR:MBOX1C1,CAS_COUNT_RD:MBOX2C0,CAS_COUNT_WR:MBOX2C1,CAS_COUNT_RD:MBOX3C0,CAS_COUNT_WR:MBOX3C1,CAS_COUNT_RD:MBOX4C0,CAS_COUNT_WR:MBOX4C1,CAS_COUNT_RD:MBOX5C0,CAS_COUNT_WR:MBOX5C1,CAS_COUNT_RD:MBOX6C0,CAS_COUNT_WR:MBOX6C1,CAS_COUNT_RD:MBOX7C0,CAS_COUNT_WR:MBOX7C1
DEBUG - [access_x86_msr_read:215] Read MSR counter 0x38D with RDMSR instruction on CPU 0
DEBUG - [access_x86_msr_write:262] Write MSR counter 0x38D with WRMSR instruction on CPU 0 data 0x0
DEBUG - [perfmon_addEventSet:2481] Added event INSTR_RETIRED_ANY for counter FIXC0 to group 0
DEBUG - [access_x86_msr_read:215] Read MSR counter 0x38D with RDMSR instruction on CPU 0
DEBUG - [access_x86_msr_write:262] Write MSR counter 0x38D with WRMSR instruction on CPU 0 data 0x0
DEBUG - [perfmon_addEventSet:2481] Added event CPU_CLK_UNHALTED_CORE for counter FIXC1 to group 0
DEBUG - [access_x86_msr_read:215] Read MSR counter 0x38D with RDMSR instruction on CPU 0
DEBUG - [access_x86_msr_write:262] Write MSR counter 0x38D with WRMSR instruction on CPU 0 data 0x0
DEBUG - [perfmon_addEventSet:2481] Added event CPU_CLK_UNHALTED_REF for counter FIXC2 to group 0
DEBUG - [access_x86_msr_read:215] Read MSR counter 0x38D with RDMSR instruction on CPU 0
DEBUG - [access_x86_msr_write:262] Write MSR counter 0x38D with WRMSR instruction on CPU 0 data 0x0
DEBUG - [perfmon_addEventSet:2481] Added event TOPDOWN_SLOTS for counter FIXC3 to group 0
DEBUG - [checkAccess:244] WARNING: The device for counter MBOX0C0 does not exist
DEBUG - [perfmon_addEventSet:2412] Cannot access counter register MBOX0C0
DEBUG - [checkAccess:244] WARNING: The device for counter MBOX0C1 does not exist
DEBUG - [perfmon_addEventSet:2412] Cannot access counter register MBOX0C1
Segmentation fault


TomTheBear commented 10 months ago

Thanks for reporting.

That's my major problem with the counter registers being moved to MMIO space on ICX and newer: you have to access /dev/mem, and no further security checks happen. For example, I had problems on SPR because I used a simple memcpy for a struct (three uint64_t values), and the copy worked or failed depending on the data accesses memcpy used internally. I had to copy it manually in uint8_t steps to make it work reliably on all systems. There is no documentation on how to access these registers, so I commonly assume I can use one full-width read/write. Even on the most recent SPR architecture, I found 64-bit registers that had to be read/written with two 32-bit accesses.
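The two workarounds described above (copying a struct in uint8_t steps instead of calling memcpy, and reading a 64-bit register as two 32-bit accesses) could look roughly like the sketch below. This is not LIKWID's code: the names mmio_triple_t, mmio_copy_bytes and mmio_read64_split are hypothetical, the low-half-first order and little-endian layout are assumptions, and whether a given register tolerates a given access width still has to be checked on the hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: a struct of three 64-bit values, as in the comment. */
typedef struct {
    uint64_t a;
    uint64_t b;
    uint64_t c;
} mmio_triple_t;

/* Copy n bytes out of a (possibly memory-mapped) region one byte at a time.
 * Reading through a volatile uint8_t pointer asks the compiler to emit one
 * access per byte instead of letting memcpy pick wider loads. */
void mmio_copy_bytes(void *dst, const volatile void *src, size_t n)
{
    uint8_t *d = dst;
    const volatile uint8_t *s = src;
    for (size_t i = 0; i < n; i++) {
        d[i] = s[i];
    }
}

/* Read a 64-bit register as two 32-bit accesses.
 * Assumption: the low half sits at the lower address (little-endian x86)
 * and is read first. */
uint64_t mmio_read64_split(const volatile void *reg)
{
    const volatile uint32_t *p = reg;
    uint32_t lo = p[0];
    uint32_t hi = p[1];
    return ((uint64_t)hi << 32) | lo;
}
```

With a real mapping, src and reg would point into the region obtained from mmap on /dev/mem; for a quick check, both functions can be exercised against ordinary memory.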

It seems the MBOXes are only opened for socket 1 but addEventSet checks them on socket 0:

DEBUG - [access_x86_mmio_init:409] access_x86_mmio_init for socket 1
[...]
DEBUG - [perfmon_addEventSet:2481] Added event CPU_CLK_UNHALTED_REF for counter FIXC2 to group 0
DEBUG - [access_x86_msr_read:215] Read MSR counter 0x38D with RDMSR instruction on CPU 0
DEBUG - [access_x86_msr_write:262] Write MSR counter 0x38D with WRMSR instruction on CPU 0 data 0x0
DEBUG - [perfmon_addEventSet:2481] Added event TOPDOWN_SLOTS for counter FIXC3 to group 0

There is no access_x86_mmio_init for socket 0 in the logs.
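If the crash comes from touching per-socket MMIO state that was never set up, a defensive check before any access would turn the segfault into an error code. The following is only a sketch of that idea under stated assumptions: socket_mmio_t and mmio_check_socket are hypothetical names, not LIKWID's data structures or its actual fix.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-socket state; not LIKWID's actual data structure. */
typedef struct {
    int   initialized;  /* set once the init routine ran for this socket */
    void *mmap_base;    /* base of the mapped MMIO region, NULL if unmapped */
} socket_mmio_t;

/* Return 0 if the socket's MMIO region is usable, a negative errno otherwise.
 * Callers would skip the counter instead of dereferencing a missing mapping. */
int mmio_check_socket(const socket_mmio_t *tbl, size_t n, int socket)
{
    if (tbl == NULL || socket < 0 || (size_t)socket >= n) {
        return -EINVAL;
    }
    if (!tbl[socket].initialized || tbl[socket].mmap_base == NULL) {
        return -ENODEV;
    }
    return 0;
}
```

In the situation from the logs, where only socket 1 was initialized, such a check would report -ENODEV for socket 0 rather than letting a later access fault.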