RRZE-HPC / likwid

Performance monitoring and benchmarking suite
https://hpc.fau.de/research/tools/likwid/
GNU General Public License v3.0

[FeatureRequest] Add TMA_BACKEND, TMA_BE_MEMORY and TMA_BE_CORE counter group to likwid-perfctr #466

Open ibogosavljevic opened 2 years ago

ibogosavljevic commented 2 years ago

Is your feature request related to a problem? Please describe. I am frustrated when doing top-down analysis with Intel's VTune. The tool is cumbersome and affects the results. Also, I cannot limit the analysis to just the code I am interested in, only to whole functions. LIKWID already has a TMA counter group, but it should go further with additional groups that reach the deeper levels of the hierarchy.

Describe the solution you'd like I would like to be able to do the same analysis with LIKWID. To begin with, I would like two additional groups: TMA_BE_MEMORY and TMA_BE_CORE. Here is a possible output:

TMA_BE_MEMORY

L1 Bound: 
L2 Bound:
L3 Bound:
DRAM Bound:
Store Bound:

TMA_BE_CORE

Divider: 19.7% of Clockticks
Port Utilization: 43.3% of Clockticks
    Cycles of 0 Ports Utilized: 19.5% of Clockticks
    Cycles of 1 Port Utilized: 7.1% of Clockticks
    Cycles of 2 Ports Utilized: 6.0% of Clockticks
    Cycles of 3+ Ports Utilized: 14.2% of Clockticks

Additional context You will probably need this to implement it:

This performance group measures cycles to determine the percentage of time spent in the front end, the back end, retiring and speculation. These metrics are published and verified by Intel. Further information:

Webpage describing the Top-Down Method and its usage in Intel VTune: https://software.intel.com/en-us/vtune-amplifier-help-tuning-applications-using-a-top-down-microarchitecture-analysis-method
Paper by Ahmad Yasin: https://sites.google.com/site/analysismethods/yasin-pubs/TopDown-Yasin-ISPASS14.pdf?attredirects=0
Slides by Ahmad Yasin: http://www.cs.technion.ac.il/~erangi/TMA_using_Linux_perf__Ahmad_Yasin.pdf

The Intel Icelake microarchitecture provides a distinct register for the Top-Down Method metrics.
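For reference, the level-1 decomposition from the Yasin paper, assuming a 4-wide core (i.e. pre-Icelake); the Intel event names below are taken from that paper, not from a LIKWID group file:

$$
\begin{aligned}
\text{Slots} &= 4 \cdot \text{CPU\_CLK\_UNHALTED.THREAD}\\
\text{Frontend Bound} &= \text{IDQ\_UOPS\_NOT\_DELIVERED.CORE} \,/\, \text{Slots}\\
\text{Bad Speculation} &= \left(\text{UOPS\_ISSUED.ANY} - \text{UOPS\_RETIRED.RETIRE\_SLOTS} + 4 \cdot \text{INT\_MISC.RECOVERY\_CYCLES}\right) / \text{Slots}\\
\text{Retiring} &= \text{UOPS\_RETIRED.RETIRE\_SLOTS} \,/\, \text{Slots}\\
\text{Backend Bound} &= 1 - \text{Frontend Bound} - \text{Bad Speculation} - \text{Retiring}
\end{aligned}
$$

The requested TMA_BE_MEMORY and TMA_BE_CORE groups would split Backend Bound further, which is where the additional events come in.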

TomTheBear commented 2 years ago

I understand your request but it is tricky in detail. The problem with the TMA groups is that they may require more events than there are physical counter registers. Perf and VTune apply multiplexing by frequently rescheduling the events on the available counters. Both "drivers" run in kernel-space and can access the counters directly. LIKWID has a different focus and uses the physical counters as its basis, so you cannot program more events than there are counters. TMA Level 1 (the TMA group) already requires all available physical counter registers at hardware-thread level (FIXC0-3, PMC0-3) on Intel Skylake. On Icelake, the distinct registers (TMA0-3) are used. Multiplexing could be added to LIKWID, but since it runs completely in user-space, the event-switching overhead would get large. Even with ACCESSMODE=perf_event, LIKWID does not allow more events than physical counters.

If the TMA level can be measured with the available counters, you can create the performance groups you need yourself: https://github.com/RRZE-HPC/likwid/wiki/likwid-perfctr#defining-custom-performance-groups . If a level requires multiple measurements, you can try to split the level's metrics into multiple groups. It seems you want to use the MarkerAPI. There you can use LIKWID_MARKER_SWITCH to cycle through the groups, but don't call it too often, to keep the overhead as low as possible; see the sketch below.
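A minimal sketch of that pattern, assuming two hypothetical custom groups TMA_PART1 and TMA_PART2 (the MarkerAPI macros live in likwid-marker.h since LIKWID 5.0, in likwid.h before that):

```c
#include <stdio.h>
#include <likwid-marker.h>  /* <likwid.h> for LIKWID < 5.0 */

int main(void)
{
    double sum = 0.0;
    LIKWID_MARKER_INIT;

    /* First pass: counted with the first group given via -g */
    LIKWID_MARKER_START("kernel");
    for (long i = 0; i < 100000000L; i++)
        sum += (double)i * 0.5;
    LIKWID_MARKER_STOP("kernel");

    /* Activate the next group given via -g; call this outside of
     * regions and sparingly to keep the overhead low. */
    LIKWID_MARKER_SWITCH;

    /* Second pass over the same code: counted with the second group */
    LIKWID_MARKER_START("kernel");
    for (long i = 0; i < 100000000L; i++)
        sum += (double)i * 0.5;
    LIKWID_MARKER_STOP("kernel");

    LIKWID_MARKER_CLOSE;
    printf("sum = %f\n", sum);
    return 0;
}
```

Compiled with -DLIKWID_PERFMON and linked against -llikwid, run it e.g. as likwid-perfctr -C 0 -g TMA_PART1 -g TMA_PART2 -m ./a.out; each pass then ends up in one group's results.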

Map file from TMA levels to events for a specific architecture: https://download.01.org/perfmon/TMA_Metrics.xlsx (there are also CSV variants).

ibogosavljevic commented 2 years ago

I investigated this a bit, and it turns out that to implement TMA_BACKEND, TMA_BE_MEMORY and TMA_BE_CORE we would need about 8 counters, so it is not possible to do it now.

Why is there a limit with ACCESSMODE=perf_event? perf_event can use multiplexing to record more events than there are registers, but you disabled it for some reason. Why?

TomTheBear commented 2 years ago

Short story: LIKWID uses the counter names as placeholders for the measurements when deriving metrics (e.g. PMC0+PMC1). While you could use the event name in many cases, there are difficulties, e.g. when the same event is counted twice with different counter options, or when multiple devices use the same event name.
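As a concrete illustration, a hypothetical custom group file (e.g. $HOME/.likwid/groups/skylake/TMA_PART1.txt; the events and the Retiring formula are sketched from the Skylake TMA group, not a verified definition). Note that the METRICS formulas reference counter registers, not event names:

```
SHORT Hypothetical partial TMA group

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0  UOPS_ISSUED_ANY
PMC1  UOPS_RETIRED_RETIRE_SLOTS

METRICS
Runtime (RDTSC) [s] time
CPI FIXC1/FIXC0
Retiring PMC1/(4*FIXC1)

LONG
Formulas:
Retiring = UOPS_RETIRED_RETIRE_SLOTS/(4*CPU_CLK_UNHALTED_CORE)
-
Example only: every metric is a formula over counter registers
(FIXC1, PMC0, ...), so LIKWID cannot schedule more events than
there are physical counters.
```

Because PMC0 appears in the formulas as a fixed placeholder, two different events could not share that register via multiplexing without breaking the metric definitions.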

Long story: There is a fundamental difference in how you can look at hardware performance monitoring: from the events' point of view and from the counters' point of view. While perf_event has chosen the events view, LIKWID uses the counters view. For LIKWID, there cannot be more events running than there are physical registers (counters).

Historically, LIKWID was developed more or less side-by-side with perf_event (both were released around 2009). When the development of LIKWID started, there was no easy way to get at hardware performance counters (there were only kernel patches for predecessors of perf_event), and the kernel provided everything required to do it on your own (the msr driver). In the end: we have full control, we know what was programmed, and we do not get interference from other tools (perf often does not use the designated counter for measuring cycles but a general-purpose counter on Intel architectures).

Things changed with the addition of the ARM and POWER architectures. While I had a working user-space interface for POWER (only for little-endian configurations) and ARMv7 (at that time), the main issue was that accessing hardware performance counters from user-space required loading a kernel module. In my experience, and I fully support this, custom kernel modules are not or only rarely loaded on professionally managed systems. So in order to support POWER and ARM, I added perf_event in the full knowledge that it would raise questions in the future. There are other differences as well, e.g. accessdaemon/direct mode counts everything running on a hardware thread, while perf_event can limit the measurement to the process (and its children).

I totally understand that there are features in the perf_event world that are helpful. Multiplexing is not one of them.
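For context on that last point, a minimal sketch (plain perf_event, not LIKWID code) of what multiplexing hands back to user-space: each count comes with the time the event was enabled versus actually scheduled on a counter, and the "full" count is a linear extrapolation, i.e. an estimate:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Layout matching PERF_FORMAT_TOTAL_TIME_ENABLED | _RUNNING */
struct read_format { unsigned long long value, time_enabled, time_running; };

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
                       PERF_FORMAT_TOTAL_TIME_RUNNING;

    /* Count instructions of this process on any CPU */
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; i++) x += 1.0;

    struct read_format r;
    if (read(fd, &r, sizeof(r)) != sizeof(r)) return 1;

    /* If the event was multiplexed, time_running < time_enabled and
     * the usual correction is a linear extrapolation: */
    double scaled = r.time_running
        ? (double)r.value * ((double)r.time_enabled / (double)r.time_running)
        : 0.0;
    printf("raw=%llu enabled=%llu running=%llu scaled=%.0f\n",
           r.value, r.time_enabled, r.time_running, scaled);
    close(fd);
    return 0;
}
```

With a single event there is no multiplexing and time_running equals time_enabled; once more events are requested than there are counters, time_running shrinks and the result is only an estimate, which is exactly the extrapolation LIKWID's counters view avoids.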