Summary table of likwid-perfctr shows incorrect values for "intensive" metrics

RRZE-HPC / likwid

Performance monitoring and benchmarking suite

GNU General Public License v3.0

Bug description

likwid-perfctr incorrectly reports some metrics by adding up core- or socket-local values. This happens, e.g., with:

clock frequency
CPI
runtime
operational intensity

These are "intensive" quantities, i.e., they do not scale with the size of the machine but need to be "averaged" (not literally, of course) in the proper way. In contrast, "extensive" quantities like energy consumption, memory data volume, etc., can be added across the machine to yield a useful number.
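The distinction can be illustrated with made-up numbers. The following Python sketch uses hypothetical per-socket values (not measurements from any real run) and compares adding per-socket operational intensity with computing it from the summed flops and summed traffic. All names are chosen for illustration only.

```python
# Hypothetical per-socket measurements (illustrative values, not real data).
sockets = [
    {"flops": 4.0e9, "bytes": 32.0e9},  # socket 0
    {"flops": 4.0e9, "bytes": 32.0e9},  # socket 1
]

# Extensive quantities can simply be added across sockets.
total_flops = sum(s["flops"] for s in sockets)
total_bytes = sum(s["bytes"] for s in sockets)

# Intensive quantity: operational intensity (flop/byte) of each socket.
per_socket_oi = [s["flops"] / s["bytes"] for s in sockets]

# Adding the per-socket values doubles the result for two identical sockets.
summed_oi = sum(per_socket_oi)

# The ratio of the summed extensive quantities is the machine-wide value.
machine_oi = total_flops / total_bytes

print("per socket:", per_socket_oi)
print("summed (misleading):", summed_oi)
print("ratio of sums:", machine_oi)
```

With two identical sockets the summed value is exactly twice the per-socket value, while the ratio of sums matches it.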

To Reproduce

LIKWID command and/or API usage
- likwid-perfctr -g MEM_DP -C M0:0@M1:0 likwid-bench -t triad_avx -W N:2GB:2 on a dual-socket Ice Lake 6326 system
- Operational intensity is correct on each domain separately, but the value reported in the summary table is twice as high
- Same for clock, runtime, CPI (but on a HW thread basis, so the deviation is even stronger with more threads)
LIKWID version: 5.2.2
Operating system: Ubuntu 22.04 LTS
Are you using the MarkerAPI (CPU code instrumentation) or the NvMarkerAPI (Nvidia GPU code instrumentation)?
- yes, but that does not matter

Suggestion

Generalize the formulas by which metrics are calculated and make them configurable as to how different entities (threads, sockets, ...) are handled. For example, operational intensity could be calculated as something like "sum(flops, all cores)/sum(traffic, all domains)". Clock could be "sum(cycles, all HW threads)/(time*noOfThreads)", CPI could be "sum(cycles, all HW threads)/(noOfThreads*sum(instructions, all HW threads))", etc. This will reduce hard-coded stuff but will make config files more complex.
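As a rough sketch of what such aggregate formulas compute, the Python snippet below implements the operational intensity and clock expressions suggested above. The function names, argument layout, and input values are assumptions made for illustration; none of this is LIKWID code or LIKWID's group file syntax.

```python
def operational_intensity(flops_per_core, bytes_per_domain):
    """sum(flops, all cores) / sum(traffic, all domains), in flop/byte."""
    return sum(flops_per_core) / sum(bytes_per_domain)


def clock_hz(cycles_per_hwthread, time_s):
    """sum(cycles, all HW threads) / (time * noOfThreads), in Hz."""
    return sum(cycles_per_hwthread) / (time_s * len(cycles_per_hwthread))


# Hypothetical inputs: four cores, two memory domains, two HW threads.
oi = operational_intensity(
    flops_per_core=[1.0e9, 1.0e9, 1.0e9, 1.0e9],
    bytes_per_domain=[16.0e9, 16.0e9],
)
clk = clock_hz(cycles_per_hwthread=[2.0e9, 3.0e9], time_s=1.0)

print("operational intensity [flop/byte]:", oi)
print("clock [Hz]:", clk)
```

Both results stay in the range of a single core or domain regardless of how many entities contribute, which is the behavior expected of an intensive metric.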

Thanks for your suggestion. I thought about it but it will not be in the upcoming 5.3 version.

While the internal calculator would already support functions like SUM(X,Y,Z) or MIN(X,Y,Z), the integration of data from other threads can be problematic, especially in the MarkerAPI, where each thread updates its own values. One has to synchronize the threads after the counter readings to ensure valid metric values.
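The synchronization requirement can be sketched with a generic barrier. The Python example below is not MarkerAPI code: the counter reading is replaced by a made-up value, and all names are hypothetical. It only shows that every thread must publish its own reading before any thread evaluates a metric that depends on the other threads' values.

```python
import threading

NUM_THREADS = 4
readings = [None] * NUM_THREADS  # one slot per thread, written only by its owner
results = [None] * NUM_THREADS   # cross-thread metric as seen by each thread
barrier = threading.Barrier(NUM_THREADS)


def worker(tid):
    # Stand-in for a thread reading its own counters (hypothetical value).
    readings[tid] = 100.0 * (tid + 1)
    # Wait until every thread has stored its reading.
    barrier.wait()
    # Only now is it safe to combine values from all threads.
    results[tid] = sum(readings)


threads = [threading.Thread(target=worker, args=(t,)) for t in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)
```

Without the barrier, a thread could evaluate the sum while other slots are still unset or hold values from a previous reading.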

In order to reduce the changes to the internal calculator, one could use a two-step approach. When creating the internal group structure, we could expand the proposed syntax SUM(<countername>, <topological-info>) to SUM(<countername>_<hw0>, <countername>_<hw1>, ...) with <hw*> being the responsible HW threads for the topological level. This way, we can still use the internal calculator for the final calculation. Of course, it still increases the work in each metric evaluation because we would need to fill the variables map (countername -> value) with the values of all HW threads. On modern systems with hundreds of HW threads, this will cause quite some overhead.
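A minimal sketch of this two-step idea in Python follows. The topology table, the level names (S0, S1, N), the regular expression, and the use of eval as a stand-in for the internal calculator are all assumptions for illustration and do not reflect LIKWID's actual implementation.

```python
import re

# Hypothetical topology: level name -> responsible HW thread IDs.
TOPOLOGY = {"S0": [0, 1], "S1": [2, 3], "N": [0, 1, 2, 3]}

# Matches e.g. SUM(PMC0, S0): function, counter name, topological level.
PATTERN = re.compile(r"(SUM|MIN|MAX)\(\s*(\w+)\s*,\s*(\w+)\s*\)")


def expand(formula, topology):
    """Step 1: rewrite FUNC(counter, level) to FUNC(counter_hw0, counter_hw1, ...)."""
    def repl(match):
        func, counter, level = match.groups()
        names = ", ".join(f"{counter}_{hw}" for hw in topology[level])
        return f"{func}({names})"
    return PATTERN.sub(repl, formula)


def evaluate(expanded, values):
    """Step 2: evaluate with a variables map (stand-in for the calculator)."""
    funcs = {
        "SUM": lambda *a: sum(a),
        "MIN": lambda *a: min(a),
        "MAX": lambda *a: max(a),
    }
    return eval(expanded, {"__builtins__": {}}, {**funcs, **values})


expanded = expand("SUM(PMC0, S0) / SUM(PMC1, S0)", TOPOLOGY)
# The variables map must hold the values of all involved HW threads.
values = {"PMC0_0": 10.0, "PMC0_1": 30.0, "PMC1_0": 4.0, "PMC1_1": 4.0}
result = evaluate(expanded, values)

print(expanded)
print(result)
```

The size of the variables map grows with the number of HW threads in the chosen level, which is where the per-evaluation overhead mentioned above comes from.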

Moreover, it does not change the way the statistics table is calculated, and it is questionable whether that table would still be required at all. All threads would have the same CPI, Clock, etc. Calculating min, max, and mean does not make sense for those, or one has to magically transform SUM(cycles, all HW threads) to, e.g., MIN(cycles, all HW threads) and re-calculate for the statistics table.

Summary table of likwid-perfctr shows incorrect values for "intensive" metrics #539