Closed dimasique1 closed 1 year ago
Thanks for your request. Your text is quite short, so it is hard to tell what you actually want. As you mention, it is already supported. The problem is probably that the architecture identifiers are not yet registered in LIKWID.
Take a look at the Adding ARM chips wiki page to see which information is required. You can try it yourself (PRs welcome) or send the required information.
https://github.com/RRZE-HPC/likwid/wiki/AddARMSupport#add-hardware-topology-information
https://github.com/RRZE-HPC/likwid/wiki/AddARMSupport#registering-chip-in-performance-monitoring-module
I know this thread has not been active for a while, but in case it is helpful: the current master branch works on my Cortex A72 system and includes some small changes concerning ARM processors. So, if you haven't already, try building from its current state.
Also, are you sure you are actually running in ARMv8 mode? I'm asking because I also ran into some issues while trying to build for the Raspberry Pi 4B (Cortex A72), since I did not know then that it actually runs in ARMv7 mode if you have the default 32-bit OS installed. You can check which mode you are running in with the uname -m command. It should report aarch64 if you are running an ARMv8 system. As far as I know, the perf_event API only supports the Cortex A72 in ARMv8 mode, so LIKWID won't be able to read hardware events otherwise.
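As a minimal sketch of the check described above, the helper below maps the machine string printed by uname -m to the execution mode. The function name arm_mode is hypothetical, and the list of 32-bit strings (armv6l, armv7l, armv8l) is an assumption about what common kernels report, not something LIKWID defines.

```shell
#!/bin/sh
# arm_mode is a hypothetical helper, not part of LIKWID.
# It classifies a machine string as reported by `uname -m`.
arm_mode() {
    case "$1" in
        aarch64|arm64)        echo "ARMv8 (64-bit)" ;;
        armv6*|armv7*|armv8l) echo "32-bit ARM" ;;      # assumed 32-bit names
        *)                    echo "not ARM or unknown" ;;
    esac
}

# Classify the system this script runs on.
arm_mode "$(uname -m)"
```

On the Raspberry Pi 4 from this thread, the 64-bit case is the one needed for the CPU to be detected as a Cortex A72.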
Thanks @OoJJBoO for your comment. Yes, the ARM A72 is already supported but it seems his system is not detected as A72.
I was able to do measurements on an A72 in ARMv7 mode. LIKWID detects it as A53 then.
You are of course right @TomTheBear, I didn't remember that. But "native" support, by which I mean that the CPU is detected as the correct one, still only works with a 64-bit OS, or at least with the 64-bit boot flag set in the case of the 32-bit Raspberry Pi OS. In 32-bit mode, the CPU-specific event directories that LIKWID uses are not present, only some generic ARMv7(L) ones.
Still, as you said, counters should be readable when falling back to an ARMv7 build.
Hi @TomTheBear and others, I am trying to install LIKWID on my Raspberry Pi 4 Model B, which, as @OoJJBoO stated, is based on the ARM Cortex A72 processor. I get several errors when I run make from the likwid directory, and I am hoping you can help me resolve them. Thanks in advance for your help.
This is what I did: I downloaded the source code with the gh command, changed into the likwid directory, and ran make.
$ gh repo clone RRZE-HPC/likwid
$ cd likwid/
$ make
This is the output:
===> GENERATE HEADER GCC/perfmon_a15_events.h
===> GENERATE HEADER GCC/perfmon_a57_events.h
===> GENERATE HEADER GCC/perfmon_a64fx_events.h
===> GENERATE HEADER GCC/perfmon_atom_events.h
===> GENERATE HEADER GCC/perfmon_broadwellEP_events.h
===> GENERATE HEADER GCC/perfmon_broadwell_events.h
===> GENERATE HEADER GCC/perfmon_broadwelld_events.h
===> GENERATE HEADER GCC/perfmon_cascadelakeX_events.h
===> GENERATE HEADER GCC/perfmon_cavtx2_events.h
===> GENERATE HEADER GCC/perfmon_core2_events.h
===> GENERATE HEADER GCC/perfmon_goldmont_events.h
===> GENERATE HEADER GCC/perfmon_haswellEP_events.h
===> GENERATE HEADER GCC/perfmon_haswell_events.h
===> GENERATE HEADER GCC/perfmon_icelakeX_events.h
===> GENERATE HEADER GCC/perfmon_icelake_events.h
===> GENERATE HEADER GCC/perfmon_interlagos_events.h
===> GENERATE HEADER GCC/perfmon_ivybridgeEP_events.h
===> GENERATE HEADER GCC/perfmon_ivybridge_events.h
===> GENERATE HEADER GCC/perfmon_k10_events.h
===> GENERATE HEADER GCC/perfmon_k8_events.h
===> GENERATE HEADER GCC/perfmon_kabini_events.h
===> GENERATE HEADER GCC/perfmon_knl_events.h
===> GENERATE HEADER GCC/perfmon_nehalemEX_events.h
===> GENERATE HEADER GCC/perfmon_nehalem_events.h
===> GENERATE HEADER GCC/perfmon_neon1_events.h
===> GENERATE HEADER GCC/perfmon_p6_events.h
===> GENERATE HEADER GCC/perfmon_phi_events.h
===> GENERATE HEADER GCC/perfmon_pm_events.h
===> GENERATE HEADER GCC/perfmon_power8_events.h
===> GENERATE HEADER GCC/perfmon_power9_events.h
===> GENERATE HEADER GCC/perfmon_sandybridgeEP_events.h
===> GENERATE HEADER GCC/perfmon_sandybridge_events.h
===> GENERATE HEADER GCC/perfmon_silvermont_events.h
===> GENERATE HEADER GCC/perfmon_skylakeX_events.h
===> GENERATE HEADER GCC/perfmon_skylake_events.h
===> GENERATE HEADER GCC/perfmon_tigerlake_events.h
===> GENERATE HEADER GCC/perfmon_westmereEX_events.h
===> GENERATE HEADER GCC/perfmon_westmere_events.h
===> GENERATE HEADER GCC/perfmon_zen2_events.h
===> GENERATE HEADER GCC/perfmon_zen3_events.h
===> GENERATE HEADER GCC/perfmon_zen4_events.h
===> GENERATE HEADER GCC/perfmon_zen_events.h
===> COMPILE GCC/access.o
===> COMPILE GCC/access_client.o
===> COMPILE GCC/access_x86.o
===> COMPILE GCC/access_x86_clientmem.o
===> COMPILE GCC/access_x86_mmio.o
===> COMPILE GCC/access_x86_msr.o
===> COMPILE GCC/access_x86_pci.o
===> COMPILE GCC/access_x86_rdpmc.o
In function ‘__rdpmc’,
inlined from ‘test_rdpmc.constprop’ at /home/egallegos/likwid/src/access_x86_rdpmc.c:123:13:
/home/egallegos/likwid/src/access_x86_rdpmc.c:77:5: error: impossible constraint in ‘asm’
77 | asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (counter));
| ^~~
/home/egallegos/likwid/src/access_x86_rdpmc.c:77:5: error: impossible constraint in ‘asm’
===> COMPILE GCC/affinity.o
===> COMPILE GCC/bitUtil.o
===> COMPILE GCC/bstrlib.o
===> COMPILE GCC/bstrlib_helper.o
===> COMPILE GCC/calculator.o
===> COMPILE GCC/calculator_stack.o
===> COMPILE GCC/configuration.o
===> COMPILE GCC/cpuFeatures.o
===> COMPILE GCC/cpustring.o
===> COMPILE GCC/frequency_cpu.o
===> COMPILE GCC/frequency_uncore.o
===> COMPILE GCC/ghash.o
===> COMPILE GCC/hashTable.o
===> COMPILE GCC/hwFeatures.o
===> COMPILE GCC/libperfctr.o
===> COMPILE GCC/luawid.o
===> COMPILE GCC/map.o
===> COMPILE GCC/memsweep.o
===> COMPILE GCC/numa.o
===> COMPILE GCC/numa_hwloc.o
===> COMPILE GCC/numa_proc.o
===> COMPILE GCC/numa_virtual.o
===> COMPILE GCC/pci_hwloc.o
===> COMPILE GCC/pci_proc.o
===> COMPILE GCC/perfgroup.o
===> COMPILE GCC/perfmon.o
===> COMPILE GCC/power.o
===> COMPILE GCC/thermal.o
===> COMPILE GCC/timer.o
===> COMPILE GCC/topology.o
===> COMPILE GCC/topology_cpuid.o
/home/egallegos/likwid/src/topology_cpuid.c: In function ‘intelCpuidFunc_4’:
/home/egallegos/likwid/src/topology_cpuid.c:75:9: warning: implicit declaration of function ‘CPUID’ [-Wimplicit-function-declaration]
75 | CPUID(eax, ebx, ecx, edx);
| ^~~~~
===> COMPILE GCC/topology_hwloc.o
===> COMPILE GCC/topology_proc.o
===> COMPILE GCC/tree.o
===> COMPILE GCC/voltage.o
===> COMPILE GCC/loadData.o
/home/egallegos/likwid/src/loadData.S: Assembler messages:
/home/egallegos/likwid/src/loadData.S:1: Error: unknown pseudo-op: `.intel_syntax'
===> ENTER /home/egallegos/likwid/ext/hwloc
In file included from ./hwloc/topology-x86.c:22:
./include/private/cpuid-x86.h: In function ‘likwid_hwloc_x86_cpuid’:
./include/private/cpuid-x86.h:81:2: error: #error unknown architecture
81 | #error unknown architecture
| ^~~~~
In file included from ./include/hwloc.h:66,
from ./hwloc/topology-x86.c:18:
./hwloc/topology-x86.c: In function ‘hwloc_look_x86’:
./include/hwloc/autogen/config.h:219:26: warning: implicit declaration of function ‘likwid_hwloc_have_x86_cpuid’; did you mean ‘likwid_hwloc_x86_cpuid’? [-Wimplicit-function-declaration]
219 | #define HWLOC_SYMPREFIX likwid
| ^~~
./include/hwloc/rename.h:29:33: note: in definition of macro ‘HWLOC_MUNGE_NAME2’
29 | #define HWLOC_MUNGE_NAME2(a, b) a ## b
| ^
./include/hwloc/rename.h:30:26: note: in expansion of macro ‘HWLOC_MUNGE_NAME’
30 | #define HWLOC_NAME(name) HWLOC_MUNGE_NAME(HWLOC_SYMPREFIX, hwloc ## name)
| ^~~~
./include/hwloc/rename.h:30:43: note: in expansion of macro ‘HWLOC_SYM_PREFIX’
30 | #define HWLOC_NAME(name) HWLOC_MUNGE_NAME(HWLOC_SYMPREFIX, hwloc ## name)
| ^~~~
./include/hwloc/rename.h:633:30: note: in expansion of macro ‘HWLOC_NAME’
633 | #define hwloc_have_x86_cpuid HWLOC_NAME(have_x86_cpuid)
| ^~~~~~
./hwloc/topology-x86.c:1404:26: note: in expansion of macro ‘hwloc_have_x86_cpuid’
1404 | if (!src_cpuiddump && !hwloc_have_x86_cpuid())
| ^~~~~~~~
make[1]: [Makefile:74: GCC/topology-x86.o] Error 1
make: [Makefile:288: /home/egallegos/likwid/ext/hwloc/liblikwid-hwloc.so] Error 2
Here is some information about my system:
$ uname -m
aarch64
$ cat /proc/cpuinfo
processor : 0
BogoMIPS : 108.00
Features : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd08
CPU revision : 3

processor : 1
BogoMIPS : 108.00
Features : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd08
CPU revision : 3

processor : 2
BogoMIPS : 108.00
Features : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd08
CPU revision : 3

processor : 3
BogoMIPS : 108.00
Features : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd08
CPU revision : 3

Hardware : BCM2835
Revision : c03111
Serial : 10000000d74b7af6
Model : Raspberry Pi 4 Model B Rev 1.1
Switch COMPILER in config.mk to GCCARMv8. Afterwards, run make distclean && make.
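The change can also be scripted. The sketch below assumes that config.mk contains a single line of the form COMPILER = <name>, possibly followed by a trailing # comment; set_compiler is a hypothetical helper, and editing the file by hand works just as well.

```shell
#!/bin/sh
# set_compiler is a hypothetical helper, not part of LIKWID.
# It rewrites only the value on the COMPILER line of the given file
# and leaves any trailing comment and all other lines untouched.
set_compiler() {
    file="$1"
    value="$2"
    sed "s/^\(COMPILER[[:space:]]*=[[:space:]]*\)[^#[:space:]]*/\1${value}/" \
        "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Usage, from the top of the LIKWID source tree:
#   set_compiler config.mk GCCARMv8
#   make distclean && make
```

Running make distclean first matters because objects from the failed x86 build attempt would otherwise be reused.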
Hi @TomTheBear, thank you very much for your help. I followed your instructions and the build worked. I tried to test the installation by running the CloverLeaf mini-app and ran into a problem. It complained that perf_event_paranoid was set to 4, see below:
$ sudo likwid-perfctr -C 0-87 -g MEM_DP ./clover_leaf
[sudo] password for egallegos:
Cannot use performance monitoring with perf_event_paranoid = 4
I fixed this issue with the following command:
sudo sysctl -w kernel.perf_event_paranoid=2
I tried with 3 and it did not work, but it worked with 2. With that setting, I get the following output:
CPU name: BCM2835
CPU type: ARM Cortex A72
CPU clock: 0.00 GHz
ERROR - [/home/egallegos/likwid/src/perfgroup.c:perfgroup_readGroup:858] No such file or directory.
Cannot read group file MEM_DP.txt. Searched in /usr/local/share/likwid/perfgroups/arm8/MEM_DP.txt and /root/.likwid/groups/arm8/MEM_DP.txt
ERROR - [/home/egallegos/likwid/src/perfmon.c:perfmon_addEventSet:2229] No such file or directory.
Access to performance group MEM_DP not allowed
Any suggestions on how to fix this problem would be much appreciated. Thanks.
This is expected behavior; search for perf_event_paranoid on the perf_event_open manpage. In short: lower value -> more permissions for users. LIKWID requires a value of 2 or lower to run core-local counters. For uncore counters (like memory controllers), you need a value of 0 or lower.
BUT: The A72 architecture does not provide enough events to set up a MEM_DP group. All ARM chips provide a basic set of events. This set can be extended by the chip vendors. The basic set contains no reliable FP events and no useful memory events. And as far as I remember, Broadcom did not extend the set for the BCM2835.
There are reasonably named events MEM_ACCESS_LD and MEM_ACCESS_ST, but they are not reliable. Measure MEM_ACCESS_LD:PMC0,MEM_ACCESS_ST:PMC1 and compare to LD_SPEC:PMC0,ST_SPEC:PMC1. If the counts match, the MEM_ACCESS* events are wired to "loads/stores to the L1 cache". They might also be wired to some other load and store event (there are a few), but as far as I remember, the MEM_ACCESS* events did not count only memory accesses.
I added some documentation about the perf_event_paranoid settings and LIKWID: https://github.com/RRZE-HPC/likwid/wiki/TutorialLikwidPerf#how-is-counter-access-controlled
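The perf_event_paranoid thresholds from this reply can be written down as a small check. This is only a sketch of the rule of thumb stated here (2 or lower for core-local counters, 0 or lower for uncore counters); likwid_counter_access is a hypothetical helper, not a LIKWID command, and the wiki page linked above remains the authoritative description.

```shell
#!/bin/sh
# likwid_counter_access is a hypothetical helper, not part of LIKWID.
# It maps a perf_event_paranoid value to the counter access described
# in this thread: lower values grant more permissions.
likwid_counter_access() {
    level="$1"
    if [ "$level" -le 0 ]; then
        echo "core-local and uncore counters"
    elif [ "$level" -le 2 ]; then
        echo "core-local counters only"
    else
        echo "no counter access"
    fi
}

# Report the state of the running kernel if the setting is exposed.
f=/proc/sys/kernel/perf_event_paranoid
if [ -r "$f" ]; then
    likwid_counter_access "$(cat "$f")"
fi
```

This matches what was observed on the Raspberry Pi: values of 4 and 3 were rejected, while 2 allowed the core-local events to be read.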
Thank you very much for the explanation and the links to the documentation; that is very helpful. I ran the suggested test. The results are close, but I am not sure whether that means the loads/stores are wired to the L1 cache. Below are the test and its output.
egallegos@luna:~/hpc/CloverLeaf/CloverLeaf_Serial$ sudo sysctl -w kernel.perf_event_paranoid=2
kernel.perf_event_paranoid = 2
egallegos@luna:~/hpc/CloverLeaf/CloverLeaf_Serial$ likwid-pin -c S0:3 -p
3
egallegos@luna:~/hpc/CloverLeaf/CloverLeaf_Serial$ sudo likwid-perfctr -C 0-10 -g MEM_ACCESS_LD:PMC0,MEM_ACCESS_ST:PMC1 ./clover_leaf
--------------------------------------------------------------------------------
CPU name: BCM2835
CPU type: ARM Cortex A72
CPU clock: 0.00 GHz
--------------------------------------------------------------------------------
Clover Version 1.300
MPI Version
Task Count 1
Clover Version 1.300
MPI Version
Task Count 1
Output file clover.out opened. All output will go there.
--------------------------------------------------------------------------------
Group 1: Custom
+---------------------+---------+--------------+--------------+--------------+--------------+
| Event | Counter | HWThread 0 | HWThread 1 | HWThread 2 | HWThread 3 |
+---------------------+---------+--------------+--------------+--------------+--------------+
| Runtime (RDTSC) [s] | TSC | 7.903841e+00 | 7.903841e+00 | 7.903841e+00 | 7.903841e+00 |
| MEM_ACCESS_LD | PMC0 | 0 | 0 | 79 | 568088 |
| MEM_ACCESS_ST | PMC1 | 0 | 0 | 59 | 59582307 |
+---------------------+---------+--------------+--------------+--------------+--------------+
+--------------------------+---------+----------+--------+----------+--------------+
| Event | Counter | Sum | Min | Max | Avg |
+--------------------------+---------+----------+--------+----------+--------------+
| Runtime (RDTSC) [s] STAT | TSC | 31.6154 | 7.9038 | 7.9038 | 7.9038 |
| MEM_ACCESS_LD STAT | PMC0 | 568167 | 0 | 568088 | 142041.7500 |
| MEM_ACCESS_ST STAT | PMC1 | 59582366 | 0 | 59582307 | 1.489559e+07 |
+--------------------------+---------+----------+--------+----------+--------------+
egallegos@luna:~/hpc/CloverLeaf/CloverLeaf_Serial$ sudo likwid-perfctr -C 0-10 -g LD_SPEC:PMC0,ST_SPEC:PMC1 ./clover_leaf
--------------------------------------------------------------------------------
CPU name: BCM2835
CPU type: ARM Cortex A72
CPU clock: 0.00 GHz
--------------------------------------------------------------------------------
Clover Version 1.300
MPI Version
Task Count 1
Clover Version 1.300
MPI Version
Task Count 1
Output file clover.out opened. All output will go there.
--------------------------------------------------------------------------------
Group 1: Custom
+---------------------+---------+--------------+--------------+--------------+--------------+
| Event | Counter | HWThread 0 | HWThread 1 | HWThread 2 | HWThread 3 |
+---------------------+---------+--------------+--------------+--------------+--------------+
| Runtime (RDTSC) [s] | TSC | 7.836951e+00 | 7.836951e+00 | 7.836951e+00 | 7.836951e+00 |
| LD_SPEC | PMC0 | 0 | 8 | 0 | 619553 |
| ST_SPEC | PMC1 | 0 | 1 | 0 | 75080645 |
+---------------------+---------+--------------+--------------+--------------+--------------+
+--------------------------+---------+----------+--------+----------+--------------+
| Event | Counter | Sum | Min | Max | Avg |
+--------------------------+---------+----------+--------+----------+--------------+
| Runtime (RDTSC) [s] STAT | TSC | 31.3478 | 7.8370 | 7.8370 | 7.8370 |
| LD_SPEC STAT | PMC0 | 619561 | 0 | 619553 | 154890.2500 |
| ST_SPEC STAT | PMC1 | 75080646 | 0 | 75080645 | 1.877016e+07 |
+--------------------------+---------+----------+--------+----------+--------------+
egallegos@luna:~/hpc/CloverLeaf/CloverLeaf_Serial$
I also ran other basic tests, which indicate that the installation is good and detects the correct CPU.
egallegos@luna:~$ likwid-topology
--------------------------------------------------------------------------------
CPU name: BCM2835
CPU type: ARM Cortex A72
CPU stepping: 3
********************************************************************************
Hardware Thread Topology
********************************************************************************
Sockets: 1
Cores per socket: 4
Threads per core: 1
--------------------------------------------------------------------------------
HWThread Thread Core Die Socket Available
0 0 0 0 0 *
1 0 1 0 0 *
2 0 2 0 0 *
3 0 3 0 0 *
--------------------------------------------------------------------------------
Socket 0: ( 0 1 2 3 )
--------------------------------------------------------------------------------
********************************************************************************
Cache Topology
********************************************************************************
Level: 1
Size: 32 kB
Cache groups: ( 0 ) ( 1 ) ( 2 ) ( 3 )
--------------------------------------------------------------------------------
Level: 2
Size: 1 MB
Cache groups: ( 0 1 2 3 )
--------------------------------------------------------------------------------
********************************************************************************
NUMA Topology
********************************************************************************
NUMA domains: 1
--------------------------------------------------------------------------------
Domain: 0
Processors: ( 0 1 2 3 )
Distances: 10
Free memory: 3477.34 MB
Total memory: 3789.41 MB
--------------------------------------------------------------------------------
egallegos@luna:~$ likwid-mpirun -omp gnu -n 4 ./helloworld-mpi
Hello World! I am processor luna, rank 0 of 4 processors
Hello World! I am processor luna, rank 1 of 4 processors
Hello World! I am processor luna, rank 2 of 4 processors
Hello World! I am processor luna, rank 3 of 4 processors
egallegos@luna:~$
Thank you again for your support.
Please use a benchmark where you can control the numbers, like likwid-bench.
likwid-perfctr -C 0 -g MEM_ACCESS_LD:PMC0,LD_SPEC:PMC1,MEM_ACCESS_ST:PMC2,ST_SPEC:PMC3 -m likwid-bench -t copy -W N:20kB:1
This test runs completely in the L1 cache (dataset size 20kB). If the MEM_ACCESS* counts still go up in this test, they are not reliable. The *_SPEC events might be a little higher than expected because they count speculatively executed loads/stores, not retired ones.
Based on your results: 90% of loads go into memory? 80% of writes go into memory?
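The percentages in the question above can be reproduced from the summed counts of the two CloverLeaf runs. The sketch below uses a hypothetical helper named percent_of; since the MEM_ACCESS* and *_SPEC pairs were measured in separate runs, the ratios are only approximate.

```shell
#!/bin/sh
# percent_of is a hypothetical helper: it prints part/whole as a
# percentage with one decimal place.
percent_of() {
    awk -v part="$1" -v whole="$2" \
        'BEGIN { printf "%.1f\n", 100 * part / whole }'
}

# Sums over all hardware threads, copied from the tables above.
percent_of 568167 619561       # MEM_ACCESS_LD relative to LD_SPEC
percent_of 59582366 75080646   # MEM_ACCESS_ST relative to ST_SPEC
```

Read naively, these ratios would mean that most loads and stores reach main memory, which is why a controlled benchmark is the better test.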
The results indicate that all the loads and writes go into memory. Here are the results:
egallegos@luna:~/hpc/CloverLeaf/CloverLeaf_Serial$ likwid-perfctr -C 0-87 -g MEM_ACCESS_LD:PMC0,LD_SPEC:PMC1,MEM_ACCESS_ST:PMC2,ST_SPEC:PMC3 -m likwid-bench -t copy -W N:20kB:1 ./clover_leaf
--------------------------------------------------------------------------------
CPU name: BCM2835
CPU type: ARM Cortex A72
CPU clock: 0.00 GHz
--------------------------------------------------------------------------------
Allocate: Process running on hwthread 0 (Domain N) - Vector length 1250/10000 Offset 0 Alignment 512
Allocate: Process running on hwthread 0 (Domain N) - Vector length 1250/10000 Offset 0 Alignment 512
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: copy
--------------------------------------------------------------------------------
Using 1 work groups
Using 1 threads
--------------------------------------------------------------------------------
Using Likwid Marker API
--------------------------------------------------------------------------------
Group: 0 Thread 0 Global Thread 0 running on hwthread 0 - Vector length 1250 Offset 0
--------------------------------------------------------------------------------
Cycles: 1780788
CPU Clock: 917356
Cycle Clock: 0
Time: 1.815472e+00 sec
Iterations: 1048576
Iterations per thread: 1048576
Inner loop executions: 1250
Size (Byte): 20000
Size per thread: 20000
Number of Flops: 0
MFlops/s: 0.00
Data volume (Byte): 20971520000
MByte/s: 11551.55
Cycles per update: 0.001359
Cycles per cacheline: 0.010869
Loads per update: 1
Stores per update: 1
Load bytes per element: 8
Store bytes per elem.: 8
Load/store ratio: 1.00
Instructions: 14417920016
UOPs: 10485760000
--------------------------------------------------------------------------------
Writing Likwid Marker API results to file /tmp/likwid_1824.txt
--------------------------------------------------------------------------------
Region bench, Group 1: Custom
+-------------------+------------+
| Region Info | HWThread 0 |
+-------------------+------------+
| RDTSC Runtime [s] | 1.780789 |
| call count | 1 |
+-------------------+------------+
+---------------------+---------+--------------+
| Event | Counter | HWThread 0 |
+---------------------+---------+--------------+
| Runtime (RDTSC) [s] | TSC | 1.780789e+00 |
| MEM_ACCESS_LD | PMC0 | 1324355000 |
| LD_SPEC | PMC1 | 1328546000 |
| MEM_ACCESS_ST | PMC2 | 1330633000 |
| ST_SPEC | PMC3 | 1325399000 |
+---------------------+---------+--------------+
egallegos@luna:~/hpc/CloverLeaf/CloverLeaf_Serial$
This was my expectation and proves my point. The benchmark uses a dataset size of only 20kB, so there will be no memory traffic (after the initial fetch of the 20kB). All data should stay in the L1 cache of the single core. But the MEM_ACCESS* events count in the same fashion as the LD/ST_SPEC events. This means the MEM_ACCESS* events do not reliably count actual memory accesses but something else that is in line with loads and stores.
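The same conclusion can be checked with a little arithmetic on the likwid-bench output. The sketch below assumes, as the benchmark reports, one load and one store per update, so the expected count is iterations times inner loop executions; ratio_to_expected is a hypothetical helper.

```shell
#!/bin/sh
# Values copied from the likwid-bench output above.
iterations=1048576
inner_loop=1250
# Assumption: one load and one store per update, as the output reports.
expected=$((iterations * inner_loop))

# ratio_to_expected is a hypothetical helper: measured / expected count.
ratio_to_expected() {
    awk -v m="$1" -v e="$expected" 'BEGIN { printf "%.3f\n", m / e }'
}

echo "expected loads (and stores): $expected"
ratio_to_expected 1324355000   # MEM_ACCESS_LD
ratio_to_expected 1328546000   # LD_SPEC
ratio_to_expected 1330633000   # MEM_ACCESS_ST
ratio_to_expected 1325399000   # ST_SPEC
```

All four measured counts land within roughly 1 to 2 percent of the expected number of executed loads or stores, even though the working set fits in L1, so the MEM_ACCESS* events track executed loads/stores rather than memory traffic.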
Why do you need support for this specific architecture? Those are modern ARM CPUs
Which architecture model, family and further information? CPU or accelerator? Cortex-A72, Cortex-A78
Is the documentation of the hardware counters publicly available? Yes it is
Are there already any usable tools (commercial or open-source)? perf
Even though ARMv8 is supported in LIKWID, I see the following error on my A72:
likwid-perfctr -e