RRZE-HPC / likwid

Performance monitoring and benchmarking suite
https://hpc.fau.de/research/tools/likwid/
GNU General Public License v3.0

Add Cortex-A72/78 support #486

Closed dimasique1 closed 1 year ago

dimasique1 commented 2 years ago

Why do you need support for this specific architecture?
Those are modern ARM CPUs

Which architecture model, family and further information? CPU or accelerator?
Cortex-A72, Cortex-A78

Is the documentation of the hardware counters publicly available?
Yes it is

Are there already any usable tools (commercial or open-source)?
perf

Even though ARMv8 is supported in LIKWID, I see the following error on my A72:

likwid-perfctr -e

This architecture has 0 counters. Counter tags(name, type<, options>):

This architecture has 0 events. Event tags (tag, id, umask, counters<, options>):

TomTheBear commented 2 years ago

Thanks for your request. Your text is quite short, so it is hard to tell what you actually want. As you mention, ARMv8 is already supported. The problem is probably that the architecture identifiers of your chip are not yet registered in LIKWID.

Take a look at the Adding ARM chips wiki page to see which information is required. You can try yourself (PR welcome) or send the required information.

https://github.com/RRZE-HPC/likwid/wiki/AddARMSupport#add-hardware-topology-information
https://github.com/RRZE-HPC/likwid/wiki/AddARMSupport#registering-chip-in-performance-monitoring-module

OoJJBoO commented 2 years ago

I know this thread was not active for a while, but in case this might be helpful, the current master branch does work on my Cortex A72 system and does include some small changes concerning ARM processors. So, if you didn't already, maybe try building from its state.

Also, are you sure you are actually running in ARMv8 mode? I'm asking because I also ran into some issues while trying to build for the Raspberry Pi 4B (Cortex A72), since I did not know then that it actually runs in ARMv7 mode if you have the default 32-bit OS installed. You can check which mode you are running in with the uname -m command. It should report aarch64 if you are running an ARMv8 system. As far as I know, the perf_event API only supports the Cortex A72 in ARMv8 mode, so LIKWID won't be able to read hardware events if that is not the case.
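The uname -m check described above can also be scripted. The following is a minimal sketch, not part of LIKWID: the helper name arm_mode is made up for illustration, and it only classifies the machine string that uname -m (or Python's platform.machine()) reports.

```python
import platform


def arm_mode(machine: str) -> str:
    """Classify a `uname -m` style string (hypothetical helper, not a LIKWID API)."""
    m = machine.lower()
    if m in ("aarch64", "arm64"):
        return "ARMv8 (64-bit)"
    if m.startswith("arm"):
        # e.g. armv7l, which a 32-bit OS reports even on a Cortex A72
        return "32-bit ARM"
    return "not ARM"


# Report the mode of the machine this script runs on.
print(platform.machine(), "->", arm_mode(platform.machine()))
```

If this prints "32-bit ARM" on a Raspberry Pi 4B, the system is running in the ARMv7 mode discussed above.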

TomTheBear commented 2 years ago

Thanks @OoJJBoO for your comment. Yes, the ARM A72 is already supported but it seems his system is not detected as A72.

I was able to do measurements on an A72 in ARMv7 mode. LIKWID detects it as A53 then.

OoJJBoO commented 2 years ago

You are of course right @TomTheBear, I didn't remember that. But "native" support, by which I mean that the CPU gets detected as the correct one, still only works with a 64-bit OS, or at least with the 64-bit boot flag set in case of the 32-bit Raspberry Pi OS. In 32-bit mode the CPU-specific event directories that LIKWID uses are not present, only some generic ARMv7(L) ones.

Still, like you said, counters should be readable when falling back to an ARMv7 build.

gallegos001 commented 1 year ago

Hi @TomTheBear and Others, I am trying to install Likwid on my Raspberry Pi Model 4B, which as @OoJJBoO stated is based on the ARM Cortex A72 processor. I get several errors when I run the make from the likwid directory and I am hoping you can help me to resolve them. Thanks in advance for your help.

This is what I did: I downloaded the source code with the gh command, changed into the likwid directory, and ran make.

$ gh repo clone RRZE-HPC/likwid
$ cd likwid/
$ make

This is the output:

===> GENERATE HEADER GCC/perfmon_a15_events.h
===> GENERATE HEADER GCC/perfmon_a57_events.h
===> GENERATE HEADER GCC/perfmon_a64fx_events.h
===> GENERATE HEADER GCC/perfmon_atom_events.h
===> GENERATE HEADER GCC/perfmon_broadwellEP_events.h
===> GENERATE HEADER GCC/perfmon_broadwell_events.h
===> GENERATE HEADER GCC/perfmon_broadwelld_events.h
===> GENERATE HEADER GCC/perfmon_cascadelakeX_events.h
===> GENERATE HEADER GCC/perfmon_cavtx2_events.h
===> GENERATE HEADER GCC/perfmon_core2_events.h
===> GENERATE HEADER GCC/perfmon_goldmont_events.h
===> GENERATE HEADER GCC/perfmon_haswellEP_events.h
===> GENERATE HEADER GCC/perfmon_haswell_events.h
===> GENERATE HEADER GCC/perfmon_icelakeX_events.h
===> GENERATE HEADER GCC/perfmon_icelake_events.h
===> GENERATE HEADER GCC/perfmon_interlagos_events.h
===> GENERATE HEADER GCC/perfmon_ivybridgeEP_events.h
===> GENERATE HEADER GCC/perfmon_ivybridge_events.h
===> GENERATE HEADER GCC/perfmon_k10_events.h
===> GENERATE HEADER GCC/perfmon_k8_events.h
===> GENERATE HEADER GCC/perfmon_kabini_events.h
===> GENERATE HEADER GCC/perfmon_knl_events.h
===> GENERATE HEADER GCC/perfmon_nehalemEX_events.h
===> GENERATE HEADER GCC/perfmon_nehalem_events.h
===> GENERATE HEADER GCC/perfmon_neon1_events.h
===> GENERATE HEADER GCC/perfmon_p6_events.h
===> GENERATE HEADER GCC/perfmon_phi_events.h
===> GENERATE HEADER GCC/perfmon_pm_events.h
===> GENERATE HEADER GCC/perfmon_power8_events.h
===> GENERATE HEADER GCC/perfmon_power9_events.h
===> GENERATE HEADER GCC/perfmon_sandybridgeEP_events.h
===> GENERATE HEADER GCC/perfmon_sandybridge_events.h
===> GENERATE HEADER GCC/perfmon_silvermont_events.h
===> GENERATE HEADER GCC/perfmon_skylakeX_events.h
===> GENERATE HEADER GCC/perfmon_skylake_events.h
===> GENERATE HEADER GCC/perfmon_tigerlake_events.h
===> GENERATE HEADER GCC/perfmon_westmereEX_events.h
===> GENERATE HEADER GCC/perfmon_westmere_events.h
===> GENERATE HEADER GCC/perfmon_zen2_events.h
===> GENERATE HEADER GCC/perfmon_zen3_events.h
===> GENERATE HEADER GCC/perfmon_zen4_events.h
===> GENERATE HEADER GCC/perfmon_zen_events.h
===> COMPILE GCC/access.o
===> COMPILE GCC/access_client.o
===> COMPILE GCC/access_x86.o
===> COMPILE GCC/access_x86_clientmem.o
===> COMPILE GCC/access_x86_mmio.o
===> COMPILE GCC/access_x86_msr.o
===> COMPILE GCC/access_x86_pci.o
===> COMPILE GCC/access_x86_rdpmc.o
In function ‘__rdpmc’,
    inlined from ‘test_rdpmc.constprop’ at /home/egallegos/likwid/src/access_x86_rdpmc.c:123:13:
/home/egallegos/likwid/src/access_x86_rdpmc.c:77:5: error: impossible constraint in ‘asm’
   77 | asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (counter));
      | ^~~
/home/egallegos/likwid/src/access_x86_rdpmc.c:77:5: error: impossible constraint in ‘asm’
===> COMPILE GCC/affinity.o
===> COMPILE GCC/bitUtil.o
===> COMPILE GCC/bstrlib.o
===> COMPILE GCC/bstrlib_helper.o
===> COMPILE GCC/calculator.o
===> COMPILE GCC/calculator_stack.o
===> COMPILE GCC/configuration.o
===> COMPILE GCC/cpuFeatures.o
===> COMPILE GCC/cpustring.o
===> COMPILE GCC/frequency_cpu.o
===> COMPILE GCC/frequency_uncore.o
===> COMPILE GCC/ghash.o
===> COMPILE GCC/hashTable.o
===> COMPILE GCC/hwFeatures.o
===> COMPILE GCC/libperfctr.o
===> COMPILE GCC/luawid.o
===> COMPILE GCC/map.o
===> COMPILE GCC/memsweep.o
===> COMPILE GCC/numa.o
===> COMPILE GCC/numa_hwloc.o
===> COMPILE GCC/numa_proc.o
===> COMPILE GCC/numa_virtual.o
===> COMPILE GCC/pci_hwloc.o
===> COMPILE GCC/pci_proc.o
===> COMPILE GCC/perfgroup.o
===> COMPILE GCC/perfmon.o
===> COMPILE GCC/power.o
===> COMPILE GCC/thermal.o
===> COMPILE GCC/timer.o
===> COMPILE GCC/topology.o
===> COMPILE GCC/topology_cpuid.o
/home/egallegos/likwid/src/topology_cpuid.c: In function ‘intelCpuidFunc_4’:
/home/egallegos/likwid/src/topology_cpuid.c:75:9: warning: implicit declaration of function ‘CPUID’ [-Wimplicit-function-declaration]
   75 | CPUID(eax, ebx, ecx, edx);
      | ^~~~~
===> COMPILE GCC/topology_hwloc.o
===> COMPILE GCC/topology_proc.o
===> COMPILE GCC/tree.o
===> COMPILE GCC/voltage.o
===> COMPILE GCC/loadData.o
/home/egallegos/likwid/src/loadData.S: Assembler messages:
/home/egallegos/likwid/src/loadData.S:1: Error: unknown pseudo-op: `.intel_syntax'
===> ENTER /home/egallegos/likwid/ext/hwloc
In file included from ./hwloc/topology-x86.c:22:
./include/private/cpuid-x86.h: In function ‘likwid_hwloc_x86_cpuid’:
./include/private/cpuid-x86.h:81:2: error: #error unknown architecture
   81 | #error unknown architecture
      | ^~~~~
In file included from ./include/hwloc.h:66,
                 from ./hwloc/topology-x86.c:18:
./hwloc/topology-x86.c: In function ‘hwloc_look_x86’:
./include/hwloc/autogen/config.h:219:26: warning: implicit declaration of function ‘likwid_hwloc_have_x86_cpuid’; did you mean ‘likwid_hwloc_x86_cpuid’? [-Wimplicit-function-declaration]
  219 | #define HWLOC_SYMPREFIX likwid
      | ^~~
./include/hwloc/rename.h:29:33: note: in definition of macro ‘HWLOC_MUNGE_NAME2’
   29 | #define HWLOC_MUNGE_NAME2(a, b) a ## b
      | ^
./include/hwloc/rename.h:30:26: note: in expansion of macro ‘HWLOC_MUNGE_NAME’
   30 | #define HWLOC_NAME(name) HWLOC_MUNGE_NAME(HWLOC_SYMPREFIX, hwloc ## name)
      | ^~~~
./include/hwloc/rename.h:30:43: note: in expansion of macro ‘HWLOC_SYM_PREFIX’
   30 | #define HWLOC_NAME(name) HWLOC_MUNGE_NAME(HWLOC_SYMPREFIX, hwloc ## name)
      | ^~~~
./include/hwloc/rename.h:633:30: note: in expansion of macro ‘HWLOC_NAME’
  633 | #define hwloc_have_x86_cpuid HWLOC_NAME(have_x86_cpuid)
      | ^~~~~~
./hwloc/topology-x86.c:1404:26: note: in expansion of macro ‘hwloc_have_x86_cpuid’
 1404 | if (!src_cpuiddump && !hwloc_have_x86_cpuid())
      | ^~~~~~~~
make[1]: [Makefile:74: GCC/topology-x86.o] Error 1
make: [Makefile:288: /home/egallegos/likwid/ext/hwloc/liblikwid-hwloc.so] Error 2

Here is some information about my system:

$ uname -m
aarch64
$ cat /proc/cpuinfo
processor : 0
BogoMIPS : 108.00
Features : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd08
CPU revision : 3

processor : 1
BogoMIPS : 108.00
Features : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd08
CPU revision : 3

processor : 2
BogoMIPS : 108.00
Features : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd08
CPU revision : 3

processor : 3
BogoMIPS : 108.00
Features : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd08
CPU revision : 3

Hardware : BCM2835
Revision : c03111
Serial : 10000000d74b7af6
Model : Raspberry Pi 4 Model B Rev 1.1
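For readers who want to check which core a /proc/cpuinfo dump like the one above describes, here is a rough sketch. It is an illustration, not how LIKWID does its detection: the function name identify_cores and the lookup table are assumptions, and the table covers only a few Arm Ltd. (implementer 0x41) part numbers.

```python
# Small, non-exhaustive lookup of Arm Ltd. (implementer 0x41) part numbers.
ARM_PARTS = {
    0xD03: "Cortex-A53",
    0xD07: "Cortex-A57",
    0xD08: "Cortex-A72",
    0xD41: "Cortex-A78",
}


def identify_cores(cpuinfo_text: str) -> list:
    """Return one core name per 'CPU part' entry (hypothetical helper)."""
    names = []
    implementer = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank or malformed line
        key, value = key.strip(), value.strip()
        if key == "CPU implementer":
            implementer = int(value, 16)
        elif key == "CPU part":
            part = int(value, 16)
            if implementer == 0x41:
                names.append(ARM_PARTS.get(part, f"unknown part {part:#x}"))
            else:
                names.append("non-Arm implementer")
    return names


# Two processor entries copied from the dump above.
SAMPLE = """processor : 0
CPU implementer : 0x41
CPU part : 0xd08

processor : 1
CPU implementer : 0x41
CPU part : 0xd08
"""
print(identify_cores(SAMPLE))
```

On a live system the same function could be fed the contents of /proc/cpuinfo instead of SAMPLE.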

TomTheBear commented 1 year ago

Set COMPILER to GCCARMv8 in config.mk. Afterwards, run make distclean && make.
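If the switch has to be scripted (for example on several boards), something like the following sketch could work. It assumes that config.mk contains a single assignment of the form COMPILER = NAME, possibly followed by a comment; the sample text below is made up for illustration and is not a copy of the real file, and set_compiler is a hypothetical helper.

```python
import re


def set_compiler(config_text: str, compiler: str) -> str:
    """Replace the value of the first 'COMPILER = ...' assignment (hypothetical helper)."""
    pattern = re.compile(r"^(COMPILER\s*=\s*)\w+", flags=re.MULTILINE)
    # A function replacement avoids any backreference surprises in `compiler`.
    new_text, count = pattern.subn(lambda m: m.group(1) + compiler, config_text, count=1)
    if count == 0:
        raise ValueError("no COMPILER assignment found")
    return new_text


# Illustrative stand-in for config.mk, not the real file contents.
SAMPLE_CONFIG = "# build settings\nCOMPILER = GCC#NO SPACE\nOTHER_OPTION = value\n"
print(set_compiler(SAMPLE_CONFIG, "GCCARMv8"))
```

After writing the result back to config.mk, the build would still need make distclean && make as described above.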

gallegos001 commented 1 year ago

Hi @TomTheBear, thank you very much for your help. I followed your instructions and the build worked. I then tested the installation by running the CloverLeaf mini-app and ran into a problem: LIKWID complained about perf_event_paranoid = 4, see below:

$ sudo likwid-perfctr -C 0-87 -g MEM_DP ./clover_leaf
[sudo] password for egallegos:
Cannot use performance monitoring with perf_event_paranoid = 4

I fixed this issue with the following command, sudo sysctl -w kernel.perf_event_paranoid=2

I tried with 3 and it did not work but it worked with 2.

However, I am now stuck with an access issue with the MEM_DP group:

$ sudo likwid-perfctr -C 0-87 -g MEM_DP ./clover_leaf
CPU name: BCM2835
CPU type: ARM Cortex A72
CPU clock: 0.00 GHz
ERROR - [/home/egallegos/likwid/src/perfgroup.c:perfgroup_readGroup:858] No such file or directory.
Cannot read group file MEM_DP.txt. Searched in /usr/local/share/likwid/perfgroups/arm8/MEM_DP.txt and /root/.likwid/groups/arm8/MEM_DP.txt
ERROR - [/home/egallegos/likwid/src/perfmon.c:perfmon_addEventSet:2229] No such file or directory.
Access to performance group MEM_DP not allowed

Any suggestions on how to fix this problem would be much appreciated. Thanks.

TomTheBear commented 1 year ago

This is expected behavior; search for perf_event_paranoid on the perf_event_open manpage. In short: a lower value gives users more permissions. LIKWID needs a value of 2 or lower to use core-local counters. For uncore counters (like memory controllers), it needs a value of 0 or lower.

BUT: The A72 architecture does not provide enough events to set up a MEM_DP group. All ARM chips provide a basic set of events. This set can be extended by the chip vendors. The basic set does not contain reliable FP events and also no useful memory events. And as far as I remember: Broadcom did not extend the set for the BCM2835.

There are reasonably named events MEM_ACCESS_LD and MEM_ACCESS_ST, but they are not reliable. Measure MEM_ACCESS_LD:PMC0,MEM_ACCESS_ST:PMC1 and compare to LD_SPEC:PMC0,ST_SPEC:PMC1. If the counts match, the MEM_ACCESS* events are wired to "loads/stores to the L1 cache". They might also be wired to some other load and store event (there are a few), but as far as I remember, the MEM_ACCESS* events did not count only memory accesses.

TomTheBear commented 1 year ago

I added some documentation about the perf_event_paranoid settings and LIKWID: https://github.com/RRZE-HPC/likwid/wiki/TutorialLikwidPerf#how-is-counter-access-controlled
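As a rough illustration of the rule discussed in this thread, the sketch below reads the current setting and reports which counter classes should be usable. The thresholds (2 or lower for core-local counters, 0 or lower for uncore counters) are taken from the explanation above; the function names are made up for this example and are not LIKWID functions.

```python
from pathlib import Path

PARANOID_FILE = Path("/proc/sys/kernel/perf_event_paranoid")


def counter_access(level: int) -> dict:
    """Map a perf_event_paranoid value to the counter classes usable per this thread."""
    return {"core": level <= 2, "uncore": level <= 0}


def read_paranoid(path: Path = PARANOID_FILE):
    """Return the current setting, or None if the file is missing or unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


level = read_paranoid()
if level is None:
    print("perf_event_paranoid not available on this system")
else:
    print(level, counter_access(level))
```

With the value 4 reported earlier in the thread, both entries come out False; with 2, only the core-local counters are usable.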

gallegos001 commented 1 year ago

Thank you very much for the explanation and the links to the documentation, that is very helpful. I ran the suggested test. The results are close, but I am not sure whether that means the loads/stores are wired to the L1 cache. Below are the test and its output.

egallegos@luna:~/hpc/CloverLeaf/CloverLeaf_Serial$ sudo sysctl -w kernel.perf_event_paranoid=2
kernel.perf_event_paranoid = 2
egallegos@luna:~/hpc/CloverLeaf/CloverLeaf_Serial$ likwid-pin -c S0:3 -p
3
egallegos@luna:~/hpc/CloverLeaf/CloverLeaf_Serial$ sudo likwid-perfctr -C 0-10 -g MEM_ACCESS_LD:PMC0,MEM_ACCESS_ST:PMC1 ./clover_leaf
--------------------------------------------------------------------------------
CPU name:   BCM2835
CPU type:   ARM Cortex A72
CPU clock:  0.00 GHz
--------------------------------------------------------------------------------

Clover Version    1.300
       MPI Version
   Task Count      1

Clover Version    1.300
       MPI Version
   Task Count      1

 Output file clover.out opened. All output will go there.
--------------------------------------------------------------------------------
Group 1: Custom
+---------------------+---------+--------------+--------------+--------------+--------------+
|        Event        | Counter |  HWThread 0  |  HWThread 1  |  HWThread 2  |  HWThread 3  |
+---------------------+---------+--------------+--------------+--------------+--------------+
| Runtime (RDTSC) [s] |   TSC   | 7.903841e+00 | 7.903841e+00 | 7.903841e+00 | 7.903841e+00 |
|    MEM_ACCESS_LD    |   PMC0  |            0 |            0 |           79 |       568088 |
|    MEM_ACCESS_ST    |   PMC1  |            0 |            0 |           59 |     59582307 |
+---------------------+---------+--------------+--------------+--------------+--------------+

+--------------------------+---------+----------+--------+----------+--------------+
|           Event          | Counter |    Sum   |   Min  |    Max   |      Avg     |
+--------------------------+---------+----------+--------+----------+--------------+
| Runtime (RDTSC) [s] STAT |   TSC   |  31.6154 | 7.9038 |   7.9038 |       7.9038 |
|    MEM_ACCESS_LD STAT    |   PMC0  |   568167 |      0 |   568088 |  142041.7500 |
|    MEM_ACCESS_ST STAT    |   PMC1  | 59582366 |      0 | 59582307 | 1.489559e+07 |
+--------------------------+---------+----------+--------+----------+--------------+

egallegos@luna:~/hpc/CloverLeaf/CloverLeaf_Serial$ sudo likwid-perfctr -C 0-10 -g LD_SPEC:PMC0,ST_SPEC:PMC1 ./clover_leaf
--------------------------------------------------------------------------------
CPU name:   BCM2835
CPU type:   ARM Cortex A72
CPU clock:  0.00 GHz
--------------------------------------------------------------------------------

Clover Version    1.300
       MPI Version
   Task Count      1

Clover Version    1.300
       MPI Version
   Task Count      1

 Output file clover.out opened. All output will go there.
--------------------------------------------------------------------------------
Group 1: Custom
+---------------------+---------+--------------+--------------+--------------+--------------+
|        Event        | Counter |  HWThread 0  |  HWThread 1  |  HWThread 2  |  HWThread 3  |
+---------------------+---------+--------------+--------------+--------------+--------------+
| Runtime (RDTSC) [s] |   TSC   | 7.836951e+00 | 7.836951e+00 | 7.836951e+00 | 7.836951e+00 |
|       LD_SPEC       |   PMC0  |            0 |            8 |            0 |       619553 |
|       ST_SPEC       |   PMC1  |            0 |            1 |            0 |     75080645 |
+---------------------+---------+--------------+--------------+--------------+--------------+

+--------------------------+---------+----------+--------+----------+--------------+
|           Event          | Counter |    Sum   |   Min  |    Max   |      Avg     |
+--------------------------+---------+----------+--------+----------+--------------+
| Runtime (RDTSC) [s] STAT |   TSC   |  31.3478 | 7.8370 |   7.8370 |       7.8370 |
|       LD_SPEC STAT       |   PMC0  |   619561 |      0 |   619553 |  154890.2500 |
|       ST_SPEC STAT       |   PMC1  | 75080646 |      0 | 75080645 | 1.877016e+07 |
+--------------------------+---------+----------+--------+----------+--------------+

egallegos@luna:~/hpc/CloverLeaf/CloverLeaf_Serial$ 

gallegos001 commented 1 year ago

I also ran other basic tests, which indicate that the installation is good and detects the correct CPU.

egallegos@luna:~$ likwid-topology
--------------------------------------------------------------------------------
CPU name:   BCM2835
CPU type:   ARM Cortex A72
CPU stepping:   3
********************************************************************************
Hardware Thread Topology
********************************************************************************
Sockets:        1
Cores per socket:   4
Threads per core:   1
--------------------------------------------------------------------------------
HWThread        Thread        Core        Die        Socket        Available
0               0             0           0          0             *                
1               0             1           0          0             *                
2               0             2           0          0             *                
3               0             3           0          0             *                
--------------------------------------------------------------------------------
Socket 0:       ( 0 1 2 3 )
--------------------------------------------------------------------------------
********************************************************************************
Cache Topology
********************************************************************************
Level:          1
Size:           32 kB
Cache groups:       ( 0 ) ( 1 ) ( 2 ) ( 3 )
--------------------------------------------------------------------------------
Level:          2
Size:           1 MB
Cache groups:       ( 0 1 2 3 )
--------------------------------------------------------------------------------
********************************************************************************
NUMA Topology
********************************************************************************
NUMA domains:       1
--------------------------------------------------------------------------------
Domain:         0
Processors:     ( 0 1 2 3 )
Distances:      10
Free memory:        3477.34 MB
Total memory:       3789.41 MB
--------------------------------------------------------------------------------
egallegos@luna:~$ likwid-mpirun -omp gnu -n 4 ./helloworld-mpi
Hello World! I am processor luna, rank 0 of 4 processors
Hello World! I am processor luna, rank 1 of 4 processors
Hello World! I am processor luna, rank 2 of 4 processors
Hello World! I am processor luna, rank 3 of 4 processors
egallegos@luna:~$ 

Thank you again for your support.

TomTheBear commented 1 year ago

Please use a benchmark where you can control the numbers, like likwid-bench.

likwid-perfctr -C 0 -g MEM_ACCESS_LD:PMC0,LD_SPEC:PMC1,MEM_ACCESS_ST:PMC2,ST_SPEC:PMC3 -m likwid-bench -t copy -W N:20kB:1

This test runs completely in the L1 cache (dataset 20kB). If the MEM_ACCESS* counts still increase, the events are not reliable. The *_SPEC events might be a little higher than expected because they count speculatively executed loads/stores, not retired ones.

Based on your results: 90% of loads go into memory? 80% of writes go into memory?
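The percentages in the question above can be reproduced from the HWThread 3 counts of the two CloverLeaf runs posted earlier. This is only a sanity-check sketch: the two event pairs come from separate runs, so the ratios are approximate, and the helper name ratio_percent is made up.

```python
def ratio_percent(mem_access: int, spec: int) -> float:
    """MEM_ACCESS_* count as a percentage of the matching *_SPEC count."""
    return 100.0 * mem_access / spec


# HWThread 3 values from the two CloverLeaf measurements in this thread.
loads = ratio_percent(568088, 619553)        # MEM_ACCESS_LD vs. LD_SPEC
stores = ratio_percent(59582307, 75080645)   # MEM_ACCESS_ST vs. ST_SPEC

print(f"loads:  {loads:.1f}% of LD_SPEC")
print(f"stores: {stores:.1f}% of ST_SPEC")
```

The results land at roughly 92% for loads and 79% for stores, which is where the "90%" and "80%" in the question come from.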

gallegos001 commented 1 year ago

The results indicate that all the writes and loads go into memory. Here are the results.

egallegos@luna:~/hpc/CloverLeaf/CloverLeaf_Serial$ likwid-perfctr -C 0-87 -g MEM_ACCESS_LD:PMC0,LD_SPEC:PMC1,MEM_ACCESS_ST:PMC2,ST_SPEC:PMC3 -m likwid-bench -t copy -W N:20kB:1 ./clover_leaf
--------------------------------------------------------------------------------
CPU name:   BCM2835
CPU type:   ARM Cortex A72
CPU clock:  0.00 GHz
--------------------------------------------------------------------------------
Allocate: Process running on hwthread 0 (Domain N) - Vector length 1250/10000 Offset 0 Alignment 512
Allocate: Process running on hwthread 0 (Domain N) - Vector length 1250/10000 Offset 0 Alignment 512
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: copy
--------------------------------------------------------------------------------
Using 1 work groups
Using 1 threads
--------------------------------------------------------------------------------
Using Likwid Marker API
--------------------------------------------------------------------------------
Group: 0 Thread 0 Global Thread 0 running on hwthread 0 - Vector length 1250 Offset 0
--------------------------------------------------------------------------------
Cycles:         1780788
CPU Clock:      917356
Cycle Clock:        0
Time:           1.815472e+00 sec
Iterations:     1048576
Iterations per thread:  1048576
Inner loop executions:  1250
Size (Byte):        20000
Size per thread:    20000
Number of Flops:    0
MFlops/s:       0.00
Data volume (Byte): 20971520000
MByte/s:        11551.55
Cycles per update:  0.001359
Cycles per cacheline:   0.010869
Loads per update:   1
Stores per update:  1
Load bytes per element: 8
Store bytes per elem.:  8
Load/store ratio:   1.00
Instructions:       14417920016
UOPs:           10485760000
--------------------------------------------------------------------------------
Writing Likwid Marker API results to file /tmp/likwid_1824.txt
--------------------------------------------------------------------------------
Region bench, Group 1: Custom
+-------------------+------------+
|    Region Info    | HWThread 0 |
+-------------------+------------+
| RDTSC Runtime [s] |   1.780789 |
|     call count    |          1 |
+-------------------+------------+

+---------------------+---------+--------------+
|        Event        | Counter |  HWThread 0  |
+---------------------+---------+--------------+
| Runtime (RDTSC) [s] |   TSC   | 1.780789e+00 |
|    MEM_ACCESS_LD    |   PMC0  |   1324355000 |
|       LD_SPEC       |   PMC1  |   1328546000 |
|    MEM_ACCESS_ST    |   PMC2  |   1330633000 |
|       ST_SPEC       |   PMC3  |   1325399000 |
+---------------------+---------+--------------+

egallegos@luna:~/hpc/CloverLeaf/CloverLeaf_Serial$ 

TomTheBear commented 1 year ago

This was my expectation and it proves my point. The benchmark uses a dataset of only 20kB, so there will be no memory traffic (after the initial fetch of the 20kB). All data should stay in the L1 cache of the single core. But the MEM_ACCESS* events count in the same fashion as the LD/ST_SPEC events. This means that the MEM_ACCESS* events do not reliably count actual memory accesses but something else that is in line with loads and stores.
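The argument can be checked against the likwid-bench output posted above. The sketch below derives the expected number of loads and stores from the reported iteration counts (one load and one store per update) and compares it with the four measured counters. The variable names are illustrative; all numbers are copied from the output in this thread.

```python
# Values reported by likwid-bench in the run above.
iterations = 1048576
inner_loop_executions = 1250
bytes_per_update = 8 + 8  # 8 load bytes + 8 store bytes per element

updates = iterations * inner_loop_executions  # one load and one store each
data_volume = updates * bytes_per_update      # matches "Data volume (Byte)"

measured = {
    "MEM_ACCESS_LD": 1324355000,
    "LD_SPEC": 1328546000,
    "MEM_ACCESS_ST": 1330633000,
    "ST_SPEC": 1325399000,
}

print("expected loads/stores:", updates)
print("data volume (bytes):  ", data_volume)
for name, count in measured.items():
    print(f"{name:14s} {count / updates:.3f} x expected")
```

All four counters sit within about 2% of the expected 1,310,720,000 loads or stores, although the 20kB working set fits in the L1 cache. That is consistent with the conclusion that MEM_ACCESS* tracks executed loads and stores rather than traffic to main memory.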