RRZE-HPC / likwid

Performance monitoring and benchmarking suite
https://hpc.fau.de/research/tools/likwid/
GNU General Public License v3.0

[BUG] perfctr crashes on a64fx #599

Open jdomke opened 7 months ago

jdomke commented 7 months ago

Describe the bug

likwid-perfctr aborts with a different "Aborted (core dumped)" error depending on how long the sleep command runs:

 $ likwid-perfctr -C 0 -g L2 sleep 1
--------------------------------------------------------------------------------
CPU name:
CPU type:       Fujitsu A64FX
CPU clock:      0.00 GHz
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
malloc(): unaligned tcache chunk detected
[1]+  Aborted                 (core dumped) likwid-perfctr -C 0 -g L2 sleep 1
Aborted (core dumped)
$ likwid-perfctr -C 0 -g L2 sleep 2
--------------------------------------------------------------------------------
CPU name:
CPU type:       Fujitsu A64FX
CPU clock:      0.00 GHz
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Group 1: L2
+------------------+---------+------------+
<<snip>>
|    L1<->L2 data volume [GBytes]    |     0.0020 |
+------------------------------------+------------+

double free or corruption (out)
Aborted (core dumped)

To Reproduce

To Reproduce with a LIKWID command

Please supply the output of the command with -V 3 added to the command:

+------------------------------------+------------+
|               Metric               | HWThread 0 |
+------------------------------------+------------+
|        Runtime (RDTSC) [s]         |     1.0025 |
|                CPI                 |     1.7634 |
| L1D<-L2 load bandwidth [MBytes/s]  |     0.8843 |
| L1D<-L2 load data volume [GBytes]  |     0.0009 |
| L1D->L2 evict bandwidth [MBytes/s] |     0.3248 |
| L1D->L2 evict data volume [GBytes] |     0.0003 |
| L1I<-L2 load bandwidth [MBytes/s]  |     0.9895 |
| L1I<-L2 load data volume [GBytes]  |     0.0010 |
|    L1<->L2 bandwidth [MBytes/s]    |     2.1986 |
|    L1<->L2 data volume [GBytes]    |     0.0022 |
+------------------------------------+------------+

double free or corruption (out)

jdomke commented 7 months ago

Note: using FCC results in similar crashes.

jdomke commented 7 months ago

The issue results from having disabled cores in a 24-core version of A64FX (the chip has all 48 cores, but only 24 are active). Unlike on Intel/AMD, the kernel does not mask/map the core IDs to be consecutive, so the reported IDs have gaps. This is visible here:

DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 0 Thread 0 Core 0 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 1 Thread 0 Core 1 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 2 Thread 0 Core 6 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 3 Thread 0 Core 7 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 4 Thread 0 Core 8 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 5 Thread 0 Core 10 Die 0 Socket 0 inCpuSet 1

for one of the CMGs of the chip.

I was able to "fix" this part with

diff --git a/src/topology_proc.c b/src/topology_proc.c
index 398be11f..77fa871a 100644
--- a/src/topology_proc.c
+++ b/src/topology_proc.c
@@ -602,6 +602,7 @@ proc_init_nodeTopology(cpu_set_t cpuSet)
     int (*ownatoi)(const char*);
     ownatoi = &atoi;
     int last_socket = -1;
+    int last_coreid = -1;
     int num_sockets = 0;
     int num_cores_per_socket = 0;
     int num_threads_per_core = 0;
@@ -631,6 +632,7 @@ proc_init_nodeTopology(cpu_set_t cpuSet)
             {
                 num_sockets++;
                 last_socket = packageId;
+                last_coreid = -1;
             }
             fclose(fp);
         }
@@ -639,7 +641,7 @@ proc_init_nodeTopology(cpu_set_t cpuSet)
         if (NULL != (fp = fopen (bdata(file), "r")))
         {
             bstring src = bread ((bNread) fread, fp);
-            hwThreadPool[i].coreId = ownatoi(bdata(src));
+            hwThreadPool[i].coreId = (++last_coreid); //ownatoi(bdata(src));
             if (hwThreadPool[i].packageId == 0)
             {
                 num_cores_per_socket++;

but this only moves the error to other parts of the code. I think LIKWID has severe issues when cores, sockets, cache domains, etc. are not in ideal conditions (for example, when their IDs are not numbered consecutively).
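
For illustration only, here is a minimal standalone sketch of a remapping that keeps hardware threads with the same raw core ID on the same remapped core, which the counter in the patch above does not do because it increments for every hardware thread. It is not LIKWID code: HwThreadSketch and densify_core_ids are hypothetical names, and the struct only mirrors the packageId and coreId fields that appear in the diff. It assumes the raw IDs have already been read into the array.

```c
#include <stdlib.h>

/* Hypothetical stand-in for one hardware-thread entry; only the two
 * fields needed for the illustration are included. */
typedef struct {
    int packageId;
    int coreId;   /* raw core ID as reported by the kernel, may have gaps */
} HwThreadSketch;

/* Rewrite coreId in place so that, within each package, the distinct raw
 * IDs become 0..n-1 in order of first appearance. Threads that share a
 * raw ID within a package keep sharing the remapped ID.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int densify_core_ids(HwThreadSketch *threads, int count)
{
    if (threads == NULL || count <= 0)
        return 0;

    /* Keep a copy of the raw IDs, because coreId is overwritten below. */
    int *raw = malloc((size_t)count * sizeof(*raw));
    if (raw == NULL)
        return -1;
    for (int i = 0; i < count; i++)
        raw[i] = threads[i].coreId;

    for (int i = 0; i < count; i++) {
        int dense = -1;  /* remapped ID of an earlier thread on the same core */
        int next = 0;    /* distinct cores seen so far in this package */

        for (int j = 0; j < i; j++) {
            if (threads[j].packageId != threads[i].packageId)
                continue;
            if (raw[j] == raw[i]) {
                dense = threads[j].coreId;  /* entry j is already remapped */
                break;
            }
            if (threads[j].coreId + 1 > next)
                next = threads[j].coreId + 1;
        }
        threads[i].coreId = (dense >= 0) ? dense : next;
    }

    free(raw);
    return 0;
}
```

With the raw IDs from the debug output above (0, 1, 6, 7, 8, 10 on socket 0), this yields 0 through 5. The quadratic search is kept for brevity; whether a remapping like this is enough to avoid the later crashes is an open question, since other parts of the code may depend on the raw IDs.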