Open jdomke opened 7 months ago
note: using FCC results in similar crashes
The issue results from disabled cores in the 24-core version of the A64FX (the chip has all 48 cores, but only 24 are active). Unlike on Intel/AMD, the kernel does not mask/map the core IDs to be consecutive. This is visible here:
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 0 Thread 0 Core 0 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 1 Thread 0 Core 1 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 2 Thread 0 Core 6 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 3 Thread 0 Core 7 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 4 Thread 0 Core 8 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 5 Thread 0 Core 10 Die 0 Socket 0 inCpuSet 1
for one of the CMGs of the chip.
I was able to "fix" this part with
diff --git a/src/topology_proc.c b/src/topology_proc.c
index 398be11f..77fa871a 100644
--- a/src/topology_proc.c
+++ b/src/topology_proc.c
@@ -602,6 +602,7 @@ proc_init_nodeTopology(cpu_set_t cpuSet)
int (*ownatoi)(const char*);
ownatoi = &atoi;
int last_socket = -1;
+ int last_coreid = -1;
int num_sockets = 0;
int num_cores_per_socket = 0;
int num_threads_per_core = 0;
@@ -631,6 +632,7 @@ proc_init_nodeTopology(cpu_set_t cpuSet)
{
num_sockets++;
last_socket = packageId;
+ last_coreid = -1;
}
fclose(fp);
}
@@ -639,7 +641,7 @@ proc_init_nodeTopology(cpu_set_t cpuSet)
if (NULL != (fp = fopen (bdata(file), "r")))
{
bstring src = bread ((bNread) fread, fp);
- hwThreadPool[i].coreId = ownatoi(bdata(src));
+ hwThreadPool[i].coreId = (++last_coreid); //ownatoi(bdata(src));
if (hwThreadPool[i].packageId == 0)
{
num_cores_per_socket++;
but it only moves the error to other parts of the code. I think likwid has severe issues when cores, sockets, cache domains, etc. are not in ideal conditions.
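For illustration, here is a minimal sketch of a slightly more general remapping than the one-line change in the diff. The diff assigns a fresh ID to every hardware thread, which works on the A64FX because it has no SMT, but would give two threads of the same physical core different core IDs on an SMT machine. The sketch maps each distinct (package, raw core ID) pair to a dense per-package index instead. The `HwThread` struct and `remap_core_ids` are hypothetical names made up for this example, not LIKWID's actual data structures or API, and nothing here addresses the later crashes in other parts of the code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a per-hardware-thread record;
 * this is NOT LIKWID's real struct. */
typedef struct {
    int packageId;  /* socket ID as reported by the kernel */
    int rawCoreId;  /* core ID as reported by the kernel, may have gaps */
    int coreId;     /* dense index filled in by remap_core_ids() */
} HwThread;

/* Assign consecutive core indices (0, 1, 2, ...) within each package.
 * Threads sharing the same (packageId, rawCoreId) get the same index.
 * Returns the number of distinct physical cores found.
 * Quadratic in n, which is acceptable for a few hundred threads. */
static int remap_core_ids(HwThread *t, int n)
{
    assert(t != NULL || n == 0);
    int total = 0;
    for (int i = 0; i < n; i++) {
        int next = 0;    /* next unused dense index in this package */
        int match = -1;  /* dense index of an earlier thread on the same core */
        for (int j = 0; j < i; j++) {
            if (t[j].packageId != t[i].packageId)
                continue;
            if (t[j].rawCoreId == t[i].rawCoreId) {
                match = t[j].coreId;
                break;
            }
            if (t[j].coreId + 1 > next)
                next = t[j].coreId + 1;
        }
        if (match >= 0) {
            t[i].coreId = match;
        } else {
            t[i].coreId = next;
            total++;
        }
    }
    return total;
}
```

Fed the raw core IDs from the debug output above (0, 1, 6, 7, 8, 10, all on socket 0), this would assign the dense indices 0 through 5. Keeping the raw ID next to the dense one means the kernel's numbering is still available to any code that needs it.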
Describe the bug
likwid-perfctr throws different Aborted (core dumped) errors depending on runtime of the sleep command
To Reproduce
To Reproduce with a LIKWID command
Please supply the output of the command with -V 3 added to the command:
+------------------------------------+------------+
| Metric                             | HWThread 0 |
+------------------------------------+------------+
| Runtime (RDTSC) [s]                |     1.0025 |
| CPI                                |     1.7634 |
| L1D<-L2 load bandwidth [MBytes/s]  |     0.8843 |
| L1D<-L2 load data volume [GBytes]  |     0.0009 |
| L1D->L2 evict bandwidth [MBytes/s] |     0.3248 |
| L1D->L2 evict data volume [GBytes] |     0.0003 |
| L1I<-L2 load bandwidth [MBytes/s]  |     0.9895 |
| L1I<-L2 load data volume [GBytes]  |     0.0010 |
| L1<->L2 bandwidth [MBytes/s]       |     2.1986 |
| L1<->L2 data volume [GBytes]       |     0.0022 |
+------------------------------------+------------+
double free or corruption (out)