RRZE-HPC / likwid

Performance monitoring and benchmarking suite
https://hpc.fau.de/research/tools/likwid/
GNU General Public License v3.0

[BUG] malloc error when using likwid-pin on ARM Jetson AGX Xavier (ARMv8) #488

Open SF-N opened 1 year ago

SF-N commented 1 year ago

With likwid-pin -- Version 5.2.2 (commit: 233ab943543480cd46058b34616c174198ba0459) I get the following error on an ARMv8 processor (running Linux) right at startup, before the pinned program begins, e.g. when calling likwid-pin -c S0:0-3 ./executable:

malloc(): invalid size (unsorted)
Aborted (core dumped)
TomTheBear commented 1 year ago

It is hard to tell where the problem is exactly. It might be the build options for the Lua interpreter or inside the LIKWID library. To find the location, run the failing command under gdb as sketched below; as soon as it stops, type bt for a backtrace and supply the output.
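
For example (likwid-pin is a Lua script executed by the likwid-lua interpreter, so gdb has to be attached to that interpreter, as the session in the next comment also shows):

gdb likwid-lua
(gdb) r likwid-pin -c S0:0-3 ./executable
(gdb) bt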

SF-N commented 1 year ago

Thanks. When trying this inside the likwid-5.2.2 folder, I get:

icarus@ubuntu:~/likwid-5.2.2$ gdb likwid-lua
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "aarch64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from likwid-lua...
(gdb) r likwid-pin -c S0:0-3 ./linpackc 
Starting program: /usr/local/bin/likwid-lua likwid-pin -c S0:0-3 ./linpackc
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
malloc(): invalid size (unsorted)

Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x0000fffff7e10aac in __GI_abort () at abort.c:79
#2  0x0000fffff7e5df40 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0xfffff7f1f518 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x0000fffff7e65344 in malloc_printerr (str=str@entry=0xfffff7f1b4b0 "malloc(): invalid size (unsorted)") at malloc.c:5347
#4  0x0000fffff7e67edc in _int_malloc (av=av@entry=0xfffff7f5ea98 <main_arena>, bytes=bytes@entry=32) at malloc.c:3736
#5  0x0000fffff7e694ac in __GI___libc_malloc (bytes=32) at malloc.c:3058
#6  0x0000fffff6d5aa88 in create_lookups () at ./src/affinity.c:201
#7  0x0000fffff6d5c604 in affinity_init () at ./src/affinity.c:645
#8  0x0000fffff6d46d90 in lua_likwid_getNumaInfo (L=0xaaaaaaab52a8) at ./src/luawid.c:1172
#9  0x0000fffff7f8fb9c in luaD_precall (L=L@entry=0xaaaaaaab52a8, func=func@entry=0xaaaaaaabc7a0, nresults=nresults@entry=1) at ./src/ldo.c:360
#10 0x0000fffff7fa3968 in luaV_execute (L=L@entry=0xaaaaaaab52a8) at ./src/lvm.c:1115
#11 0x0000fffff7f8ffbc in luaD_call (L=L@entry=0xaaaaaaab52a8, func=<optimized out>, nResults=<optimized out>) at ./src/ldo.c:491
#12 0x0000fffff7f90000 in luaD_callnoyield (L=0xaaaaaaab52a8, func=<optimized out>, nResults=<optimized out>) at ./src/ldo.c:501
#13 0x0000fffff7f8f37c in luaD_rawrunprotected (L=L@entry=0xaaaaaaab52a8, f=f@entry=0xfffff7fa82b8 <f_call>, ud=ud@entry=0xffffffffeb18) at ./src/ldo.c:142
#14 0x0000fffff7f90298 in luaD_pcall (L=L@entry=0xaaaaaaab52a8, func=func@entry=0xfffff7fa82b8 <f_call>, u=u@entry=0xffffffffeb18, old_top=80, ef=<optimized out>) at ./src/ldo.c:722
#15 0x0000fffff7fa9c34 in lua_pcallk (L=0xaaaaaaab52a8, nargs=<optimized out>, nresults=-1, errfunc=<optimized out>, ctx=<optimized out>, k=<optimized out>) at ./src/lapi.c:968
#16 0x0000aaaaaaaa1ad4 in docall (L=0xaaaaaaab52a8, narg=3, nres=-1) at ./src/lua.c:203
#17 0x0000aaaaaaaa2810 in handle_script (argv=<optimized out>, L=0xaaaaaaab52a8) at ./src/lua.c:443
#18 pmain (L=0xaaaaaaab52a8) at ./src/lua.c:577
#19 0x0000fffff7f8fb9c in luaD_precall (L=L@entry=0xaaaaaaab52a8, func=0xaaaaaaab58d0, nresults=1) at ./src/ldo.c:360
#20 0x0000fffff7f8ff80 in luaD_call (L=L@entry=0xaaaaaaab52a8, func=<optimized out>, nResults=<optimized out>) at ./src/ldo.c:490
#21 0x0000fffff7f90000 in luaD_callnoyield (L=0xaaaaaaab52a8, func=<optimized out>, nResults=<optimized out>) at ./src/ldo.c:501
#22 0x0000fffff7f8f37c in luaD_rawrunprotected (L=L@entry=0xaaaaaaab52a8, f=f@entry=0xfffff7fa82b8 <f_call>, ud=ud@entry=0xffffffffee68) at ./src/ldo.c:142
#23 0x0000fffff7f90298 in luaD_pcall (L=L@entry=0xaaaaaaab52a8, func=func@entry=0xfffff7fa82b8 <f_call>, u=u@entry=0xffffffffee68, old_top=16, ef=<optimized out>) at ./src/ldo.c:722
#24 0x0000fffff7fa9c34 in lua_pcallk (L=0xaaaaaaab52a8, nargs=<optimized out>, nresults=1, errfunc=<optimized out>, ctx=<optimized out>, k=<optimized out>) at ./src/lapi.c:968
#25 0x0000aaaaaaaa1878 in main (argc=5, argv=0xfffffffff008) at ./src/lua.c:603
(gdb) 

Does this help?

TomTheBear commented 1 year ago

Yes, it helps, thank you.

But it is still not easy to find the problem. The only reason I can think of would be a failure in detecting how many hardware threads the system has. This is of course fundamental information and I'm surprised execution gets that far without it.

Can you please send me the content of /proc/cpuinfo? It might be a bug earlier in the execution, in the cpuinfo parser for ARM.

SF-N commented 1 year ago
icarus@ubuntu:/$ cat /proc/cpuinfo 
processor   : 0
model name  : ARMv8 Processor rev 0 (v8l)
BogoMIPS    : 62.50
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm dcpop
CPU implementer : 0x4e
CPU architecture: 8
CPU variant : 0x0
CPU part    : 0x004
CPU revision    : 0
MTS version : 55637613

processor   : 1
model name  : ARMv8 Processor rev 0 (v8l)
BogoMIPS    : 62.50
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm dcpop
CPU implementer : 0x4e
CPU architecture: 8
CPU variant : 0x0
CPU part    : 0x004
CPU revision    : 0
MTS version : 55637613

processor   : 2
model name  : ARMv8 Processor rev 0 (v8l)
BogoMIPS    : 62.50
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm dcpop
CPU implementer : 0x4e
CPU architecture: 8
CPU variant : 0x0
CPU part    : 0x004
CPU revision    : 0
MTS version : 55637613

processor   : 3
model name  : ARMv8 Processor rev 0 (v8l)
BogoMIPS    : 62.50
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm dcpop
CPU implementer : 0x4e
CPU architecture: 8
CPU variant : 0x0
CPU part    : 0x004
CPU revision    : 0
MTS version : 55637613

processor   : 4
model name  : ARMv8 Processor rev 0 (v8l)
BogoMIPS    : 62.50
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm dcpop
CPU implementer : 0x4e
CPU architecture: 8
CPU variant : 0x0
CPU part    : 0x004
CPU revision    : 0
MTS version : 55637613

processor   : 5
model name  : ARMv8 Processor rev 0 (v8l)
BogoMIPS    : 62.50
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm dcpop
CPU implementer : 0x4e
CPU architecture: 8
CPU variant : 0x0
CPU part    : 0x004
CPU revision    : 0
MTS version : 55637613

processor   : 6
model name  : ARMv8 Processor rev 0 (v8l)
BogoMIPS    : 62.50
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm dcpop
CPU implementer : 0x4e
CPU architecture: 8
CPU variant : 0x0
CPU part    : 0x004
CPU revision    : 0
MTS version : 55637613

processor   : 7
model name  : ARMv8 Processor rev 0 (v8l)
BogoMIPS    : 62.50
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm dcpop
CPU implementer : 0x4e
CPU architecture: 8
CPU variant : 0x0
CPU part    : 0x004
CPU revision    : 0
MTS version : 55637613
TomTheBear commented 1 year ago

I tested the current parser and found a bug, but it is not relevant to your issue.

Can you please run gdb again and, when it fails, do a print(cputopo->numHWThreads)?

And provide the content of the files /sys/devices/system/cpu/present and /sys/devices/system/cpu/online.

As a last resort, comment out the line DEFINES += -DLIKWID_USE_HWLOC in make/config_defines.mk and rebuild (make distclean && make). This will use a different parser.

SF-N commented 1 year ago

I get:

(gdb) print(cputopo->numHWThreads)
$1 = 8

And

cat /sys/devices/system/cpu/present
0-7
cat /sys/devices/system/cpu/online
0-7

When commenting out line 360 (#DEFINES += -DLIKWID_USE_HWLOC), I get the following error during make:

===>  COMPILE  GCCARMv8/topology_hwloc.o
./src/topology_hwloc.c:50:1: error: unknown type name ‘hwloc_topology_t’
   50 | hwloc_topology_t hwloc_topology = NULL;
      | ^~~~~~~~~~~~~~~~
./src/topology_hwloc.c:50:35: warning: initialization of ‘int’ from ‘void *’ makes integer from pointer without a cast [-Wint-conversion]
   50 | hwloc_topology_t hwloc_topology = NULL;
      |                                   ^~~~
make: *** [Makefile:302: GCCARMv8/topology_hwloc.o] Error 1
TomTheBear commented 1 year ago

OK, very surprising. In the failing line, cputopo->numHWThreads is the only value that could lead to an 'invalid size', and there are two other malloc calls with the same inputs a few lines earlier which seem to work. I have to think about other causes.
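
For reference, "malloc(): invalid size (unsorted)" is one of glibc's heap-consistency checks, so the aborting malloc is usually not the culprit; some earlier out-of-bounds write has typically corrupted the heap metadata that malloc later inspects. A deliberately broken toy program, unrelated to the LIKWID sources, that shows the pattern:

#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *a = malloc(16);
    char *b = malloc(32);      /* neighbouring heap chunk */
    if (!a || !b)
        return 1;
    memset(a, 0x41, 64);       /* BUG: writes far past the end of 'a' and
                                  smashes the metadata of the chunk holding 'b' */
    free(b);                   /* depending on the allocator state, glibc aborts
                                  here or in a later malloc/free with a
                                  heap-corruption message such as
                                  "malloc(): invalid size (unsorted)" */
    free(a);
    return 0;
}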

It seems I have broken the "disabling of hwloc" build path at some point in the past. Hwloc works on almost all systems, which is why building without it is rarely, if ever, tested.
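
A minimal sketch of the usual guard pattern, assuming the LIKWID_USE_HWLOC define from make/config_defines.mk is what reaches the compiler; the variable names are hypothetical and this is an illustration, not the actual fix:

/* compiles both with and without -DLIKWID_USE_HWLOC, e.g. gcc -c sketch.c */
#ifdef LIKWID_USE_HWLOC
#include <hwloc.h>
hwloc_topology_t hwloc_topology = NULL;   /* hwloc-backed topology handle */
int topology_uses_hwloc = 1;
#else
int topology_uses_hwloc = 0;              /* fall back to the /proc/cpuinfo parser */
#endif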

TomTheBear commented 1 year ago

Can you please run likwid-topology -V 3 and send the output? I assume there is some failure before the actual crash, such as "no NUMA domains". I have seen that in the past on exotic hardware.

SF-N commented 1 year ago

Here it is:

likwid-topology -V 3
DEBUG - [proc_init_cpuInfo:336] PROC CpuInfo Family 8 Model 0 Stepping 0 isIntel 0 numHWThreads 8
DEBUG - [proc_init_nodeTopology:712] PROC Thread Pool PU 0 Thread 0 Core 0 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:712] PROC Thread Pool PU 1 Thread 0 Core 1 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:712] PROC Thread Pool PU 2 Thread 0 Core 0 Die 0 Socket 1 inCpuSet 1
DEBUG - [proc_init_nodeTopology:712] PROC Thread Pool PU 3 Thread 0 Core 1 Die 0 Socket 1 inCpuSet 1
DEBUG - [proc_init_nodeTopology:712] PROC Thread Pool PU 4 Thread 0 Core 0 Die 0 Socket 2 inCpuSet 1
DEBUG - [proc_init_nodeTopology:712] PROC Thread Pool PU 5 Thread 0 Core 1 Die 0 Socket 2 inCpuSet 1
DEBUG - [proc_init_nodeTopology:712] PROC Thread Pool PU 6 Thread 0 Core 0 Die 0 Socket 3 inCpuSet 1
DEBUG - [proc_init_nodeTopology:712] PROC Thread Pool PU 7 Thread 0 Core 1 Die 0 Socket 3 inCpuSet 1
DEBUG - [affinity_init:539] Affinity: Socket domains 4
DEBUG - [affinity_init:541] Affinity: CPU die domains 4
DEBUG - [affinity_init:546] Affinity: CPU cores per LLC 8
DEBUG - [affinity_init:549] Affinity: Cache domains 0
DEBUG - [affinity_init:553] Affinity: NUMA domains 1
DEBUG - [affinity_init:554] Affinity: All domains 10
DEBUG - [affinity_addNodeDomain:370] Affinity domain N: 8 HW threads on 8 cores
DEBUG - [affinity_addSocketDomain:401] Affinity domain S0: 2 HW threads on 2 cores
DEBUG - [affinity_addSocketDomain:401] Affinity domain S1: 2 HW threads on 2 cores
DEBUG - [affinity_addSocketDomain:401] Affinity domain S2: 2 HW threads on 2 cores
DEBUG - [affinity_addSocketDomain:401] Affinity domain S3: 2 HW threads on 2 cores
DEBUG - [affinity_addDieDomain:438] Affinity domain D0: 2 HW threads on 2 cores
DEBUG - [affinity_addDieDomain:438] Affinity domain D1: 2 HW threads on 2 cores
DEBUG - [affinity_addDieDomain:438] Affinity domain D2: 2 HW threads on 2 cores
DEBUG - [affinity_addDieDomain:438] Affinity domain D3: 2 HW threads on 2 cores
DEBUG - [affinity_addCacheDomain:474] Affinity domain C0: 2 HW threads on 2 cores
DEBUG - [affinity_addCacheDomain:474] Affinity domain C0: 2 HW threads on 2 cores
DEBUG - [affinity_addCacheDomain:474] Affinity domain C0: 2 HW threads on 2 cores
DEBUG - [affinity_addCacheDomain:474] Affinity domain C0: 2 HW threads on 2 cores
DEBUG - [affinity_addMemoryDomain:504] Affinity domain M0: 8 HW threads on 2 cores
DEBUG - [create_lookups:290] T 0 T2C 0 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 1 T2C 1 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 2 T2C 0 T2S 1 T2D 1 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 3 T2C 1 T2S 1 T2D 1 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 4 T2C 0 T2S 2 T2D 2 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 5 T2C 1 T2S 2 T2D 2 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 6 T2C 0 T2S 3 T2D 3 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 7 T2C 1 T2S 3 T2D 3 T2LLC 0 T2M 0
--------------------------------------------------------------------------------
CPU name:   
CPU type:   nil
CPU stepping:   0
********************************************************************************
Hardware Thread Topology
********************************************************************************
Sockets:        4
Cores per socket:   2
Threads per core:   1
--------------------------------------------------------------------------------
HWThread        Thread        Core        Die        Socket        Available
0               0             0           0          0             *                
1               0             1           0          0             *                
2               0             0           0          1             *                
3               0             1           0          1             *                
4               0             0           0          2             *                
5               0             1           0          2             *                
6               0             0           0          3             *                
7               0             1           0          3             *                
--------------------------------------------------------------------------------
Socket 0:       ( 0 1 )
Socket 1:       ( 2 3 )
Socket 2:       ( 4 5 )
Socket 3:       ( 6 7 )
--------------------------------------------------------------------------------
********************************************************************************
Cache Topology
********************************************************************************
Level:          1
Size:           64 kB
Cache groups:       ( 0 ) ( 1 ) ( 2 ) ( 3 ) ( 4 ) ( 5 ) ( 6 ) ( 7 )
--------------------------------------------------------------------------------
Level:          2
Size:           2 MB
Cache groups:       ( 0 1 ) ( 2 3 ) ( 4 5 ) ( 6 7 )
--------------------------------------------------------------------------------
Level:          3
Size:           4 MB
Cache groups:       ( 0 1 2 3 4 5 6 7 )
--------------------------------------------------------------------------------
********************************************************************************
NUMA Topology
********************************************************************************
NUMA domains:       1
--------------------------------------------------------------------------------
Domain:         0
Processors:     ( 0 1 2 3 4 5 6 7 )
Distances:      10
Free memory:        7138.31 MB
Total memory:       14898.7 MB
--------------------------------------------------------------------------------
TomTheBear commented 1 year ago

As I thought:

DEBUG - [affinity_init:549] Affinity: Cache domains 0

but it should be 1, since you seem to have a single L3 cache. The architecture is quite strange (judging from the output): there are four sockets, each with 2 cores, but all four sockets share a single L3 cache. That may well be the case, but it contradicts the current logic in LIKWID, which assumes each socket has its own L3 cache.

I'll check the cache domain detection

TomTheBear commented 1 year ago

So it seems the 4 sockets cause the problem. The cache domain detection divides the total number of caches by the socket count and casts the result to an integer; with one shared L3 cache and four sockets, this yields 0 cache domains per socket.

Hard to fix without access to such a system for testing, but I'll try to create a patch.
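
A minimal sketch of that arithmetic, with hypothetical variable names rather than the actual LIKWID code:

#include <stdio.h>

int main(void)
{
    int numCaches  = 1;   /* one shared last-level cache, as on this Xavier */
    int numSockets = 4;   /* four sockets reported by the topology code */

    int cachesPerSocket = numCaches / numSockets;   /* integer division truncates to 0 */
    printf("cache domains per socket (buggy): %d\n", cachesPerSocket);

    /* one possible guard: never report fewer than one cache domain */
    if (cachesPerSocket < 1)
        cachesPerSocket = 1;
    printf("cache domains per socket (guarded): %d\n", cachesPerSocket);
    return 0;
}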