[BUG] Segmentation Fault in likwid-bench when executing stream_mem benchmark on Epyc 9374F

fairydreaming commented 4 days ago

The stream_mem benchmark in likwid-bench always crashes with a segmentation fault right after the threads start on my EPYC 9374F:

$ likwid-bench -t stream_mem -i 128 -w M0:8GB -w M1:8GB -w M2:8GB -w M3:8GB -w M4:8GB -w M5:8GB -w M6:8GB -w M7:8GB
Warning: Sanitizing vector length to a multiple of the loop stride 4 and thread count 8 from 333333333 elements (2666666664 bytes) to 333333312 elements (2666666496 bytes)
Allocate: Process running on hwthread 0 (Domain M0) - Vector length 333333312/2666666496 Offset 0 Alignment 512
Allocate: Process running on hwthread 0 (Domain M0) - Vector length 333333312/2666666496 Offset 0 Alignment 512
Allocate: Process running on hwthread 0 (Domain M0) - Vector length 333333312/2666666496 Offset 0 Alignment 512
Initialization: First thread in domain initializes the whole stream
Warning: Sanitizing vector length to a multiple of the loop stride 4 and thread count 8 from 333333333 elements (2666666664 bytes) to 333333312 elements (2666666496 bytes)
Allocate: Process running on hwthread 4 (Domain M1) - Vector length 333333312/2666666496 Offset 0 Alignment 512
Allocate: Process running on hwthread 4 (Domain M1) - Vector length 333333312/2666666496 Offset 0 Alignment 512
Allocate: Process running on hwthread 4 (Domain M1) - Vector length 333333312/2666666496 Offset 0 Alignment 512
Initialization: First thread in domain initializes the whole stream
Warning: Sanitizing vector length to a multiple of the loop stride 4 and thread count 8 from 333333333 elements (2666666664 bytes) to 333333312 elements (2666666496 bytes)
Allocate: Process running on hwthread 8 (Domain M2) - Vector length 333333312/2666666496 Offset 0 Alignment 512
Allocate: Process running on hwthread 8 (Domain M2) - Vector length 333333312/2666666496 Offset 0 Alignment 512
Allocate: Process running on hwthread 8 (Domain M2) - Vector length 333333312/2666666496 Offset 0 Alignment 512
Initialization: First thread in domain initializes the whole stream
Warning: Sanitizing vector length to a multiple of the loop stride 4 and thread count 8 from 333333333 elements (2666666664 bytes) to 333333312 elements (2666666496 bytes)
Allocate: Process running on hwthread 12 (Domain M3) - Vector length 333333312/2666666496 Offset 0 Alignment 512
Allocate: Process running on hwthread 12 (Domain M3) - Vector length 333333312/2666666496 Offset 0 Alignment 512
Allocate: Process running on hwthread 12 (Domain M3) - Vector length 333333312/2666666496 Offset 0 Alignment 512
Initialization: First thread in domain initializes the whole stream
Warning: Sanitizing vector length to a multiple of the loop stride 4 and thread count 8 from 333333333 elements (2666666664 bytes) to 333333312 elements (2666666496 bytes)
Allocate: Process running on hwthread 16 (Domain M4) - Vector length 333333312/2666666496 Offset 0 Alignment 512
Allocate: Process running on hwthread 16 (Domain M4) - Vector length 333333312/2666666496 Offset 0 Alignment 512
Allocate: Process running on hwthread 16 (Domain M4) - Vector length 333333312/2666666496 Offset 0 Alignment 512
Initialization: First thread in domain initializes the whole stream
Warning: Sanitizing vector length to a multiple of the loop stride 4 and thread count 8 from 333333333 elements (2666666664 bytes) to 333333312 elements (2666666496 bytes)
Allocate: Process running on hwthread 20 (Domain M5) - Vector length 333333312/2666666496 Offset 0 Alignment 512
Allocate: Process running on hwthread 20 (Domain M5) - Vector length 333333312/2666666496 Offset 0 Alignment 512
Allocate: Process running on hwthread 20 (Domain M5) - Vector length 333333312/2666666496 Offset 0 Alignment 512
Initialization: First thread in domain initializes the whole stream
Warning: Sanitizing vector length to a multiple of the loop stride 4 and thread count 8 from 333333333 elements (2666666664 bytes) to 333333312 elements (2666666496 bytes)
Allocate: Process running on hwthread 24 (Domain M6) - Vector length 333333312/2666666496 Offset 0 Alignment 512
Allocate: Process running on hwthread 24 (Domain M6) - Vector length 333333312/2666666496 Offset 0 Alignment 512
Allocate: Process running on hwthread 24 (Domain M6) - Vector length 333333312/2666666496 Offset 0 Alignment 512
Initialization: First thread in domain initializes the whole stream
Warning: Sanitizing vector length to a multiple of the loop stride 4 and thread count 8 from 333333333 elements (2666666664 bytes) to 333333312 elements (2666666496 bytes)
Allocate: Process running on hwthread 28 (Domain M7) - Vector length 333333312/2666666496 Offset 0 Alignment 512
Allocate: Process running on hwthread 28 (Domain M7) - Vector length 333333312/2666666496 Offset 0 Alignment 512
Allocate: Process running on hwthread 28 (Domain M7) - Vector length 333333312/2666666496 Offset 0 Alignment 512
Initialization: First thread in domain initializes the whole stream
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: stream_mem
--------------------------------------------------------------------------------
Using 8 work groups
Using 64 threads
--------------------------------------------------------------------------------
Running without Marker API. Activate Marker API with -m on commandline.
--------------------------------------------------------------------------------
Group: 0 Thread 0 Global Thread 0 running on hwthread 0 - Vector length 41666664 Offset 0
Group: 0 Thread 1 Global Thread 1 running on hwthread 32 - Vector length 41666664 Offset 41666664
Group: 0 Thread 2 Global Thread 2 running on hwthread 1 - Vector length 41666664 Offset 83333328
Group: 0 Thread 3 Global Thread 3 running on hwthread 33 - Vector length 41666664 Offset 124999992
Group: 0 Thread 4 Global Thread 4 running on hwthread 2 - Vector length 41666664 Offset 166666656
Group: 0 Thread 5 Global Thread 5 running on hwthread 34 - Vector length 41666664 Offset 208333320
Group: 0 Thread 7 Global Thread 7 running on hwthread 35 - Vector length 41666664 Offset 291666648
Group: 0 Thread 6 Global Thread 6 running on hwthread 3 - Vector length 41666664 Offset 249999984
Group: 1 Thread 1 Global Thread 9 running on hwthread 36 - Vector length 41666664 Offset 41666664
Group: 1 Thread 0 Global Thread 8 running on hwthread 4 - Vector length 41666664 Offset 0
Group: 1 Thread 2 Global Thread 10 running on hwthread 5 - Vector length 41666664 Offset 83333328
Group: 1 Thread 3 Global Thread 11 running on hwthread 37 - Vector length 41666664 Offset 124999992
Group: 1 Thread 5 Global Thread 13 running on hwthread 38 - Vector length 41666664 Offset 208333320
Group: 1 Thread 4 Global Thread 12 running on hwthread 6 - Vector length 41666664 Offset 166666656
Group: 1 Thread 6 Global Thread 14 running on hwthread 7 - Vector length 41666664 Offset 249999984
Group: 1 Thread 7 Global Thread 15 running on hwthread 39 - Vector length 41666664 Offset 291666648
Group: 2 Thread 0 Global Thread 16 running on hwthread 8 - Vector length 41666664 Offset 0
Group: 2 Thread 4 Global Thread 20 running on hwthread 10 - Vector length 41666664 Offset 166666656
Group: 2 Thread 5 Global Thread 21 running on hwthread 42 - Vector length 41666664 Offset 208333320
Group: 2 Thread 3 Global Thread 19 running on hwthread 41 - Vector length 41666664 Offset 124999992
Group: 2 Thread 7 Global Thread 23 running on hwthread 43 - Vector length 41666664 Offset 291666648
Group: 2 Thread 1 Global Thread 17 running on hwthread 40 - Vector length 41666664 Offset 41666664
Group: 2 Thread 6 Global Thread 22 running on hwthread 11 - Vector length 41666664 Offset 249999984
Group: 3 Thread 2 Global Thread 26 running on hwthread 13 - Vector length 41666664 Offset 83333328
Group: 3 Thread 3 Global Thread 27 running on hwthread 45 - Vector length 41666664 Offset 124999992
Group: 3 Thread 0 Global Thread 24 running on hwthread 12 - Vector length 41666664 Offset 0
Group: 2 Thread 2 Global Thread 18 running on hwthread 9 - Vector length 41666664 Offset 83333328
Group: 3 Thread 1 Global Thread 25 running on hwthread 44 - Vector length 41666664 Offset 41666664
Group: 3 Thread 4 Global Thread 28 running on hwthread 14 - Vector length 41666664 Offset 166666656
Group: 3 Thread 5 Global Thread 29 running on hwthread 46 - Vector length 41666664 Offset 208333320
Group: 3 Thread 6 Global Thread 30 running on hwthread 15 - Vector length 41666664 Offset 249999984
Group: 3 Thread 7 Global Thread 31 running on hwthread 47 - Vector length 41666664 Offset 291666648
Group: 4 Thread 1 Global Thread 33 running on hwthread 48 - Vector length 41666664 Offset 41666664
Group: 4 Thread 5 Global Thread 37 running on hwthread 50 - Vector length 41666664 Offset 208333320
Group: 4 Thread 3 Global Thread 35 running on hwthread 49 - Vector length 41666664 Offset 124999992
Group: 4 Thread 2 Global Thread 34 running on hwthread 17 - Vector length 41666664 Offset 83333328
Group: 4 Thread 4 Global Thread 36 running on hwthread 18 - Vector length 41666664 Offset 166666656
Group: 4 Thread 0 Global Thread 32 running on hwthread 16 - Vector length 41666664 Offset 0
Group: 4 Thread 6 Global Thread 38 running on hwthread 19 - Vector length 41666664 Offset 249999984
Group: 5 Thread 1 Global Thread 41 running on hwthread 52 - Vector length 41666664 Offset 41666664
Group: 5 Thread 2 Global Thread 42 running on hwthread 21 - Vector length 41666664 Offset 83333328
Group: 4 Thread 7 Global Thread 39 running on hwthread 51 - Vector length 41666664 Offset 291666648
Group: 5 Thread 4 Global Thread 44 running on hwthread 22 - Vector length 41666664 Offset 166666656
Group: 5 Thread 5 Global Thread 45 running on hwthread 54 - Vector length 41666664 Offset 208333320
Group: 5 Thread 0 Global Thread 40 running on hwthread 20 - Vector length 41666664 Offset 0
Group: 5 Thread 6 Global Thread 46 running on hwthread 23 - Vector length 41666664 Offset 249999984
Group: 5 Thread 3 Global Thread 43 running on hwthread 53 - Vector length 41666664 Offset 124999992
Group: 5 Thread 7 Global Thread 47 running on hwthread 55 - Vector length 41666664 Offset 291666648
Group: 6 Thread 0 Global Thread 48 running on hwthread 24 - Vector length 41666664 Offset 0
Group: 6 Thread 1 Global Thread 49 running on hwthread 56 - Vector length 41666664 Offset 41666664
Group: 6 Thread 3 Global Thread 51 running on hwthread 57 - Vector length 41666664 Offset 124999992
Group: 6 Thread 2 Global Thread 50 running on hwthread 25 - Vector length 41666664 Offset 83333328
Group: 6 Thread 4 Global Thread 52 running on hwthread 26 - Vector length 41666664 Offset 166666656
Group: 6 Thread 6 Global Thread 54 running on hwthread 27 - Vector length 41666664 Offset 249999984
Group: 6 Thread 5 Global Thread 53 running on hwthread 58 - Vector length 41666664 Offset 208333320
Group: 6 Thread 7 Global Thread 55 running on hwthread 59 - Vector length 41666664 Offset 291666648
Group: 7 Thread 2 Global Thread 58 running on hwthread 29 - Vector length 41666664 Offset 83333328
Group: 7 Thread 1 Global Thread 57 running on hwthread 60 - Vector length 41666664 Offset 41666664
Group: 7 Thread 3 Global Thread 59 running on hwthread 61 - Vector length 41666664 Offset 124999992
Group: 7 Thread 5 Global Thread 61 running on hwthread 62 - Vector length 41666664 Offset 208333320
Group: 7 Thread 6 Global Thread 62 running on hwthread 31 - Vector length 41666664 Offset 249999984
Group: 7 Thread 0 Global Thread 56 running on hwthread 28 - Vector length 41666664 Offset 0
Group: 7 Thread 7 Global Thread 63 running on hwthread 63 - Vector length 41666664 Offset 291666648
Group: 7 Thread 4 Global Thread 60 running on hwthread 30 - Vector length 41666664 Offset 166666656
Segmentation fault
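
The lengths and offsets in the log above are consistent with a simple round-down followed by an even split across threads. The following is a minimal sketch of that arithmetic, not LIKWID's actual code: the helper names are hypothetical, the rounding rule is inferred from the warning message, and it assumes that 8GB means 8e9 bytes divided across three vectors of 8-byte elements (inferred from the three Allocate lines per domain).

```python
def sanitize_length(elements, stride, threads):
    """Round down to a multiple of stride * threads (rule inferred from the warning)."""
    chunk = stride * threads
    return (elements // chunk) * chunk


def thread_offsets(elements, threads):
    """Split a vector evenly; return (thread, length, offset) per thread."""
    per_thread = elements // threads
    return [(t, per_thread, t * per_thread) for t in range(threads)]


# Assumption: 8GB = 8e9 bytes, three vectors, 8 bytes per element.
requested = 8 * 10**9 // 3 // 8
sanitized = sanitize_length(requested, stride=4, threads=8)

print(f"{requested} elements ({requested * 8} bytes) -> "
      f"{sanitized} elements ({sanitized * 8} bytes)")
for t, length, offset in thread_offsets(sanitized, threads=8):
    print(f"Thread {t}: Vector length {length} Offset {offset}")
```

Under these assumptions the sketch reproduces the values printed before the crash (333333312 elements per vector, 41666664 elements per thread), so the reported partitioning at least looks internally consistent.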

The crash happens in the stream_mem() function:

Thread 61 "likwid-bench" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ff09aa006c0 (LWP 5015)]
0x0000555555560683 in stream_mem ()
(gdb) bt
#0  0x0000555555560683 in stream_mem ()
#1  0x0000555555563dbb in runTest (arg=0x555555602ca8) at ./src/bench.c:189
#2  0x00007ffff669ca94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#3  0x00007ffff6729c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

I also tried some other benchmarks (stream, stream_avx512 and stream_mem_avx512); they all run without crashing.

Note that I have the "NUMA per socket" BIOS option set to NPS4 and the "ACPI SRAT L3 Cache as NUMA Domain" option enabled, so there are 8 NUMA domains in my system overall.
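
For anyone reproducing this with a different number of NUMA domains, the workgroup arguments can be generated rather than typed out. This is only a convenience sketch: `build_bench_command` is a hypothetical helper, and the only likwid-bench options it uses are the ones already shown in the command at the top of this report (-t, -i, -w).

```python
import shlex


def build_bench_command(test, iterations, num_domains, size):
    """Build a likwid-bench command with one workgroup per memory domain."""
    cmd = ["likwid-bench", "-t", test, "-i", str(iterations)]
    for domain in range(num_domains):
        cmd += ["-w", f"M{domain}:{size}"]
    return cmd


cmd = build_bench_command("stream_mem", iterations=128, num_domains=8, size="8GB")
print(shlex.join(cmd))
```

With 8 domains this prints the same command line that triggers the crash here; changing `num_domains` adapts it to other NPS settings.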

I am using likwid v5.4.0, compiled from the GitHub release source code. My operating system is Ubuntu 24.04.1 LTS. The gcc version is gcc (Ubuntu 13.2.0-23ubuntu4) 13.2.0.

Output of likwid-bench -p:

$ likwid-bench -p
Number of Domains 19
Domain 0:
    Tag N: 0 32 1 33 2 34 3 35 4 36 5 37 6 38 7 39 8 40 9 41 10 42 11 43 12 44 13 45 14 46 15 47 16 48 17 49 18 50 19 51 20 52 21 53 22 54 23 55 24 56 25 57 26 58 27 59 28 60 29 61 30 62 31 63
Domain 1:
    Tag S0: 0 32 1 33 2 34 3 35 4 36 5 37 6 38 7 39 8 40 9 41 10 42 11 43 12 44 13 45 14 46 15 47 16 48 17 49 18 50 19 51 20 52 21 53 22 54 23 55 24 56 25 57 26 58 27 59 28 60 29 61 30 62 31 63
Domain 2:
    Tag D0: 0 32 1 33 2 34 3 35 4 36 5 37 6 38 7 39 8 40 9 41 10 42 11 43 12 44 13 45 14 46 15 47 16 48 17 49 18 50 19 51 20 52 21 53 22 54 23 55 24 56 25 57 26 58 27 59 28 60 29 61 30 62 31 63
Domain 3:
    Tag C0: 0 32 1 33 2 34 3 35
Domain 4:
    Tag C1: 4 36 5 37 6 38 7 39
Domain 5:
    Tag C2: 8 40 9 41 10 42 11 43
Domain 6:
    Tag C3: 12 44 13 45 14 46 15 47
Domain 7:
    Tag C4: 16 48 17 49 18 50 19 51
Domain 8:
    Tag C5: 20 52 21 53 22 54 23 55
Domain 9:
    Tag C6: 24 56 25 57 26 58 27 59
Domain 10:
    Tag C7: 28 60 29 61 30 62 31 63
Domain 11:
    Tag M0: 0 32 1 33 2 34 3 35
Domain 12:
    Tag M1: 4 36 5 37 6 38 7 39
Domain 13:
    Tag M2: 8 40 9 41 10 42 11 43
Domain 14:
    Tag M3: 12 44 13 45 14 46 15 47
Domain 15:
    Tag M4: 16 48 17 49 18 50 19 51
Domain 16:
    Tag M5: 20 52 21 53 22 54 23 55
Domain 17:
    Tag M6: 24 56 25 57 26 58 27 59
Domain 18:
    Tag M7: 28 60 29 61 30 62 31 63
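
As a quick cross-check of the domain list above, the tags can be tallied by type. This is a rough sketch, assuming the "Tag <name>:" line format shown in the output; `count_domains` is a hypothetical helper, and the sample text is rebuilt from the tag names only, with the hwthread lists left out for brevity.

```python
import re
from collections import Counter


def count_domains(output):
    """Count affinity domains per type (N, S, D, C, M) from 'Tag <name>:' lines."""
    tags = re.findall(r"^\s*Tag ([A-Z]+)\d*:", output, flags=re.MULTILINE)
    return Counter(tags)


# Tag names as listed by likwid-bench -p in this report.
tag_names = (["N", "S0", "D0"]
             + [f"C{i}" for i in range(8)]
             + [f"M{i}" for i in range(8)])
sample = "\n".join(f"Domain {i}:\n    Tag {tag}:" for i, tag in enumerate(tag_names))

counts = count_domains(sample)
print(dict(counts), "total:", sum(counts.values()))
```

The tally gives 1 node, 1 socket, 1 die, 8 cache and 8 memory domains, which adds up to the 19 domains reported.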

Output of likwid-topology -V 3:

$ likwid-topology -V 3
DEBUG - [hwloc_init_cpuInfo:361] HWLOC CpuInfo Family 25 Model 17 Stepping 1 Vendor 0x0 Part 0x0 isIntel 0 numHWThreads 64 activeHWThreads 64
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 0 Thread 0 Core 0 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 32 Thread 1 Core 0 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 1 Thread 0 Core 1 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 33 Thread 1 Core 1 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 2 Thread 0 Core 2 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 34 Thread 1 Core 2 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 3 Thread 0 Core 3 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 35 Thread 1 Core 3 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 4 Thread 0 Core 4 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 36 Thread 1 Core 4 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 5 Thread 0 Core 5 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 37 Thread 1 Core 5 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 6 Thread 0 Core 6 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 38 Thread 1 Core 6 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 7 Thread 0 Core 7 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 39 Thread 1 Core 7 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 8 Thread 0 Core 8 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 40 Thread 1 Core 8 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 9 Thread 0 Core 9 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 41 Thread 1 Core 9 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 10 Thread 0 Core 10 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 42 Thread 1 Core 10 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 11 Thread 0 Core 11 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 43 Thread 1 Core 11 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 12 Thread 0 Core 12 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 44 Thread 1 Core 12 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 13 Thread 0 Core 13 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 45 Thread 1 Core 13 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 14 Thread 0 Core 14 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 46 Thread 1 Core 14 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 15 Thread 0 Core 15 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 47 Thread 1 Core 15 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 16 Thread 0 Core 16 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 48 Thread 1 Core 16 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 17 Thread 0 Core 17 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 49 Thread 1 Core 17 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 18 Thread 0 Core 18 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 50 Thread 1 Core 18 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 19 Thread 0 Core 19 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 51 Thread 1 Core 19 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 20 Thread 0 Core 20 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 52 Thread 1 Core 20 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 21 Thread 0 Core 21 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 53 Thread 1 Core 21 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 22 Thread 0 Core 22 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 54 Thread 1 Core 22 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 23 Thread 0 Core 23 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 55 Thread 1 Core 23 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 24 Thread 0 Core 24 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 56 Thread 1 Core 24 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 25 Thread 0 Core 25 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 57 Thread 1 Core 25 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 26 Thread 0 Core 26 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 58 Thread 1 Core 26 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 27 Thread 0 Core 27 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 59 Thread 1 Core 27 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 28 Thread 0 Core 28 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 60 Thread 1 Core 28 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 29 Thread 0 Core 29 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 61 Thread 1 Core 29 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 30 Thread 0 Core 30 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 62 Thread 1 Core 30 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 31 Thread 0 Core 31 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:582] HWLOC Thread Pool PU 63 Thread 1 Core 31 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_cacheTopology:812] HWLOC Cache Pool ID 0 Level 1 Size 32768 Threads 2
DEBUG - [hwloc_init_cacheTopology:812] HWLOC Cache Pool ID 1 Level 2 Size 1048576 Threads 2
DEBUG - [hwloc_init_cacheTopology:812] HWLOC Cache Pool ID 2 Level 3 Size 33554432 Threads 8
DEBUG - [topology_init:1719] Setting up tree
DEBUG - [topology_setupTree:1475] Adding socket 0
DEBUG - [topology_setupTree:1489] Adding core 0 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 0 at core 0 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 1 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 1 at core 1 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 2 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 2 at core 2 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 3 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 3 at core 3 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 4 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 4 at core 4 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 5 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 5 at core 5 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 6 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 6 at core 6 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 7 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 7 at core 7 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 8 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 8 at core 8 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 9 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 9 at core 9 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 10 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 10 at core 10 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 11 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 11 at core 11 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 12 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 12 at core 12 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 13 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 13 at core 13 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 14 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 14 at core 14 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 15 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 15 at core 15 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 16 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 16 at core 16 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 17 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 17 at core 17 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 18 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 18 at core 18 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 19 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 19 at core 19 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 20 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 20 at core 20 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 21 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 21 at core 21 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 22 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 22 at core 22 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 23 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 23 at core 23 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 24 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 24 at core 24 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 25 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 25 at core 25 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 26 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 26 at core 26 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 27 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 27 at core 27 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 28 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 28 at core 28 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 29 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 29 at core 29 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 30 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 30 at core 30 on socket 0
DEBUG - [topology_setupTree:1489] Adding core 31 to socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 31 at core 31 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 32 at core 0 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 33 at core 1 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 34 at core 2 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 35 at core 3 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 36 at core 4 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 37 at core 5 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 38 at core 6 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 39 at core 7 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 40 at core 8 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 41 at core 9 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 42 at core 10 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 43 at core 11 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 44 at core 12 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 45 at core 13 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 46 at core 14 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 47 at core 15 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 48 at core 16 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 49 at core 17 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 50 at core 18 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 51 at core 19 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 52 at core 20 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 53 at core 21 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 54 at core 22 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 55 at core 23 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 56 at core 24 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 57 at core 25 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 58 at core 26 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 59 at core 27 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 60 at core 28 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 61 at core 29 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 62 at core 30 on socket 0
DEBUG - [topology_setupTree:1498] Adding hwthread 63 at core 31 on socket 0
DEBUG - [topology_setupTree:1504] Determine number of sockets. tree tells 1
DEBUG - [topology_setupTree:1509] Determine number of cores per socket. tree tells 32
DEBUG - [topology_setupTree:1514] Determine number of hwthreads per cores. tree tells 2
DEBUG - [affinity_init:863] Affinity: Socket domains 1
DEBUG - [affinity_init:865] Affinity: CPU die domains 1
DEBUG - [affinity_init:870] Affinity: CPU cores per LLC 4
DEBUG - [affinity_init:873] Affinity: Cache domains 8
DEBUG - [affinity_init:877] Affinity: NUMA domains 8
DEBUG - [affinity_init:908] Affinity: All domains 19
DEBUG - [affinity_addNodeDomain:548] Affinity domain N: 64 HW threads on 32 cores
DEBUG - [affinity_addSocketDomain:585] Affinity domain S0: 64 HW threads on 32 cores
DEBUG - [affinity_addDieDomain:628] Affinity domain D0: 64 HW threads on 32 cores
DEBUG - [affinity_addCacheDomain:670] Affinity domain C0: 8 HW threads on 4 cores
DEBUG - [affinity_addCacheDomain:670] Affinity domain C1: 8 HW threads on 4 cores
DEBUG - [affinity_addCacheDomain:670] Affinity domain C2: 8 HW threads on 4 cores
DEBUG - [affinity_addCacheDomain:670] Affinity domain C3: 8 HW threads on 4 cores
DEBUG - [affinity_addCacheDomain:670] Affinity domain C4: 8 HW threads on 4 cores
DEBUG - [affinity_addCacheDomain:670] Affinity domain C5: 8 HW threads on 4 cores
DEBUG - [affinity_addCacheDomain:670] Affinity domain C6: 8 HW threads on 4 cores
DEBUG - [affinity_addCacheDomain:670] Affinity domain C7: 8 HW threads on 4 cores
DEBUG - [affinity_addMemoryDomain:724] Affinity domain M0: 8 HW threads on 4 cores
DEBUG - [affinity_addMemoryDomain:724] Affinity domain M1: 8 HW threads on 4 cores
DEBUG - [affinity_addMemoryDomain:724] Affinity domain M2: 8 HW threads on 4 cores
DEBUG - [affinity_addMemoryDomain:724] Affinity domain M3: 8 HW threads on 4 cores
DEBUG - [affinity_addMemoryDomain:724] Affinity domain M4: 8 HW threads on 4 cores
DEBUG - [affinity_addMemoryDomain:724] Affinity domain M5: 8 HW threads on 4 cores
DEBUG - [affinity_addMemoryDomain:724] Affinity domain M6: 8 HW threads on 4 cores
DEBUG - [affinity_addMemoryDomain:724] Affinity domain M7: 8 HW threads on 4 cores
DEBUG - [create_lookups:461] T 0 T2C 0 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:461] T 1 T2C 1 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:461] T 2 T2C 2 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:461] T 3 T2C 3 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:461] T 4 T2C 4 T2S 0 T2D 0 T2LLC 1 T2M 1
DEBUG - [create_lookups:461] T 5 T2C 5 T2S 0 T2D 0 T2LLC 1 T2M 1
DEBUG - [create_lookups:461] T 6 T2C 6 T2S 0 T2D 0 T2LLC 1 T2M 1
DEBUG - [create_lookups:461] T 7 T2C 7 T2S 0 T2D 0 T2LLC 1 T2M 1
DEBUG - [create_lookups:461] T 8 T2C 8 T2S 0 T2D 0 T2LLC 2 T2M 2
DEBUG - [create_lookups:461] T 9 T2C 9 T2S 0 T2D 0 T2LLC 2 T2M 2
DEBUG - [create_lookups:461] T 10 T2C 10 T2S 0 T2D 0 T2LLC 2 T2M 2
DEBUG - [create_lookups:461] T 11 T2C 11 T2S 0 T2D 0 T2LLC 2 T2M 2
DEBUG - [create_lookups:461] T 12 T2C 12 T2S 0 T2D 0 T2LLC 3 T2M 3
DEBUG - [create_lookups:461] T 13 T2C 13 T2S 0 T2D 0 T2LLC 3 T2M 3
DEBUG - [create_lookups:461] T 14 T2C 14 T2S 0 T2D 0 T2LLC 3 T2M 3
DEBUG - [create_lookups:461] T 15 T2C 15 T2S 0 T2D 0 T2LLC 3 T2M 3
DEBUG - [create_lookups:461] T 16 T2C 16 T2S 0 T2D 0 T2LLC 4 T2M 4
DEBUG - [create_lookups:461] T 17 T2C 17 T2S 0 T2D 0 T2LLC 4 T2M 4
DEBUG - [create_lookups:461] T 18 T2C 18 T2S 0 T2D 0 T2LLC 4 T2M 4
DEBUG - [create_lookups:461] T 19 T2C 19 T2S 0 T2D 0 T2LLC 4 T2M 4
DEBUG - [create_lookups:461] T 20 T2C 20 T2S 0 T2D 0 T2LLC 5 T2M 5
DEBUG - [create_lookups:461] T 21 T2C 21 T2S 0 T2D 0 T2LLC 5 T2M 5
DEBUG - [create_lookups:461] T 22 T2C 22 T2S 0 T2D 0 T2LLC 5 T2M 5
DEBUG - [create_lookups:461] T 23 T2C 23 T2S 0 T2D 0 T2LLC 5 T2M 5
DEBUG - [create_lookups:461] T 24 T2C 24 T2S 0 T2D 0 T2LLC 6 T2M 6
DEBUG - [create_lookups:461] T 25 T2C 25 T2S 0 T2D 0 T2LLC 6 T2M 6
DEBUG - [create_lookups:461] T 26 T2C 26 T2S 0 T2D 0 T2LLC 6 T2M 6
DEBUG - [create_lookups:461] T 27 T2C 27 T2S 0 T2D 0 T2LLC 6 T2M 6
DEBUG - [create_lookups:461] T 28 T2C 28 T2S 0 T2D 0 T2LLC 7 T2M 7
DEBUG - [create_lookups:461] T 29 T2C 29 T2S 0 T2D 0 T2LLC 7 T2M 7
DEBUG - [create_lookups:461] T 30 T2C 30 T2S 0 T2D 0 T2LLC 7 T2M 7
DEBUG - [create_lookups:461] T 31 T2C 31 T2S 0 T2D 0 T2LLC 7 T2M 7
DEBUG - [create_lookups:461] T 32 T2C 0 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:461] T 33 T2C 1 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:461] T 34 T2C 2 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:461] T 35 T2C 3 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:461] T 36 T2C 4 T2S 0 T2D 0 T2LLC 1 T2M 1
DEBUG - [create_lookups:461] T 37 T2C 5 T2S 0 T2D 0 T2LLC 1 T2M 1
DEBUG - [create_lookups:461] T 38 T2C 6 T2S 0 T2D 0 T2LLC 1 T2M 1
DEBUG - [create_lookups:461] T 39 T2C 7 T2S 0 T2D 0 T2LLC 1 T2M 1
DEBUG - [create_lookups:461] T 40 T2C 8 T2S 0 T2D 0 T2LLC 2 T2M 2
DEBUG - [create_lookups:461] T 41 T2C 9 T2S 0 T2D 0 T2LLC 2 T2M 2
DEBUG - [create_lookups:461] T 42 T2C 10 T2S 0 T2D 0 T2LLC 2 T2M 2
DEBUG - [create_lookups:461] T 43 T2C 11 T2S 0 T2D 0 T2LLC 2 T2M 2
DEBUG - [create_lookups:461] T 44 T2C 12 T2S 0 T2D 0 T2LLC 3 T2M 3
DEBUG - [create_lookups:461] T 45 T2C 13 T2S 0 T2D 0 T2LLC 3 T2M 3
DEBUG - [create_lookups:461] T 46 T2C 14 T2S 0 T2D 0 T2LLC 3 T2M 3
DEBUG - [create_lookups:461] T 47 T2C 15 T2S 0 T2D 0 T2LLC 3 T2M 3
DEBUG - [create_lookups:461] T 48 T2C 16 T2S 0 T2D 0 T2LLC 4 T2M 4
DEBUG - [create_lookups:461] T 49 T2C 17 T2S 0 T2D 0 T2LLC 4 T2M 4
DEBUG - [create_lookups:461] T 50 T2C 18 T2S 0 T2D 0 T2LLC 4 T2M 4
DEBUG - [create_lookups:461] T 51 T2C 19 T2S 0 T2D 0 T2LLC 4 T2M 4
DEBUG - [create_lookups:461] T 52 T2C 20 T2S 0 T2D 0 T2LLC 5 T2M 5
DEBUG - [create_lookups:461] T 53 T2C 21 T2S 0 T2D 0 T2LLC 5 T2M 5
DEBUG - [create_lookups:461] T 54 T2C 22 T2S 0 T2D 0 T2LLC 5 T2M 5
DEBUG - [create_lookups:461] T 55 T2C 23 T2S 0 T2D 0 T2LLC 5 T2M 5
DEBUG - [create_lookups:461] T 56 T2C 24 T2S 0 T2D 0 T2LLC 6 T2M 6
DEBUG - [create_lookups:461] T 57 T2C 25 T2S 0 T2D 0 T2LLC 6 T2M 6
DEBUG - [create_lookups:461] T 58 T2C 26 T2S 0 T2D 0 T2LLC 6 T2M 6
DEBUG - [create_lookups:461] T 59 T2C 27 T2S 0 T2D 0 T2LLC 6 T2M 6
DEBUG - [create_lookups:461] T 60 T2C 28 T2S 0 T2D 0 T2LLC 7 T2M 7
DEBUG - [create_lookups:461] T 61 T2C 29 T2S 0 T2D 0 T2LLC 7 T2M 7
DEBUG - [create_lookups:461] T 62 T2C 30 T2S 0 T2D 0 T2LLC 7 T2M 7
DEBUG - [create_lookups:461] T 63 T2C 31 T2S 0 T2D 0 T2LLC 7 T2M 7
--------------------------------------------------------------------------------
CPU name:   AMD EPYC 9374F 32-Core Processor               
CPU type:   AMD K19 (Zen4) architecture
CPU stepping:   1
********************************************************************************
Hardware Thread Topology
********************************************************************************
Sockets:        1
CPU dies:       1
Cores per socket:   32
Threads per core:   2
--------------------------------------------------------------------------------
HWThread        Thread        Core        Die        Socket        Available
0               0             0           0          0             *                
1               0             1           0          0             *                
2               0             2           0          0             *                
3               0             3           0          0             *                
4               0             4           0          0             *                
5               0             5           0          0             *                
6               0             6           0          0             *                
7               0             7           0          0             *                
8               0             8           0          0             *                
9               0             9           0          0             *                
10              0             10          0          0             *                
11              0             11          0          0             *                
12              0             12          0          0             *                
13              0             13          0          0             *                
14              0             14          0          0             *                
15              0             15          0          0             *                
16              0             16          0          0             *                
17              0             17          0          0             *                
18              0             18          0          0             *                
19              0             19          0          0             *                
20              0             20          0          0             *                
21              0             21          0          0             *                
22              0             22          0          0             *                
23              0             23          0          0             *                
24              0             24          0          0             *                
25              0             25          0          0             *                
26              0             26          0          0             *                
27              0             27          0          0             *                
28              0             28          0          0             *                
29              0             29          0          0             *                
30              0             30          0          0             *                
31              0             31          0          0             *                
32              1             0           0          0             *                
33              1             1           0          0             *                
34              1             2           0          0             *                
35              1             3           0          0             *                
36              1             4           0          0             *                
37              1             5           0          0             *                
38              1             6           0          0             *                
39              1             7           0          0             *                
40              1             8           0          0             *                
41              1             9           0          0             *                
42              1             10          0          0             *                
43              1             11          0          0             *                
44              1             12          0          0             *                
45              1             13          0          0             *                
46              1             14          0          0             *                
47              1             15          0          0             *                
48              1             16          0          0             *                
49              1             17          0          0             *                
50              1             18          0          0             *                
51              1             19          0          0             *                
52              1             20          0          0             *                
53              1             21          0          0             *                
54              1             22          0          0             *                
55              1             23          0          0             *                
56              1             24          0          0             *                
57              1             25          0          0             *                
58              1             26          0          0             *                
59              1             27          0          0             *                
60              1             28          0          0             *                
61              1             29          0          0             *                
62              1             30          0          0             *                
63              1             31          0          0             *                
--------------------------------------------------------------------------------
Socket 0:       ( 0 32 1 33 2 34 3 35 4 36 5 37 6 38 7 39 8 40 9 41 10 42 11 43 12 44 13 45 14 46 15 47 16 48 17 49 18 50 19 51 20 52 21 53 22 54 23 55 24 56 25 57 26 58 27 59 28 60 29 61 30 62 31 63 )
--------------------------------------------------------------------------------
********************************************************************************
Cache Topology
********************************************************************************
Level:          1
Size:           32 kB
Cache groups:       ( 0 32 ) ( 1 33 ) ( 2 34 ) ( 3 35 ) ( 4 36 ) ( 5 37 ) ( 6 38 ) ( 7 39 ) ( 8 40 ) ( 9 41 ) ( 10 42 ) ( 11 43 ) ( 12 44 ) ( 13 45 ) ( 14 46 ) ( 15 47 ) ( 16 48 ) ( 17 49 ) ( 18 50 ) ( 19 51 ) ( 20 52 ) ( 21 53 ) ( 22 54 ) ( 23 55 ) ( 24 56 ) ( 25 57 ) ( 26 58 ) ( 27 59 ) ( 28 60 ) ( 29 61 ) ( 30 62 ) ( 31 63 )
--------------------------------------------------------------------------------
Level:          2
Size:           1 MB
Cache groups:       ( 0 32 ) ( 1 33 ) ( 2 34 ) ( 3 35 ) ( 4 36 ) ( 5 37 ) ( 6 38 ) ( 7 39 ) ( 8 40 ) ( 9 41 ) ( 10 42 ) ( 11 43 ) ( 12 44 ) ( 13 45 ) ( 14 46 ) ( 15 47 ) ( 16 48 ) ( 17 49 ) ( 18 50 ) ( 19 51 ) ( 20 52 ) ( 21 53 ) ( 22 54 ) ( 23 55 ) ( 24 56 ) ( 25 57 ) ( 26 58 ) ( 27 59 ) ( 28 60 ) ( 29 61 ) ( 30 62 ) ( 31 63 )
--------------------------------------------------------------------------------
Level:          3
Size:           32 MB
Cache groups:       ( 0 32 1 33 2 34 3 35 ) ( 4 36 5 37 6 38 7 39 ) ( 8 40 9 41 10 42 11 43 ) ( 12 44 13 45 14 46 15 47 ) ( 16 48 17 49 18 50 19 51 ) ( 20 52 21 53 22 54 23 55 ) ( 24 56 25 57 26 58 27 59 ) ( 28 60 29 61 30 62 31 63 )
--------------------------------------------------------------------------------
********************************************************************************
NUMA Topology
********************************************************************************
NUMA domains:       8
--------------------------------------------------------------------------------
Domain:         0
Processors:     ( 0 32 1 33 2 34 3 35 )
Distances:      10 11 12 12 12 12 12 12
Free memory:        47501.7 MB
Total memory:       48051.2 MB
--------------------------------------------------------------------------------
Domain:         1
Processors:     ( 4 36 5 37 6 38 7 39 )
Distances:      11 10 12 12 12 12 12 12
Free memory:        48001.7 MB
Total memory:       48381.1 MB
--------------------------------------------------------------------------------
Domain:         2
Processors:     ( 8 40 9 41 10 42 11 43 )
Distances:      12 12 10 11 12 12 12 12
Free memory:        47796.1 MB
Total memory:       48381.1 MB
--------------------------------------------------------------------------------
Domain:         3
Processors:     ( 12 44 13 45 14 46 15 47 )
Distances:      12 12 11 10 12 12 12 12
Free memory:        48184.1 MB
Total memory:       48381.1 MB
--------------------------------------------------------------------------------
Domain:         4
Processors:     ( 16 48 17 49 18 50 19 51 )
Distances:      12 12 12 12 10 11 12 12
Free memory:        48147.4 MB
Total memory:       48381.1 MB
--------------------------------------------------------------------------------
Domain:         5
Processors:     ( 20 52 21 53 22 54 23 55 )
Distances:      12 12 12 12 11 10 12 12
Free memory:        47994 MB
Total memory:       48338 MB
--------------------------------------------------------------------------------
Domain:         6
Processors:     ( 24 56 25 57 26 58 27 59 )
Distances:      12 12 12 12 12 12 10 11
Free memory:        47954.4 MB
Total memory:       48381.1 MB
--------------------------------------------------------------------------------
Domain:         7
Processors:     ( 28 60 29 61 30 62 31 63 )
Distances:      12 12 12 12 12 12 11 10
Free memory:        48146.8 MB
Total memory:       48338.3 MB
--------------------------------------------------------------------------------

Let me know if you need any other information.

fairydreaming commented 4 days ago

I tried disassembling the crashing function:

Thread 9 "likwid-bench" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ff1056006c0 (LWP 5251)]
0x0000555555560683 in stream_mem ()
(gdb) disass
Dump of assembler code for function stream_mem:
   0x0000555555560600 <+0>: push   %rbp
   0x0000555555560601 <+1>: mov    %rsp,%rbp
   0x0000555555560604 <+4>: push   %rbx
   0x0000555555560605 <+5>: push   %r12
   0x0000555555560607 <+7>: push   %r13
   0x0000555555560609 <+9>: push   %r14
   0x000055555556060b <+11>:    push   %r15
   0x000055555556060d <+13>:    movsd  0x254ab(%rip),%xmm4        # 0x555555585ac0
   0x0000555555560615 <+21>:    xor    %rax,%rax
   0x0000555555560618 <+24>:    data16 cs nopw 0x0(%rax,%rax,1)
   0x0000555555560623 <+35>:    data16 cs nopw 0x0(%rax,%rax,1)
   0x000055555556062e <+46>:    data16 cs nopw 0x0(%rax,%rax,1)
   0x0000555555560639 <+57>:    nopl   0x0(%rax)
   0x0000555555560640 <+64>:    movsd  (%rdx,%rax,8),%xmm0
   0x0000555555560645 <+69>:    movsd  0x8(%rdx,%rax,8),%xmm1
   0x000055555556064b <+75>:    movsd  0x10(%rdx,%rax,8),%xmm2
   0x0000555555560651 <+81>:    movsd  0x18(%rdx,%rax,8),%xmm3
   0x0000555555560657 <+87>:    mulsd  %xmm4,%xmm0
   0x000055555556065b <+91>:    addsd  (%rcx,%rax,8),%xmm0
   0x0000555555560660 <+96>:    mulsd  %xmm4,%xmm1
   0x0000555555560664 <+100>:   addsd  0x8(%rcx,%rax,8),%xmm1
   0x000055555556066a <+106>:   mulsd  %xmm4,%xmm2
   0x000055555556066e <+110>:   addsd  0x10(%rcx,%rax,8),%xmm2
   0x0000555555560674 <+116>:   mulsd  %xmm4,%xmm3
   0x0000555555560678 <+120>:   addsd  0x18(%rcx,%rax,8),%xmm3
   0x000055555556067e <+126>:   movntdq %xmm0,(%rsi,%rax,8)
=> 0x0000555555560683 <+131>:   movntdq %xmm1,0x8(%rsi,%rax,8)
   0x0000555555560689 <+137>:   movntdq %xmm2,0x10(%rsi,%rax,8)
   0x000055555556068f <+143>:   movntdq %xmm3,0x18(%rsi,%rax,8)
   0x0000555555560695 <+149>:   add    $0x4,%rax
   0x0000555555560699 <+153>:   cmp    %rdi,%rax
   0x000055555556069c <+156>:   jl     0x555555560640 <stream_mem+64>
   0x000055555556069e <+158>:   pop    %r15
   0x00005555555606a0 <+160>:   pop    %r14
   0x00005555555606a2 <+162>:   pop    %r13
   0x00005555555606a4 <+164>:   pop    %r12
   0x00005555555606a6 <+166>:   pop    %rbx
   0x00005555555606a7 <+167>:   mov    %rbp,%rsp
   0x00005555555606aa <+170>:   pop    %rbp
   0x00005555555606ab <+171>:   ret
End of assembler dump.
(gdb) info registers
rax            0x0                 0
rbx            0x1                 1
rcx            0x7ffea473e6c0      140731657479872
rdx            0x7fff4373e6c0      140734325057216
rsi            0x7fffe273e6c0      140736992634560
rdi            0x27bc868           41666664
rbp            0x7ff1055ffe20      0x7ff1055ffe20
rsp            0x7ff1055ffdf8      0x7ff1055ffdf8
r8             0x0                 0
r9             0x40                64
r10            0x13de4340          333333312
r11            0x8                 8
r12            0x27bc868           41666664
r13            0x555555571cba      93824992353466
r14            0x555555560600      93824992282112
r15            0x7ff1055ffe60      140673154023008
rip            0x555555560683      0x555555560683 <stream_mem+131>
eflags         0x10246             [ PF ZF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
k0             0x1084081e          277088286
k1             0xfffe0000          4294836224
k2             0xfbfffbff          4227857407
k3             0x0                 0
k4             0x2040400           33817600
k5             0x80                128
k6             0x0                 0
k7             0x0                 0
fs_base        0x7ff1056006c0      140673154025152
gs_base        0x0                 0

I wonder whether the crash is caused by memory alignment: movntdq needs a 16-byte-aligned address, but the faulting store writes to 0x8(%rsi,%rax,8), which here is only 8-byte aligned.
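The hypothesis can be checked against the register dump above. The following is a minimal sketch in Python (it is not part of likwid; the helper names are made up for illustration). It recomputes the effective addresses of the four movntdq stores from the rsi and rax values gdb reported and tests each one for 16-byte alignment.

```python
# Values copied from the `info registers` output above.
RSI = 0x7fffe273e6c0  # base address of the destination stream
RAX = 0x0             # loop index, counted in 8-byte elements

# Displacements of the four movntdq stores in the disassembly.
DISPLACEMENTS = [0x0, 0x8, 0x10, 0x18]


def effective_address(base, index, disp, scale=8):
    """Address of disp(base,index,scale) in AT&T operand syntax."""
    return base + index * scale + disp


def store_alignment_report(base, index, displacements, required=16):
    """Return one (address, is_aligned) pair per store."""
    addresses = [effective_address(base, index, d) for d in displacements]
    return [(addr, addr % required == 0) for addr in addresses]


for addr, ok in store_alignment_report(RSI, RAX, DISPLACEMENTS):
    print(f"{addr:#x}: {'aligned' if ok else 'misaligned'}")
```

The first store (displacement 0x0) is 16-byte aligned and the second (displacement 0x8) is not, which matches the crash at stream_mem+131, the second movntdq.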

fairydreaming commented 4 days ago

Here's how I fixed this:

diff --git a/bench/GCC/stream_mem.pas.bak b/bench/GCC/stream_mem.pas
index 9e61bbc..8493f0e 100644
--- a/bench/GCC/stream_mem.pas.bak
+++ b/bench/GCC/stream_mem.pas
@@ -40,10 +40,10 @@ mulsd    FPR3, FPR5
 addsd    FPR3, [STR2 + GPR1*8+16]
 mulsd    FPR4, FPR5
 addsd    FPR4, [STR2 + GPR1*8+24]
-movntdq   [STR0 + GPR1*8], FPR1
-movntdq   [STR0 + GPR1*8+8], FPR2
-movntdq   [STR0 + GPR1*8+16], FPR3
-movntdq   [STR0 + GPR1*8+24], FPR4
+unpcklpd FPR1,FPR2
+unpcklpd FPR3,FPR4
+movntpd   [STR0 + GPR1*8], FPR1
+movntpd   [STR0 + GPR1*8+16], FPR3

 }

but my knowledge of assembly is limited, so I'm not 100% sure it's correct.

TomTheBear commented 4 days ago

Thanks for the issue and the great analysis (:bouquet:). You are correct, the movnt* stores from XMM registers (movntdq as well as movntpd) require 16-byte-aligned addresses. Your fix using unpck* is correct.

Description for unpcklpd FPR1,FPR2:

FPR1[0:63] = FPR1[0:63]
FPR1[64:127] = FPR2[0:63]
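To see why the packed variant stores the same values, here is a small Python model of the two lines above (a sketch only; registers are modelled as (low, high) pairs of doubles, and `store_pairs` is a hypothetical helper that mirrors the store sequence from the diff).

```python
def unpcklpd(dst, src):
    """Model `unpcklpd dst, src`: keep dst's low lane, put src's low lane in dst's high lane."""
    dst_low, _ = dst
    src_low, _ = src
    return (dst_low, src_low)


def store_pairs(results):
    """Pack four scalar results pairwise and emit them as two 16-byte stores."""
    # Each scalar result lives in the low lane; the high lane is not used.
    fpr1, fpr2, fpr3, fpr4 = [(r, 0.0) for r in results]
    fpr1 = unpcklpd(fpr1, fpr2)
    fpr3 = unpcklpd(fpr3, fpr4)
    memory = []
    memory.extend(fpr1)  # movntpd [STR0 + GPR1*8], FPR1
    memory.extend(fpr3)  # movntpd [STR0 + GPR1*8+16], FPR3
    return memory


print(store_pairs([1.0, 2.0, 3.0, 4.0]))
```

The two packed stores write the four results in the same order as the four scalar stores did. Their displacements (+0 and +16) stay 16-byte aligned as long as the stream base is 16-byte aligned, because the loop index advances by 4 elements (32 bytes) per iteration.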

The main issue I see with the kernel is that it is not pure scalar code. It requires SSE to work since the movnt* instructions are SSE instructions. Arithmetic is scalar but the data movement requires SSE.

Can you please open a PR with your fix so that you are associated with it? Please update the description to "uses scalar arithmetic and SSE non-temporal stores". There is currently no stream_sp_mem, so if you like, include that kernel in the PR as well.

fairydreaming commented 4 days ago

@TomTheBear Sure, I'll try to prepare a PR.

While we are at it, are the INSTR_LOOP 7 and UOPS 8 values in stream_mem.ptt correct? In stream.ptt they are 19 and 26, but the two kernels differ only in their store instructions; everything else is the same. So I think the correct value for INSTR_LOOP in stream_mem.ptt is 19 as well, and my fix didn't change the number of instructions.
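The instruction count can be cross-checked against the gdb disassembly earlier in this thread. The sketch below (Python, illustrative only) transcribes the loop body from stream_mem+64 to stream_mem+156 and applies the diff to it. It assumes that INSTR_LOOP counts every instruction in the assembled loop, including the loop control, which is consistent with the 19 reported for stream.ptt but is not confirmed in this thread.

```python
from collections import Counter

# Loop body of the original stream_mem kernel, transcribed from the
# disassembly: 4 loads, 4 multiply/add pairs, 4 stores, 3 loop-control ops.
ORIGINAL_LOOP = (
    ["movsd"] * 4
    + ["mulsd", "addsd"] * 4
    + ["movntdq"] * 4
    + ["add", "cmp", "jl"]
)


def apply_fix(loop):
    """Swap the four movntdq stores for two unpcklpd plus two movntpd, as in the diff."""
    body = [m for m in loop if m != "movntdq"]
    insert_at = body.index("add")  # the stores sit just before the loop control
    return body[:insert_at] + ["unpcklpd"] * 2 + ["movntpd"] * 2 + body[insert_at:]


FIXED_LOOP = apply_fix(ORIGINAL_LOOP)
print(len(ORIGINAL_LOOP), len(FIXED_LOOP))
print(Counter(FIXED_LOOP))
```

Both loops have the same length, since four stores are replaced by two shuffles and two stores.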

Regarding UOPS: in stream.ptt, did you count 2 (unfused-domain) UOPS for each addsd and each movsd store and 1 UOP for everything else, that is 24 + 2 for the loop logic = 26? In the corrected stream_mem.ptt, movntpd(M128, XMM) also has 2 UOPS in the unfused domain, but unpcklpd(XMM, XMM) has only 1 UOP, so I guess the value of UOPS in stream_mem.ptt after my fix should be 22 + 2 = 24?
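The arithmetic in this question can be written out as a short Python tally. The per-instruction uop counts below are the guesses stated in the question (unfused domain), not measured or documented values, and the dictionary keys are names invented for this sketch.

```python
# Per-instruction uop counts as assumed in the comment above (unfused domain).
UOPS = {
    "movsd_load": 1,
    "mulsd": 1,
    "addsd_mem": 2,      # add with a memory operand
    "movsd_store": 2,
    "unpcklpd": 1,
    "movntpd_store": 2,
}
LOOP_LOGIC_UOPS = 2      # assumed cost of the loop control


def total_uops(body):
    """Sum the assumed uops for a loop body given as {instruction: count}."""
    return sum(UOPS[name] * count for name, count in body.items()) + LOOP_LOGIC_UOPS


STREAM = {"movsd_load": 4, "mulsd": 4, "addsd_mem": 4, "movsd_store": 4}
STREAM_MEM_FIXED = {
    "movsd_load": 4,
    "mulsd": 4,
    "addsd_mem": 4,
    "unpcklpd": 2,
    "movntpd_store": 2,
}

print(total_uops(STREAM), total_uops(STREAM_MEM_FIXED))
```

Under these assumptions the tally reproduces 26 for stream and 24 for the fixed stream_mem. A fused-domain count would use different per-instruction values.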

Please correct me if I'm wrong, as this is all something completely new to me.

TomTheBear commented 3 days ago

https://github.com/RRZE-HPC/likwid/pull/650#issuecomment-2500895809

I always use the fused-domain uops because that's what you get from the hardware when measuring uops retired.

TomTheBear commented 2 days ago

Regarding your posts on phoronix.com:

* likwid-bench does not include the write-allocate/read-for-ownership (you call them phantom reads in your post). They could be included (as we know what is executing) but we agreed internally to not do that because it is not performance/data transfer as seen by the application. Moreover, recent Intel chips and ARM chips have their own mechanisms to avoid the write-allocate/RFO/"phantom reads" (Intel: SpecI2M, ARM: cache-line claim).
* The naming of the kernels can be misleading. For us, triad is the "Schönauer Triad".
* Thanks for "I tried likwid-bench and it's very good, finally a way to perform NUMA-aware benchmarks without much hassle"
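As a rough illustration of the write-allocate point, the sketch below (Python, illustrative only) compares the traffic an application sees with the traffic on the memory bus for a kernel that reads two double-precision streams and writes one, as the stream kernel in this issue does. It assumes a plain write-allocate cache with no hardware mechanism that avoids the extra read.

```python
BYTES_PER_DOUBLE = 8


def bytes_per_iteration(loads, stores, write_allocate):
    """Memory traffic for one iteration of a streaming kernel on doubles.

    With write_allocate=True, every stored element is also read from
    memory first, which adds one read per store.
    """
    extra_reads = stores if write_allocate else 0
    return (loads + stores + extra_reads) * BYTES_PER_DOUBLE


application_view = bytes_per_iteration(loads=2, stores=1, write_allocate=False)
memory_bus_view = bytes_per_iteration(loads=2, stores=1, write_allocate=True)
print(application_view, memory_bus_view)
```

In this model the bus moves 32 bytes per iteration while the application moves 24, so a bandwidth figure based on the application view is a quarter lower than what the memory controller actually transfers. Non-temporal stores, as used in stream_mem, are one way to avoid the extra read.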

fairydreaming commented 2 days ago

Regarding your posts on phoronix.com:

* `likwid-bench` does not include the write-allocate/read-for-ownership (you call them phantom reads in your post). They could be included (as we know what is executing) but we agreed internally to not do that because it is not performance/data transfer as seen by the application. Moreover, recent Intel chips and ARM chips have their own mechanisms to avoid the write-allocate/RFO/"phantom reads" (Intel: SpecI2M, ARM: cache-line claim).

Thank you for the clarification on this matter.

RRZE-HPC / likwid

[BUG] Segmentation Fault in likwid-bench when executing stream_mem benchmark on Epyc 9374F #649