RRZE-HPC / likwid

Performance monitoring and benchmarking suite
https://hpc.fau.de/research/tools/likwid/
GNU General Public License v3.0

[BUG] Silent failure on multi-threaded runs #614

Open ivan-pi opened 5 months ago

ivan-pi commented 5 months ago

Describe the bug

likwid-pin appears to fail silently when pinning more than one thread: the command exits almost immediately and the application prints nothing to standard output.
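To narrow down a "silent" exit like this, it can help to record the exit status and the amount of standard output the command produced. The following is a minimal sketch, not part of LIKWID: `report_run` is a hypothetical helper name, and it assumes a POSIX shell with `mktemp` and `wc` available.

```shell
#!/bin/sh
# Hypothetical helper (not a LIKWID tool): run a command, capture its
# standard output in a temporary file, and report the exit status and
# the number of bytes written to standard output.
# Standard error is left untouched so any diagnostics still reach the terminal.
report_run() {
    out_file=$(mktemp)
    "$@" >"$out_file"
    status=$?
    # wc pads its output with spaces on some systems; strip them.
    bytes=$(wc -c <"$out_file" | tr -d ' ')
    rm -f "$out_file"
    echo "exit_status=$status stdout_bytes=$bytes"
}

# Example with the command from this report (needs likwid-pin and ./albm):
# report_run likwid-pin -c 0,1 ./albm
```

An exit status above 128 usually means the process was terminated by a signal (status minus 128), which would point to a crash rather than a clean early exit.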

To Reproduce

To Reproduce with a LIKWID command

Please supply the output of the command with -V 3 added to the command:

(base) ivan@maxwell:~/lrz/rbfxlbm/build$ likwid-pin -V 3 -c 0,1 ./albm
DEBUG - [hwloc_init_cpuInfo:359] HWLOC CpuInfo Family 6 Model 167 Stepping 1 Vendor 0x0 Part 0x0 isIntel 1 numHWThreads 16 activeHWThreads 16
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 0 Thread 0 Core 0 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 8 Thread 1 Core 0 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 1 Thread 0 Core 1 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 9 Thread 1 Core 1 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 2 Thread 0 Core 2 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 10 Thread 1 Core 2 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 3 Thread 0 Core 3 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 11 Thread 1 Core 3 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 4 Thread 0 Core 4 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 12 Thread 1 Core 4 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 5 Thread 0 Core 5 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 13 Thread 1 Core 5 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 6 Thread 0 Core 6 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 14 Thread 1 Core 6 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 7 Thread 0 Core 7 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 15 Thread 1 Core 7 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_cacheTopology:798] HWLOC Cache Pool ID 0 Level 1 Size 49152 Threads 2
DEBUG - [hwloc_init_cacheTopology:798] HWLOC Cache Pool ID 1 Level 2 Size 524288 Threads 2
DEBUG - [hwloc_init_cacheTopology:798] HWLOC Cache Pool ID 2 Level 3 Size 16777216 Threads 16
DEBUG - [affinity_init:547] Affinity: Socket domains 1
DEBUG - [affinity_init:549] Affinity: CPU die domains 1
DEBUG - [affinity_init:554] Affinity: CPU cores per LLC 8
DEBUG - [affinity_init:557] Affinity: Cache domains 1
DEBUG - [affinity_init:561] Affinity: NUMA domains 1
DEBUG - [affinity_init:562] Affinity: All domains 5
DEBUG - [affinity_addNodeDomain:370] Affinity domain N: 16 HW threads on 8 cores
DEBUG - [affinity_addSocketDomain:401] Affinity domain S0: 16 HW threads on 8 cores
DEBUG - [affinity_addDieDomain:438] Affinity domain D0: 16 HW threads on 8 cores
DEBUG - [affinity_addCacheDomain:474] Affinity domain C0: 16 HW threads on 8 cores
DEBUG - [affinity_addMemoryDomain:504] Affinity domain M0: 16 HW threads on 8 cores
DEBUG - [create_lookups:290] T 0 T2C 0 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 1 T2C 1 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 2 T2C 2 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 3 T2C 3 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 4 T2C 4 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 5 T2C 5 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 6 T2C 6 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 7 T2C 7 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 8 T2C 0 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 9 T2C 1 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 10 T2C 2 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 11 T2C 3 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 12 T2C 4 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 13 T2C 5 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 14 T2C 6 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 15 T2C 7 T2S 0 T2D 0 T2LLC 0 T2M 0
Evaluated CPU string to CPUs: 0,1
Running: ./albm
Using 2 thread(s) (cpuset: 0x3)

In contrast, with a single thread I get:

...
Evaluated CPU string to CPUs: 0
[likwid-pin] Main PID -> hwthread 0 - OK
Running: ./albm
Using 1 thread(s) (cpuset: 0x1)
 num_steps =         1000
 tau / dt ratio =   2.0000000E-02
 CFL  =   0.6270693    
 U0   =   1.1547005E-02
 Mach =   2.0000000E-02
 Re   =    1000.000    
 Everything okay
       51486     1081185
 In assembly routine:
    n   =        51485
    nnz =      1081185
    rownnz_max =           21
    rhs_max =            9
 Attempting to allocate memory
 n =        51485 , nz =           21 , q =            9
 sysclock (s)    3.43853497505188     
 mlups    14.9729455041758     
 ompwtime (s)    3.43853306770325     
 mlups    14.9729538096440     
 Total time (s)   3.43853306770325     
 Collision time ratio   1.559326410374301E-002
 Streaming time ratio   0.984065665745844     

If I run the application directly, it works as expected:

(base) ivan@maxwell:~/lrz/rbfxlbm/build$ OMP_NUM_THREADS=2 ./albm
 num_steps =         1000
 tau / dt ratio =   2.0000000E-02
 CFL  =   0.6270693    
 U0   =   1.1547005E-02
 Mach =   2.0000000E-02
 Re   =    1000.000    
 Everything okay
       51486     1081185
 In assembly routine:
    n   =        51485
    nnz =      1081185
    rownnz_max =           21
    rhs_max =            9
 Attempting to allocate memory
 n =        51485 , nz =           21 , q =            9
 sysclock (s)    1.81032705307007     
 mlups    28.4396107920625     
 ompwtime (s)    1.81032490730286     
 mlups    28.4396445013620     
 Total time (s)   1.81032490730286     
 Collision time ratio   1.993282543925349E-002
 Streaming time ratio   0.979440022346742     

TomTheBear commented 5 months ago

Thanks for reporting. I have never seen such behavior.

Does it work with other applications and multiple threads? Are you using some computing library like TBB, Cilk+, SYCL, ...? If it is OpenMP, is it one of the common implementations (GCC, LLVM, Intel)?
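One possible way to answer the OpenMP question is to look at which runtime library the binary links against. The sketch below is only an illustration: `classify_omp_runtime` is a hypothetical helper that scans `ldd` output for the usual shared-library names (`libgomp` for GCC, `libiomp5` for Intel, `libomp` for LLVM). It assumes the runtime is dynamically linked; a statically linked runtime would not show up.

```shell
#!/bin/sh
# Hypothetical helper: read `ldd` output on standard input and print the
# OpenMP runtime(s) it mentions, based on common shared-library names.
classify_omp_runtime() {
    found=no
    while IFS= read -r line; do
        case "$line" in
            *libgomp*)  echo "GCC (libgomp)";    found=yes ;;
            *libiomp5*) echo "Intel (libiomp5)"; found=yes ;;
            *libomp*)   echo "LLVM (libomp)";    found=yes ;;
        esac
    done
    [ "$found" = yes ] || echo "no known OpenMP runtime found"
}

# Example with the binary from this report (needs ldd and ./albm):
# ldd ./albm | classify_omp_runtime
```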