RRZE-HPC / likwid

Performance monitoring and benchmarking suite
https://hpc.fau.de/research/tools/likwid/
GNU General Public License v3.0

Kernel panic when using likwid #158

Open ugiwgh opened 6 years ago

ugiwgh commented 6 years ago

restored to 25.8.24.27@tcp (at 12.1.5.137@tcp1)
[455044.384752] show_signal_msg: 92 callbacks suppressed
[455044.389824] xxxx-xx[5199]: segfault at 104 ip 0000000000fb07c4 sp 00007ffff040c2d0 error 4 in xxxx-xx[400000+1243000]
[493477.345464] BUG: unable to handle kernel NULL pointer dereference at 0000000000000002
[493477.353508] IP: [] kmem_cache_alloc_trace+0x80/0x200
[493477.360255] PGD fed698067 PUD feb6c7067 PMD 0
[493477.364932] Oops: 0000 [#1] SMP
[493477.368298] Modules linked in: osc(OE) mgc(OE) lustre(OE) lmv(OE) fld(OE) mdc(OE) fid(OE) lov(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) mic(OE) ioatdma nio_net(OE) nio_dev(OE) igb i2c_algo_bit ptp pps_core sb_edac dca edac_core iTCO_wdt ipmi_ssif ipmi_msghandler acpi_pad button acpi_cpufreq(E) [last unloaded: nio_dev]
[493477.398973] CPU: 4 PID: 5763 Comm: slurmd Tainted: G OE ------------ T 3.10.0-514.TH.cn #22
[493477.408391] Hardware name: Default string Default string/Default string, BIOS 5.11 04/13/2018
[493477.417004] task: ffff8816abca9f40 ti: ffff88031d00c000 task.ti: ffff88031d00c000
[493477.424588] RIP: 0010:[] [] kmem_cache_alloc_trace+0x80/0x200
[493477.433715] RSP: 0018:ffff88031d00fdc0 EFLAGS: 00010282
[493477.439123] RAX: 0000000000000000 RBX: ffff880fedf1bc00 RCX: 0000000001ce3a59
[493477.446367] RDX: 0000000001ce3a58 RSI: 00000000000080d0 RDI: ffff881030003c00
[493477.453700] RBP: ffff88031d00fdf8 R08: 0000000000019860 R09: ffff881030003c00
[493477.460936] R10: ffffffff813e149d R11: 0000000000000000 R12: 0000000000000002
[493477.468242] R13: 00000000000080d0 R14: 0000000000000020 R15: ffff881030003c00
[493477.475541] FS: 00002b6f72827700(0000) GS:ffff88103fb00000(0000) knlGS:0000000000000000
[493477.483860] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[493477.489719] CR2: 0000000000000002 CR3: 000000101fc9c000 CR4: 00000000003407e0
[493477.497026] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[493477.504305] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[493477.511587] Stack:
[493477.513694] ffff881030003c00 ffffffff813e149d ffff880fedf1bc00 ffff880fedf1bc00
[493477.521319] 00000000000080d0 ffffffff81cc2ea0 0000000000000002 ffff88031d00fe18
[493477.528885] ffffffff813e149d ffff880fedf1bc00 ffff882026e1c000 ffff88031d00fe28
[493477.536450] Call Trace:
[493477.538992] [] ? selinux_sk_alloc_security+0x2d/0x80
[493477.545703] [] selinux_sk_alloc_security+0x2d/0x80
[493477.552284] [] security_sk_alloc+0x16/0x20
[493477.558136] [] sk_prot_alloc+0x74/0x190
[493477.563711] [] sk_alloc+0x2c/0xd0
[493477.568779] [] inet_create+0xf0/0x360
[493477.574218] [] __sock_create+0x110/0x260
[493477.579912] [] SyS_socket+0x61/0xf0
[493477.585164] [] ? page_fault+0x28/0x30
[493477.590594] [] system_call_fastpath+0x16/0x1b
[493477.596711] Code: d0 00 00 49 8b 50 08 4d 8b 20 49 8b 40 10 4d 85 e4 0f 84 24 01 00 00 48 85 c0 0f 84 1b 01 00 00 49 63 47 20 48 8d 4a 01 4d 8b 07 <49> 8b 1c 04 4c 89 e0 65 49 0f c7 08 0f 94 c0 84 c0 74 b9 49 63
[493477.616768] RIP [] kmem_cache_alloc_trace+0x80/0x200
[493477.623529] RSP
[493477.627107] CR2: 0000000000000002
[493477.630511] ---[ end trace 2e74802de60e47e4 ]---
[493477.635247] Kernel panic - not syncing: Fatal exception

TomTheBear commented 6 years ago

This is the first time I have seen a kernel panic attributed to LIKWID. Can you please give more information: LIKWID version, your OS, kernel version, microarchitecture, and general environment (SLURM?)? What exactly do you execute (command line)?

ugiwgh commented 6 years ago

LIKWID Version=4.3.2
OS=redhat7.3
Kernel Version=3.10.0-514
Micro=FLOPS_DP
I wrote my code against the LIKWID API and run it with SLURM.

TomTheBear commented 6 years ago

I installed CentOS 7.3 with the same kernel in a virtual machine and don't get any errors. I used a self-written STREAM benchmark with the MarkerAPI included. We have no Red Hat license, but CentOS should be close enough. The command line and the microarchitecture are missing from your answer.

The trace names the system call SyS_socket in the call chain. This corresponds to the socket() function, which LIKWID uses in two places (the server and client sides of the access daemon). Can you please check that you don't hit the limit of open files and open sockets (ulimit)? In general, the Linux kernel shouldn't crash when a user application tries to create a socket. Returning an error code, sure, but all user input should be caught properly, so it might be a kernel bug as well.
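A minimal sketch of the check suggested above, written in Python for brevity rather than taken from LIKWID itself. It reports the open-file limit of the current process, counts its open descriptors, and tries one socket() call. The helper names `fd_usage` and `try_socket` are hypothetical, and using `AF_INET` is an assumption based on `inet_create` appearing in the trace. Run it inside the same SLURM job environment as the failing program so that the limits match.

```python
import os
import resource
import socket


def fd_usage():
    """Return (open_fds, soft_limit, hard_limit) for the current process.

    open_fds is None when /proc is unavailable (non-Linux systems).
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    proc_dir = "/proc/self/fd"
    if os.path.isdir(proc_dir):
        # One entry per open descriptor; the count includes the descriptor
        # that listdir itself opens to read the directory.
        open_fds = len(os.listdir(proc_dir))
    else:
        open_fds = None
    return open_fds, soft, hard


def try_socket(family=socket.AF_INET, kind=socket.SOCK_STREAM):
    """Create and close one socket; return None on success, else the OSError."""
    try:
        sock = socket.socket(family, kind)
    except OSError as exc:
        return exc
    sock.close()
    return None


if __name__ == "__main__":
    used, soft, hard = fd_usage()
    print(f"open descriptors: {used}, soft limit: {soft}, hard limit: {hard}")
    err = try_socket()
    print("socket(): ok" if err is None else f"socket() failed: {err}")
```

If the descriptor count is close to the soft limit, socket() should fail with an error such as EMFILE rather than crash the kernel, which is the behavior described above.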

Since you are using SLURM, I assume you are trying to run LIKWID on a compute node of a cluster. Is LIKWID installed as a module, or did you install it yourself? Does it work on the frontend node? I know that is risky if it crashes the system, but it is the only way to determine whether the problem is some configuration on the compute nodes or whether LIKWID is generally broken.

ugiwgh commented 6 years ago

Yes, I'm also confused by that. I'm checking what happened. If there is any news, I'll report it here.

TomTheBear commented 5 years ago

Any news?