coreos / bugs

Issue tracker for CoreOS Container Linux
https://coreos.com/os/eol/

[Kernel panic] NMI watchdog: Watchdog detected hard LOCKUP on cpu #2655

Closed mtinny closed 4 years ago

mtinny commented 4 years ago

Issue Report

Bug

We see servers reboot at random because of the kernel panic "hard LOCKUP on cpu".

Server instances: 11 of 20 servers affected over 2 years
Frequency: 0 to 2 times per server per month

The reboots show no correlation with date/time, server instance, or load.

Container Linux Version

# cat /etc/os-release 
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1353.7.0
VERSION_ID=1353.7.0
BUILD_ID=2017-04-26-2154
PRETTY_NAME="Container Linux by CoreOS 1353.7.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"

Environment

Actual Behavior

Random OS reboots caused by the kernel panic "hard LOCKUP on cpu"

Reproduction Steps

We cannot reproduce this issue intentionally. It occurs at random, with no correlation to date/time, server instance, or load.

Other Information

/sys/fs/pstore/dmesg-erst-6818604196963549185

<0>[1401585.395801] NMI watchdog: Watchdog detected hard LOCKUP on cpu 3
Modules linked in: binfmt_misc udp_diag tcp_diag inet_diag xt_set xt_multiport iptable_mangle iptable_raw ip_set_hash_ip ip_set_hash_net ip_set ipip tunnel4 ip_tunnel veth nf_conntrack_netlink nfnetlink xt_statistic xt_nat xt_recent ipt_REJECT nf_reject_ipv4 nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace sunrpc fscache xt_comment xt_mark ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge overlay 8021q garp mrp stp llc coretemp sb_edac ipmi_ssif i2c_core edac_core ipmi_devintf nls_ascii nls_cp437 vfat fat x86_pkg_temp_thermal kvm_intel kvm ipmi_si evdev dcdbas irqbypass ipmi_msghandler mei_me mei button sch_fq_cod
<4>[1401585.395802] CPU: 3 PID: 20950 Comm: td-agent Not tainted 4.9.24-coreos #1
<4>[1401585.395803] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.5.5 08/16/2017
<4>[1401585.395803] task: ffff8aa663a43b80 task.stack: ffff95de373cc000
<4>[1401585.395804] RIP: 0010:[<ffffffff8b0ca613>]  [<ffffffff8b0ca613>] native_queued_spin_lock_slowpath+0x113/0x1a0
<4>[1401585.395804] RSP: 0000:ffff95de373cfae0  EFLAGS: 00000046
<4>[1401585.395805] RAX: 0000000000000000 RBX: ffff95de373cfb68 RCX: ffff8ac6ff259140
<4>[1401585.395805] RDX: ffff8ac6ff4d9140 RSI: 0000000000600101 RDI: ffff8aa6ffa129c8
<4>[1401585.395806] RBP: ffff95de373cfae0 R08: 0000000000100000 R09: 0000000000000000
<4>[1401585.395806] R10: 0000000000004126 R11: 0000000000000000 R12: ffff8ac6ff2529c8
<4>[1401585.395806] R13: ffff8aa6ffa129c8 R14: ffff8ac6ff2529c0 R15: ffff8aa6ffa129c0
<4>[1401585.395807] FS:  00007f1dcc5c5700(0000) GS:ffff8ac6ff240000(0000) knlGS:0000000000000000
<4>[1401585.395807] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[1401585.395807] CR2: 0000000001ed3168 CR3: 0000003e490fe000 CR4: 00000000003406e0
<4>[1401585.395808] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[1401585.395808] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[1401585.395808] Stack:
<4>[1401585.395809]  ffff95de373cfaf0 ffffffff8b5d4df8 ffff95de373cfbf0 ffffffff8b11975b
<4>[1401585.395809]  ffff8ac6ff2529d0 ffff8aa6ffa129d0 0000000000000000 0000000000000000
<4>[1401585.395810]  ffffffff8b1191a0 ffff95de373cfb90 ffff95de373cfb68 ffffffff8b0b6961
<4>[1401585.395810] Call Trace:
<4>[1401585.395810]  [<ffffffff8b5d4df8>] _raw_spin_lock_irq+0x28/0x30
<4>[1401585.395811]  [<ffffffff8b11975b>] stop_two_cpus+0x15b/0x270
<4>[1401585.395811]  [<ffffffff8b1191a0>] ? cpu_stop_queue_work+0x90/0x90
<4>[1401585.395811]  [<ffffffff8b0b6961>] ? enqueue_entity+0x571/0xdb0
<4>[1401585.395812]  [<ffffffff8b1191a0>] ? cpu_stop_queue_work+0x90/0x90
<4>[1401585.395812]  [<ffffffff8b0a7b90>] ? __migrate_swap_task.part.93+0x80/0x80
<4>[1401585.395812]  [<ffffffff8b0a824a>] migrate_swap+0xba/0x140
<4>[1401585.395813]  [<ffffffff8b0b2f1c>] task_numa_migrate+0x54c/0x9e0
<4>[1401585.395813]  [<ffffffff8b0b3425>] numa_migrate_preferred+0x75/0x80
<4>[1401585.395814]  [<ffffffff8b0b8d39>] task_numa_fault+0x9d9/0xd80
<4>[1401585.395814]  [<ffffffff8b0e7729>] ? hrtimer_try_to_cancel+0x29/0x130
<4>[1401585.395814]  [<ffffffff8b0b81ca>] ? should_numa_migrate_memory+0x5a/0x130
<4>[1401585.395815]  [<ffffffff8b1ba0a3>] handle_mm_fault+0xb23/0x14c0
<4>[1401585.395815]  [<ffffffff8b069942>] __do_page_fault+0x222/0x4b0
<4>[1401585.395815]  [<ffffffff8b069bf2>] do_page_fault+0x22/0x30
<4>[1401585.395816]  [<ffffffff8b5d63b8>] page_fault+0x28/0x30
<4>[1401585.395817] Code: 48 89 c2 c1 e8 12 48 c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 40 91 01 00 48 03 14 c5 c0 e4 81 8b 48 89 0a 8b 41 08 85 c0 75 09 <f3> 90 8b 41 08 85 c0 74 f7 4c 8b 09 4d 85 c9 74 08 41 0f 0d 09
<0>[1401585.395817] Kernel panic - not syncing: Hard LOCKUP
<4>[1401585.395818] CPU: 3 PID: 20950 Comm: td-agent Not tainted 4.9.24-coreos #1
<4>[1401585.395818] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.5.5 08/16/2017
<4>[1401585.395819]  ffff8ac6ff245af8 ffffffff8b31c193 ffff8ac6ff245e00 ffffffff8b7cf3b7
<4>[1401585.395819]  ffff8ac6ff245b80 ffffffff8b17f2b5 0000000000000010 ffff8ac6ff245b90
<4>[1401585.395820]  ffff8ac6ff245b28 000000008d0225d4 0000000000000005 ffffffff8b7bf369
<4>[1401585.395820] Call Trace:
<4>[1401585.395820]  <NMI>  [<ffffffff8b31c193>] dump_stack+0x63/0x90
<4>[1401585.395821]  [<ffffffff8b17f2b5>] panic+0xe8/0x236
<4>[1401585.395821]  [<ffffffff8b07ad2b>] nmi_panic+0x3b/0x40
<4>[1401585.395821]  [<ffffffff8b129665>] watchdog_overflow_callback+0xd5/0xe0
<4>[1401585.395822]  [<ffffffff8b16cf7f>] __perf_event_overflow+0x7f/0x1c0
<4>[1401585.395822]  [<ffffffff8b178a44>] perf_event_overflow+0x14/0x20
<4>[1401585.395822]  [<ffffffff8b00c884>] intel_pmu_handle_irq+0x1e4/0x4a0
<4>[1401585.395823]  [<ffffffff8b31e277>] ? ioremap_page_range+0x287/0x3e0
<4>[1401585.395823]  [<ffffffff8b1c7cbc>] ? vunmap_page_range+0x20c/0x340
<4>[1401585.395824]  [<ffffffff8b1c7e01>] ? unmap_kernel_range_noflush+0x11/0x20
<4>[1401585.395824]  [<ffffffff8b3d35be>] ? ghes_copy_tofrom_phys+0x11e/0x2a0
<4>[1401585.395824]  [<ffffffff8b3d37d8>] ? ghes_read_estatus+0x98/0x170
<4>[1401585.395825]  [<ffffffff8b005a3d>] perf_event_nmi_handler+0x2d/0x50
<4>[1401585.395825]  [<ffffffff8b0320f6>] nmi_handle+0x66/0x120
<4>[1401585.395825]  [<ffffffff8b032674>] default_do_nmi+0x44/0x110
<4>[1401585.395826]  [<ffffffff8b03282c>] do_nmi+0xec/0x140
<4>[1401585.395826]  [<ffffffff8b5d6721>] end_repeat_nmi+0x1a/0x1e
<4>[1401585.395826]  [<ffffffff8b0ca613>] ? native_queued_spin_lock_slowpath+0x113/0x1a0
<4>[1401585.395827]  [<ffffffff8b0ca613>] ? native_queued_spin_lock_slowpath+0x113/0x1a0
<4>[1401585.395827]  [<ffffffff8b0ca613>] ? native_queued_spin_lock_slowpath+0x113/0x1a0
<4>[1401585.395827]  <EOE>  [<ffffffff8b5d4df8>] _raw_spin_lock_irq+0x28/0x30
<4>[1401585.395828]  [<ffffffff8b11975b>] stop_two_cpus+0x15b/0x270
<4>[1401585.395828]  [<ffffffff8b1191a0>] ? cpu_stop_queue_work+0x90/0x90
<4>[1401585.395828]  [<ffffffff8b0b6961>] ? enqueue_entity+0x571/0xdb0
<4>[1401585.395829]  [<ffffffff8b1191a0>] ? cpu_stop_queue_work+0x90/0x90
<4>[1401585.395829]  [<ffffffff8b0a7b90>] ? __migrate_swap_task.part.93+0x80/0x80
<4>[1401585.395830]  [<ffffffff8b0a824a>] migrate_swap+0xba/0x140
<4>[1401585.395830]  [<ffffffff8b0b2f1c>] task_numa_migrate+0x54c/0x9e0
<4>[1401585.395830]  [<ffffffff8b0b3425>] numa_migrate_preferred+0x75/0x80
<4>[1401585.395831]  [<ffffffff8b0b8d39>] task_numa_fault+0x9d9/0xd80
<4>[1401585.395831]  [<ffffffff8b0e7729>] ? hrtimer_try_to_cancel+0x29/0x130
<4>[1401585.395831]  [<ffffffff8b0b81ca>] ? should_numa_migrate_memory+0x5a/0x130
<4>[1401585.395832]  [<ffffffff8b1ba0a3>] handle_mm_fault+0xb23/0x14c0
<4>[1401585.395832]  [<ffffffff8b069942>] __do_page_fault+0x222/0x4b0
<4>[1401585.395832]  [<ffffffff8b069bf2>] do_page_fault+0x22/0x30
<4>[1401585.395833]  [<ffffffff8b5d63b8>] page_fault+0x28/0x30
<0>[1401586.853943] Shutting down cpus with NMI
<0>[1401586.853943] Kernel Offset: 0xa000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
<0>[1401586.853948] NMI watchdog: Watchdog detected hard LOCKUP on cpu 23
Modules linked in: binfmt_misc udp_diag tcp_diag inet_diag xt_set xt_multiport iptable_mangle iptable_raw ip_set_hash_ip ip_set_hash_net ip_set ipip tunnel4 ip_tunnel veth nf_conntrack_netlink nfnetlink xt_statistic xt_nat xt_recent ipt_REJECT nf_reject_ipv4 nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace sunrpc fscache xt_comment xt_mark ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge overlay 8021q garp mrp stp llc coretemp sb_edac ipmi_ssif i2c_core edac_core ipmi_devintf nls_ascii nls_cp437 vfat fat x86_pkg_temp_thermal kvm_intel kvm ipmi_si evdev dcdbas irqbypass ipmi_msghandler mei_me mei button sch_fq_co
<4>[1401586.853949] CPU: 23 PID: 20945 Comm: td-agent Not tainted 4.9.24-coreos #1
<4>[1401586.853949] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.5.5 08/16/2017
<4>[1401586.853950] task: ffff8aa6a7b43b80 task.stack: ffff95de373a4000
<4>[1401586.853950] RIP: 0010:[<ffffffff8b0ca62e>]  [<ffffffff8b0ca62e>] native_queued_spin_lock_slowpath+0x12e/0x1a0
<4>[1401586.853951] RSP: 0000:ffff95de373a7ae0  EFLAGS: 00000002
<4>[1401586.853951] RAX: 0000000000000000 RBX: ffff95de373a7b68 RCX: ffff8ac6ff4d9140
<4>[1401586.853952] RDX: 0000000000400101 RSI: 0000000000000101 RDI: ffff8aa6ffa129c8
<4>[1401586.853952] RBP: ffff95de373a7ae0 R08: 0000000000600000 R09: 0000000000000000
<4>[1401586.853953] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8ac6ff4d29c8
<4>[1401586.853953] R13: ffff8aa6ffa129c8 R14: ffff8ac6ff4d29c0 R15: ffff8aa6ffa129c0
<4>[1401586.853953] FS:  00007f1dccdcd700(0000) GS:ffff8ac6ff4c0000(0000) knlGS:0000000000000000
<4>[1401586.853954] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[1401586.853954] CR2: 0000000000a423b0 CR3: 0000003e490fe000 CR4: 00000000003406e0
<4>[1401586.853955] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[1401586.853955] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[1401586.853955] Stack:
<4>[1401586.853956]  ffff95de373a7af0 ffffffff8b5d4df8 ffff95de373a7bf0 ffffffff8b11975b
<4>[1401586.853956]  ffff8ac6ff4d29d0 ffff8aa6ffa129d0 0000000000000000 0000000000000000
<4>[1401586.853956]  ffffffff8b1191a0 ffff95de373a7b90 ffff95de373a7b68 ffffffff8b0b6961
<4>[1401586.853957] Call Trace:
<4>[1401586.853957]  [<ffffffff8b5d4df8>] _raw_spin_lock_irq+0x28/0x30
<4>[1401586.853957]  [<ffffffff8b11975b>] stop_two_cpus+0x15b/0x270
<4>[1401586.853958]  [<ffffffff8b1191a0>] ? cpu_stop_queue_work+0x90/0x90
<4>[1401586.853958]  [<ffffffff8b0b6961>] ? enqueue_entity+0x571/0xdb0
<4>[1401586.853958]  [<ffffffff8b1191a0>] ? cpu_stop_queue_work+0x90/0x90
<4>[1401586.853959]  [<ffffffff8b0a7b90>] ? __migrate_swap_task.part.93+0x80/0x80
<4>[1401586.853959]  [<ffffffff8b0a824a>] migrate_swap+0xba/0x140
<4>[1401586.853960]  [<ffffffff8b0b2f1c>] task_numa_migrate+0x54c/0x9e0
<4>[1401586.853960]  [<ffffffff8b0b3425>] numa_migrate_preferred+0x75/0x80
<4>[1401586.853960]  [<ffffffff8b0b8d39>] task_numa_fault+0x9d9/0xd80
<4>[1401586.853961]  [<ffffffff8b0e7729>] ? hrtimer_try_to_cancel+0x29/0x130
<4>[1401586.853961]  [<ffffffff8b0b81ca>] ? should_numa_migrate_memory+0x5a/0x130
<4>[1401586.853961]  [<ffffffff8b1ba0a3>] handle_mm_fault+0xb23/0x14c0
<4>[1401586.853962]  [<ffffffff8b069942>] __do_page_fault+0x222/0x4b0
<4>[1401586.853962]  [<ffffffff8b069bf2>] do_page_fault+0x22/0x30
<4>[1401586.853962]  [<ffffffff8b5d63b8>] page_fault+0x28/0x30
<4>[1401586.853963] Code: 14 c5 c0 e4 81 8b 48 89 0a 8b 41 08 85 c0 75 09 f3 90 8b 41 08 85 c0 74 f7 4c 8b 09 4d 85 c9 74 08 41 0f 0d 09 eb 02 f3 90 8b 17 <66> 85 d2 75 f7 be 01 00 00 00 eb 10 89 d0 f0 0f b1 37 39 c2 0f
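
For anyone triaging a dump like this one, the following is a minimal sketch (not part of the original report) of how the key fields could be pulled out of the pstore record. It extracts each locked-up CPU and the symbol at its RIP, then subtracts the "Kernel Offset" printed in the log to get the unrelocated address, which is the value one would look up in a matching System.map or vmlinux, assuming one is available for 4.9.24-coreos. The names `SAMPLE` and `summarize` are hypothetical, and the excerpt is a hand-copied subset of the log lines above.

```python
import re

# Hand-copied subset of the pstore record above (hypothetical variable name).
SAMPLE = """\
<0>[1401585.395801] NMI watchdog: Watchdog detected hard LOCKUP on cpu 3
<4>[1401585.395804] RIP: 0010:[<ffffffff8b0ca613>]  [<ffffffff8b0ca613>] native_queued_spin_lock_slowpath+0x113/0x1a0
<0>[1401586.853943] Kernel Offset: 0xa000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
<0>[1401586.853948] NMI watchdog: Watchdog detected hard LOCKUP on cpu 23
<4>[1401586.853950] RIP: 0010:[<ffffffff8b0ca62e>]  [<ffffffff8b0ca62e>] native_queued_spin_lock_slowpath+0x12e/0x1a0
"""

LOCKUP_RE = re.compile(r"hard LOCKUP on cpu (\d+)")
RIP_RE = re.compile(r"RIP: \w+:\[<([0-9a-f]+)>\]\s+\[<[0-9a-f]+>\] (\S+)")
OFFSET_RE = re.compile(r"Kernel Offset: (0x[0-9a-f]+)")


def summarize(log):
    """Pair each hard-lockup report with its RIP symbol and unrelocated address."""
    offset_match = OFFSET_RE.search(log)
    # If the log carries no "Kernel Offset" line, assume no relocation.
    offset = int(offset_match.group(1), 16) if offset_match else 0

    cpus = [int(cpu) for cpu in LOCKUP_RE.findall(log)]
    rips = RIP_RE.findall(log)  # list of (address, symbol+offset/size)

    results = []
    # Assumes each lockup report is followed by exactly one RIP line, as in this dump.
    for cpu, (addr, symbol) in zip(cpus, rips):
        results.append({
            "cpu": cpu,
            "symbol": symbol,
            "unrelocated": hex(int(addr, 16) - offset),
        })
    return results


if __name__ == "__main__":
    for entry in summarize(SAMPLE):
        print(entry)
```

On this excerpt the sketch reports CPUs 3 and 23, both in `native_queued_spin_lock_slowpath`. In the full traces above, both CPUs reached that function through the same path: `task_numa_fault`, `migrate_swap`, `stop_two_cpus`, `_raw_spin_lock_irq`.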

bgilbert commented 4 years ago

Container Linux 1353.7.0 was released three years ago and is running an ancient 4.9.24 kernel. We do not support versions of Container Linux older than the current releases on the alpha, beta, or stable channels.

Note that CoreOS Container Linux will reach end-of-life on May 26. We recommend that you begin migrating to a different operating system.

mtinny commented 4 years ago

Thank you for your suggestion. We will consider moving to Fedora CoreOS.