victorgp commented 8 years ago

Issue Report

Bug

CoreOS Version

NAME=CoreOS
ID=coreos
VERSION=1010.5.0
VERSION_ID=1010.5.0
BUILD_ID=2016-05-26-2225
PRETTY_NAME="CoreOS 1010.5.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

Environment

Baremetal servers

Expected Behavior

The OS keeps running and does not reboot because of a kernel panic

Actual Behavior

After a few minutes, the server reboots due to a kernel panic

Reproduction Steps

We've been using the stable 1010.5.0 version since it was released and we didn't have any issues. We added more and more containers (using Kubernetes) until we apparently reached a limit where a kernel panic was triggered. As soon as we start Docker, we start seeing dmesg errors in journald like:

SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)

After a number of those errors, the kernel panic happens and the server reboots. This is the stack trace:

[  118.756008] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[  120.056490] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[  124.116552] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[  129.707351] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[  136.250866] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[  178.340057] general protection fault: 0000 [#1] SMP
[  178.346083] Modules linked in: binfmt_misc xt_statistic xt_nat xt_mark ipt_REJECT nf_reject_ipv4 xt_comment veth x4
[  178.435934] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.5.0-coreos-r1 #2
[  178.443867] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.2.6 06/08/2015
[  178.452768] task: ffffffffa5a11540 ti: ffffffffa5a00000 task.ti: ffffffffa5a00000
[  178.461667] RIP: 0010:[<ffffffffa50a9922>]  [<ffffffffa50a9922>] update_blocked_averages+0x392/0x500
[  178.472474] RSP: 0018:ffff88103f603df0  EFLAGS: 00010006
[  178.478698] RAX: e8ffff881ecc86d2 RBX: e8ffff881ecc85fa RCX: 0000000000000007
[  178.486982] RDX: 0000000000000001 RSI: 0000000088ffff88 RDI: ffffffffffffffff
[  178.495280] RBP: ffff88103f603e50 R08: 0000000000000003 R09: ffff88203ee760d8
[  178.503544] R10: 0000000000000004 R11: afb504000afb5041 R12: 000000000000020d
[  178.511829] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  178.520133] FS:  0000000000000000(0000) GS:ffff88103f600000(0000) knlGS:0000000000000000
[  178.529676] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  178.536398] CR2: 000000c207f31c2d CR3: 0000002032a86000 CR4: 00000000001406f0
[  178.544699] Stack:
[  178.547280]  ffff88103f60da30 0000000000000286 ffff88203ee75800 ffff88203ee760d8
[  178.556411]  afb504000afb5041 003813563f603e10 95e47a51a0eb7fa5 0000000000000007
[  178.565539]  00000000fffe2187 0000000000000000 ffff88203ee75800 0000000000000007
[  178.574716] Call Trace:
[  178.577735]  <IRQ>
[  178.579968]  [<ffffffffa50b2ddb>] rebalance_domains+0x4b/0x2e0
[  178.587451]  [<ffffffffa50a5def>] ? sched_clock_cpu+0x8f/0xa0
[  178.594246]  [<ffffffffa50b31f6>] run_rebalance_domains+0x186/0x210
[  178.601599]  [<ffffffffa507a79b>] __do_softirq+0xfb/0x280
[  178.607933]  [<ffffffffa507aa9c>] irq_exit+0x9c/0xa0
[  178.613781]  [<ffffffffa556ea72>] smp_apic_timer_interrupt+0x42/0x50
[  178.621200]  [<ffffffffa556cd42>] apic_timer_interrupt+0x82/0x90
[  178.628199]  <EOI>
[  178.630420]  [<ffffffffa5435177>] ? cpuidle_enter_state+0x107/0x250
[  178.638255]  [<ffffffffa5435153>] ? cpuidle_enter_state+0xe3/0x250
[  178.645453]  [<ffffffffa54352f7>] cpuidle_enter+0x17/0x20
[  178.651831]  [<ffffffffa50b905a>] call_cpuidle+0x2a/0x40
[  178.658056]  [<ffffffffa50b9425>] cpu_startup_entry+0x295/0x350
[  178.671023]  [<ffffffffa555f3cc>] rest_init+0x7c/0x80
[  178.676984]  [<ffffffffa5b2e013>] start_kernel+0x497/0x4b8
[  178.683497]  [<ffffffffa5b2d120>] ? early_idt_handler_array+0x120/0x120
[  178.691217]  [<ffffffffa5b2d4d7>] x86_64_start_reservations+0x2a/0x2c
[  178.698744]  [<ffffffffa5b2d623>] x86_64_start_kernel+0x14a/0x16d
[  178.705902] Code: 48 8d 98 28 ff ff ff 0f 85 00 fd ff ff 48 8b 74 24 08 48 8b 7c 24 10 e8 9d 21 4c 00 48 8d 65 d8
[  178.733365] RIP  [<ffffffffa50a9922>] update_blocked_averages+0x392/0x500
[  178.741322]  RSP <ffff88103f603df0>
[  178.745510] ---[ end trace b40c1715e81b17aa ]---
[  178.752684] Kernel panic - not syncing: Fatal exception in interrupt
[  179.797690] Shutting down cpus with NMI
[  180.031816] Kernel Offset: 0x24000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfff)
[  180.047736] Rebooting in 10 seconds..
[  190.052188] ACPI MEMORY or I/O RESET_REG.
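For anyone who wants to look up the faulting address in an unrelocated `vmlinux`, the trace above contains the two numbers needed: the runtime RIP and the "Kernel Offset" printed at panic time. The following is only a rough sketch, assuming the runtime address minus the reported offset gives the link-time address. The helper name `unrelocated_rip` and both regular expressions are illustrative and not part of any CoreOS or kernel tooling.

```python
import re

# Hypothetical patterns for the two log lines of interest (assumed formats,
# based only on the trace in this report).
KERNEL_OFFSET_RE = re.compile(r"Kernel Offset: (0x[0-9a-f]+) from (0x[0-9a-f]+)")
RIP_RE = re.compile(r"RIP\s+\[<([0-9a-f]{16})>\]\s+(\S+)")


def unrelocated_rip(log: str):
    """Return (symbol+offset, link-time address) for the final RIP line."""
    offset = int(KERNEL_OFFSET_RE.search(log).group(1), 16)
    rip = RIP_RE.search(log)
    runtime_addr = int(rip.group(1), 16)
    # Undo the relocation: the kernel was loaded `offset` bytes above its base.
    return rip.group(2), runtime_addr - offset


# Two lines copied from the trace; the second is cut before its truncated tail.
sample = (
    "[  178.733365] RIP  [<ffffffffa50a9922>] update_blocked_averages+0x392/0x500\n"
    "[  180.031816] Kernel Offset: 0x24000000 from 0xffffffff81000000\n"
)

symbol, addr = unrelocated_rip(sample)
print(symbol, hex(addr))
```

The resulting address could then be passed to a symbol lookup tool together with a matching debug build of the kernel, if one is available.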

mischief commented 8 years ago

the error SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue) is not related to the panic afaict, and happens during 'normal' operation.
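Since the SELinux message is noise for this report, a small filter can help find the first fatal line in a captured log. This is a minimal sketch: the `first_fatal_line` helper and the list of "fatal" patterns are assumptions chosen to match the trace in this issue, not an exhaustive list of kernel error markers.

```python
import re

# The message reported as harmless in this thread.
NOISE = "SELinux: mount invalid."
# Assumed markers of a fatal kernel event; adjust for other logs.
FATAL_RE = re.compile(r"general protection fault|Kernel panic|Oops|BUG:")


def first_fatal_line(lines):
    """Return the first fatal-looking line, skipping the SELinux noise."""
    for line in lines:
        if NOISE in line:
            continue
        if FATAL_RE.search(line):
            return line
    return None


# Three lines copied from the log in this issue.
log = [
    "[  136.250866] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)",
    "[  178.340057] general protection fault: 0000 [#1] SMP",
    "[  178.752684] Kernel panic - not syncing: Fatal exception in interrupt",
]

print(first_fatal_line(log))
```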

can you try to reproduce this with alpha 1068.0.0 with the 4.6.0 kernel?

victorgp commented 8 years ago

@mischief Yes, this doesn't happen with 1068.0

With one server on the stable version and one on the alpha version, both running the same containers (moved between them with Kubernetes), the kernel panic is easily reproducible on the stable version: it crashes after a few minutes of running some containers. The alpha version seems robust and doesn't crash.

I'm surprised you proposed the alpha version so quickly. Is this issue something you were already aware of? Is it related to the new kernel version?

It looks like the stable CoreOS version is not so stable. Luckily we weren't running this in production, because this took our whole cluster down.

mischief commented 8 years ago

@victorgp no, not an issue i've been aware of. it's just that sometimes bugs are fixed in newer kernels, so it's always worth a try to get another data point.

alogoc commented 8 years ago

I can confirm seeing this very same error after upgrading to stable 1122.2.0. It doesn't lead to a kernel panic, but the server hangs if I initiate a reboot and has to be forcefully rebooted.

NAME=CoreOS
ID=coreos
VERSION=1122.2.0
VERSION_ID=1122.2.0
BUILD_ID=2016-09-06-1449
PRETTY_NAME="CoreOS 1122.2.0 (MoreOS)"

victorgp commented 8 years ago

@marineam @mischief is this a confirmed bug in 1122.2.0? I'd like to know before upgrading my nodes. Thanks

mjg59 commented 8 years ago

The message is harmless and unrelated to any other issues that are being seen. Please open a separate issue for other specific problems (such as the failure to reboot) so we can ensure that they're handled appropriately, thanks!

mjg59 commented 8 years ago

@victorgp The crash you were seeing should certainly be fixed in 1122.

coreos / bugs

CoreOS stable rebooting due to kernel panic: "SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)" #1410
