Eshakk opened this issue 7 years ago
In short, timedated and timesyncd become unresponsive, so systemd tries to kill them -- and fails. This implies that the system as a whole is already quite bogged down, so it's not surprising that it soon stops responding altogether. This could be a symptom of a kernel problem or of extremely heavy I/O.
Are there any hints earlier in the logs?
This is another instance that happened today. This time there are no signs of systemd trying to kill unresponsive services; logging just suddenly stopped.
journalctl -k output for kernel logs: kernel_nov9_vm_hang.log
There is no heavy I/O in our environment. At the time the VM was unresponsive, average disk usage was 313 KB/s with a peak latency of 40 ms.
How can I get more logs from the kernel to pinpoint the point of failure?
Maybe add more memory and see what happens.
From the logs, it looks as though the VM has 60 GB of RAM?
You should be able to have ESXi connect the VM's first serial port to a file. Any kernel logs that don't make it to disk should still be printed to the serial port.
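For what it's worth, getting kernel messages onto that serial port usually also requires pointing the kernel console at it. A minimal sketch of the kernel command-line additions, assuming the VM's first serial port (ttyS0) is the one ESXi writes to a file; the exact way to append arguments on Container Linux may differ:

console=ttyS0,115200n8 console=tty0 ignore_loglevel

console= routes kernel messages to the serial port (and keeps them on the VGA console), and ignore_loglevel ensures even low-priority messages reach it.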
@glevand We were using 17211820 kB of memory out of 60 GB at the time the VM hung.
@bgilbert
I've managed to get a couple of screenshots of top just before the system went unresponsive.
In the first two screenshots, I expected some process to be consuming a very high amount of CPU and causing around 100 processes to wait for CPU, but the screenshots show the CPU sitting at ~90% idle.
Immediately after the third screenshot I got kicked off the server with a Broken pipe message in my ssh session. In another instance of this issue I have seen negative CPU consumption and the CPU spending 87% of its time in hardware interrupts:
Do you see anything in these? A hardware issue? A user application issue? A kernel issue? Any pointers would be very helpful.
The negative CPU consumption is strange. Also, that's a very high load average. Are there many processes in D state?
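One quick way to check for processes stuck in uninterruptible sleep, assuming a standard procps ps, is something like:

ps -eo pid,state,wchan:32,comm | awk '$2 == "D"'

The wchan column shows what each blocked process is waiting on, which can hint at a stuck disk or filesystem.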
The best way to obtain more troubleshooting information would be to collect kernel logs from the VM's virtual serial port at the time of the crash.
@bgilbert, thanks for the suggestion. I have configured a virtual serial port, and unfortunately there are no messages in it after the freeze. Up until the freeze everything looks normal in it; there are no errors or warnings.
We have been seeing this issue from the 4.13.x kernel onwards. We rolled back a couple of our VMs to 4.12.14 and have not seen a crash for a week.
The lack of log messages from the guest could imply a hypervisor problem. Does ESXi produce any relevant logs?
We have checked all the VMware logs. There are no errors, backtraces, or warnings in the vm*.log files. We can see that the ESXi host PCPUs are spending most of their time in the C0 state at the P0 power level. I could not strace the processes in the VMkernel to see what they are doing.
Have you heard of any freezes in CoreOS with the 4.13.x kernel?
I haven't heard of any other complete lockups of the kind you've described.
Does any of this make sense? Do you think there might be an issue with VMware Tools?
@bgilbert Do you have any update regarding this issue? We are hitting it frequently in our environment. Since we have verified all the other components in our setup (VMware, ESXi host, applications), the only thing left is CoreOS.
And we are sure that we have not seen this issue on 4.12.x; nodes started to freeze from 4.13.x. We have a couple of nodes running 4.12.x (CoreOS 1465.8.0) that have been fine for 23 days now. Nodes with 1520.9.0 are freezing once every two or three days.
Unfortunately, without logs or debug info of some sort, there's not an obvious path to make progress with this issue. The alpha and beta channels have been updated to 4.14.x; are you able to try the current beta?
Unfortunately, we cannot try the 4.14.x kernel in our product. We are in sync with the stable channel but have stopped updating our nodes beyond 4.12.x until this freeze issue is resolved. We have a couple of nodes running the 4.13.x kernel to provide you with debug logs. I've enabled kernel module debug logs for the vmw_pvscsi and vmw_vmci modules. At the time of the freeze we see no errors, only these messages, and the node froze after the last one:
Dec 09 16:28:51 lys3.server kernel: sd 0:0:0:0: queued cmd ffff9c3bcf839148, ctx ffff9c37fd7ef700, op=2a
Dec 09 16:28:51 lys3.server kernel: sd 0:0:0:0: cmd=ffff9c3c7ad62548 2a ctx=ffff9c37fd7ed400 result=0x0 status=0x0,0
Dec 09 16:28:51 lys3.server kernel: sd 0:0:0:0: cmd=ffff9c3c7ad63d48 2a ctx=ffff9c37fd7edb00 result=0x0 status=0x0,0
Dec 09 16:28:51 lys3.server kernel: sd 0:0:0:0: cmd=ffff9c3c7ad60548 2a ctx=ffff9c37fd7ed4c0 result=0x0 status=0x0,0
Dec 09 16:28:51 lys3.server kernel: sd 0:0:0:0: cmd=ffff9c3c7ad66d48 2a ctx=ffff9c37fd7ed5c0 result=0x0 status=0x0,0
Dec 09 16:28:51 lys3.server kernel: sd 0:0:0:0: cmd=ffff9c3c7ad66148 2a ctx=ffff9c37fd7efb40 result=0x0 status=0x0,0
Dec 09 16:28:51 lys3.server kernel: sd 0:0:0:0: cmd=ffff9c3bcf839548 2a ctx=ffff9c37fd7eea80 result=0x0 status=0x0,0
Dec 09 16:28:51 lys3.server kernel: sd 0:0:0:0: cmd=ffff9c3bcf83e548 2a ctx=ffff9c37fd7efa40 result=0x0 status=0x0,0
Dec 09 16:28:51 lys3.server kernel: sd 0:0:0:0: cmd=ffff9c3bcf839148 2a ctx=ffff9c37fd7ef700 result=0x0 status=0x0,0
Dec 09 16:28:51 lys3.server kernel: sd 0:0:0:0: cmd=ffff9c3c7ad62148 2a ctx=ffff9c37fd7ee300 result=0x0 status=0x0,0
Dec 09 16:28:51 lys3.server kernel: sd 0:0:0:0: cmd=ffff9c3c7ad64548 2a ctx=ffff9c37fd7ee200 result=0x0 status=0x0,0
Dec 09 16:28:51 lys3.server kernel: sd 0:0:0:0: cmd=ffff9c3bcf838948 2a ctx=ffff9c37fd7ec3c0 result=0x0 status=0x0,0
Dec 09 16:28:51 lys3.server kernel: sd 0:0:0:0: queued cmd ffff9c3c739f3148, ctx ffff9c37fd7ec3c0, op=2a
Dec 09 16:28:51 lys3.server kernel: sd 0:0:0:0: cmd=ffff9c3c739f3148 2a ctx=ffff9c37fd7ec3c0 result=0x0 status=0x0,0
Dec 09 16:28:51 lys3.server kernel: sd 0:0:0:0: queued cmd ffff9c3b9c078148, ctx ffff9c37fd7ec3c0, op=2a
Dec 09 16:28:51 lys3.server kernel: sd 0:0:0:0: cmd=ffff9c3b9c078148 2a ctx=ffff9c37fd7ec3c0 result=0x0 status=0x0,0
Dec 09 16:28:51 lys3.server kernel: sd 0:0:0:0: queued cmd ffff9c3c739f4548, ctx ffff9c37fd7ec3c0, op=2a
Dec 09 16:28:51 lys3.server kernel: sd 0:0:0:0: cmd=ffff9c3c739f4548 2a ctx=ffff9c37fd7ec3c0 result=0x0 status=0x0,0
Dec 09 16:28:51 lys3.server kernel: sd 0:0:0:0: queued cmd ffff9c3c78b21548, ctx ffff9c37fd7ec3c0, op=2a
Dec 09 16:28:51 lys3.server kernel: sd 0:0:0:0: cmd=ffff9c3c78b21548 2a ctx=ffff9c37fd7ec3c0 result=0x0 status=0x0,0
Dec 09 16:28:52 lys3.server kernel: sd 0:0:0:0: queued cmd ffff9c443c2f1548, ctx ffff9c37fd7ec3c0, op=2a
Dec 09 16:28:52 lys3.server kernel: sd 0:0:0:0: cmd=ffff9c443c2f1548 2a ctx=ffff9c37fd7ec3c0 result=0x0 status=0x0,0
Dec 09 16:28:52 lys3.server kernel: sd 0:0:0:0: queued cmd ffff9c42972d1148, ctx ffff9c37fd7ec3c0, op=2a
Dec 09 16:28:52 lys3.server kernel: sd 0:0:0:0: cmd=ffff9c42972d1148 2a ctx=ffff9c37fd7ec3c0 result=0x0 status=0x0,0
Dec 09 16:28:52 lys3.server kernel: sd 0:0:0:0: queued cmd ffff9c42972d1148, ctx ffff9c37fd7ec3c0, op=2a
Dec 09 16:28:52 lys3.server kernel: sd 0:0:0:0: cmd=ffff9c42972d1148 2a ctx=ffff9c37fd7ec3c0 result=0x0 status=0x0,0
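For reference, one way to turn on per-module debug messages like the ones above is the kernel's dynamic debug facility, assuming the kernel was built with CONFIG_DYNAMIC_DEBUG (a sketch only, not necessarily how the logs above were captured):

sudo mount -t debugfs none /sys/kernel/debug    # if debugfs is not already mounted
echo 'module vmw_pvscsi +p' | sudo tee /sys/kernel/debug/dynamic_debug/control
echo 'module vmw_vmci +p' | sudo tee /sys/kernel/debug/dynamic_debug/control

The messages then show up in dmesg / journalctl -k like the pvscsi lines quoted above.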
@bgilbert We saw another freeze this morning. Could you please suggest more ways to collect useful logs? Do you need debug logs from any specific kernel module? The loaded modules are listed below for reference:
user@lys3 ~ $ lsmod
Module Size Used by
veth 16384 0
xt_nat 16384 13
ipt_MASQUERADE 16384 11
nf_nat_masquerade_ipv4 16384 1 ipt_MASQUERADE
nf_conntrack_netlink 32768 0
nfnetlink 16384 2 nf_conntrack_netlink
xfrm_user 36864 1
xfrm_algo 16384 1 xfrm_user
iptable_nat 16384 1
nf_nat_ipv4 16384 1 iptable_nat
xt_addrtype 16384 2
nf_nat 24576 3 xt_nat,nf_nat_masquerade_ipv4,nf_nat_ipv4
br_netfilter 24576 0
bridge 118784 1 br_netfilter
stp 16384 1 bridge
llc 16384 2 bridge,stp
overlay 61440 7
vmw_vsock_vmci_transport 24576 1
vsock 32768 2 vmw_vsock_vmci_transport
xt_comment 16384 16
nf_conntrack_ipv4 16384 27
nf_defrag_ipv4 16384 1 nf_conntrack_ipv4
xt_conntrack 16384 2
nf_conntrack 106496 8 xt_nat,nf_conntrack_ipv4,ipt_MASQUERADE,nf_conntrack_netlink,nf_nat_masquerade_ipv4,xt_conntrack,nf_nat_ipv4,nf_nat
libcrc32c 16384 2 nf_conntrack,nf_nat
crc32c_generic 16384 0
iptable_filter 16384 1
nls_ascii 16384 1
nls_cp437 20480 1
vfat 20480 1
fat 61440 1 vfat
mousedev 20480 0
psmouse 36864 0
i2c_piix4 24576 0
i2c_core 49152 2 i2c_piix4,psmouse
vmw_balloon 20480 0
sb_edac 24576 0
vmw_vmci 61440 2 vmw_balloon,vmw_vsock_vmci_transport
edac_core 40960 1 sb_edac
evdev 24576 2
button 16384 0
sch_fq_codel 20480 9
ext4 536576 3
crc16 16384 1 ext4
mbcache 16384 1 ext4
jbd2 86016 1 ext4
fscrypto 24576 1 ext4
dm_verity 20480 1
dm_bufio 24576 1 dm_verity
sr_mod 24576 0
cdrom 53248 1 sr_mod
sd_mod 45056 5
crc32c_intel 20480 1
aesni_intel 163840 0
aes_x86_64 20480 1 aesni_intel
crypto_simd 16384 1 aesni_intel
cryptd 20480 2 crypto_simd,aesni_intel
glue_helper 16384 1 aesni_intel
ata_piix 36864 0
libata 200704 1 ata_piix
vmw_pvscsi 24576 4
vmxnet3 53248 0
scsi_mod 192512 4 sd_mod,libata,sr_mod,vmw_pvscsi
dm_mirror 24576 0
dm_region_hash 16384 1 dm_mirror
dm_log 20480 2 dm_mirror,dm_region_hash
dm_mod 102400 6 dm_mirror,dm_log,dm_bufio,dm_verity
dax 24576 1 dm_mod
user@lys3 ~ $
@bgilbert Could you please let me know your thoughts on this issue ?
@Eshakk You could try changing the kernel log level to include all debug messages. Otherwise I currently have no further thoughts beyond https://github.com/coreos/bugs/issues/2226#issuecomment-351844917.
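A minimal sketch of raising the log level so all debug messages reach the console (these are the standard printk knobs; nothing Container Linux specific is assumed):

sudo dmesg -n 8                          # set console loglevel to 8 (debug and above)
sudo sysctl -w kernel.printk="8 4 1 7"   # console, default message, minimum console, and boot-time default levels
# or append to the kernel command line: loglevel=8 ignore_loglevel

Combined with the serial console, this gives the best chance of capturing whatever the kernel prints last before the freeze.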
@bgilbert We have generated a memory dump of our VM at the time of the freeze. It seems we need debugging symbols to read this memory image. Could you please guide us on how to read the memory dump?
@bgilbert We have created vmss (VM state) and vmem files by suspending the VM when it froze. We were able to construct a core file from the vmss and are attaching it to this issue.
These are our findings:
Please let me know if you find anything in this state file. We are trying to relate these findings to our issue.
@bgilbert I believe that from 1520 onwards the CoreOS image should have debugging symbols. Why is the uncompressed kernel image showing that there are none?
[root@lys3 debug]# crash -g vmlinux_uncompressed
crash 7.1.9-3.fc27
Copyright (C) 2002-2016 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
crash: /dev/tty: No such device or address
NOTE: stdin: not a tty
crash: vmlinux_uncompressed: no .gnu_debuglink section
crash: vmlinux_uncompressed: no debugging data available
[root@lys3 debug]#
Could you please help me figure out how to read the memory image of a CoreOS VM?
Debug symbols are not used in production images. They are available in binary packages, e.g. http://builds.developer.core-os.net/boards/amd64-usr/1576.4.0/pkgs/sys-kernel/coreos-kernel-4.13.16-r2.tbz2
@dm0- Thank you very much David. Will try this.
@dm0- After using the above-mentioned packages:
[root@lys3 ~]# crash /home/user/debug/4_13_16_kernel/usr/lib/debug/usr/boot/vmlinux-4.13.16-coreos-r2 /home/user/debug/vmss.core
crash 7.1.9-3.fc27
Copyright (C) 2002-2016 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
crash: Elf64_Phdr pointer: 2f05510 ELF header end: 2f054d4
[root@lys3 ~]#
It is returning with exit status 1.
Do I need to use specific options when starting toolbox?
I'm starting it with toolbox --bind /home/user
@dm0- Hello David, hope you enjoyed your holidays. Could you please help me with the toolbox issue I've posted above?
From the source, it looks like that error message means it thinks your core file is an invalid netdump file.
@dm0- We have generated the vmss/vmem files by suspending the VM in VMware and the core file using the vmss2core tool. Is there any other standard way to do this?
Sorry, I am not familiar with this. VMware is a community supported platform, so maybe someone on the coreos-user mailing list or #coreos IRC channel would be knowledgeable about the issue.
We've been experiencing this issue on bare metal servers on both 1576.4.0 and 1576.5.0.
We have no logs at all, and the kernel watchdogs do not do or log anything either. We've set up the system to panic when we send an IPMI power diag command (sysctl kernel.unknown_nmi_panic=1), and nothing happens. Same for sysrq via the console. During this whole time there is nothing on the IPMI console. Despite this, we can see the CPU temperatures climbing from their normal 50 (cpu1) and 46 (cpu2) to 69 and 63 respectively. There is nothing coming out of the servers according to the switch. We've spent some time setting up a crash kernel to catch this kind of thing, but nothing shows up there either.
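For reference, a persistent version of the NMI-panic setting mentioned above might look like the following sysctl drop-in (a sketch; the file name and the extra hung-task knobs are illustrative, not necessarily what we have deployed):

# /etc/sysctl.d/90-debug-panic.conf
kernel.unknown_nmi_panic = 1        # panic on the NMI generated by the IPMI power diag command
kernel.hung_task_panic = 1          # also panic if a task stays in D state too long
kernel.hung_task_timeout_secs = 300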
This has happened twice within 1 day of putting a production workload on them, on 2 different servers. We began rolling out this update last week.
At this point we're looking at trying 1465.8.0 like @eshakk, but whether it works or not, our only way forward is to move off CoreOS if we can't find a solution.
We're running on Supermicro X9DRFF-i+ with 2 Intel(R) Xeon(R) CPU E5-2680 v2 and 64 GB of RAM. Let me know how I can provide any additional info.
@mathpl Based on that information, it's hard to tell whether your issue is the same. It might be better to file it as a new bug since the hardware is so different (ESXi vs. bare metal).
It could be worth trying the current alpha as well to see whether Intel microcode updates help with the issue.
It could also be possible to run a kernel bisect if you can find a reliable enough way to trigger it, though that's certainly easier said than done.
Unfortunately, without system logs or a reliable reproduction, it's difficult to do more than take guesses at what it might be.
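If a reliable reproducer does turn up, the bisect mentioned above would look roughly like this (a sketch against the mainline tree, assuming v4.12 boots cleanly and v4.13 hangs):

git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git && cd linux
git bisect start
git bisect bad v4.13
git bisect good v4.12
# build and boot each kernel git checks out, run the workload, then mark it:
git bisect good    # or: git bisect bad
# repeat until git reports the first bad commit

That is a fair amount of work per iteration on a VM image, which is why logs or a crash dump are usually the faster route.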
WFM, will create a new issue.
We have identified printk as the cause of the kernel hang. After analyzing the memory image (with help from VMware), we were able to find the stack trace of the task running at the time of the freeze:
crash> bt -S ffffc9001148c000
PID: 33390 TASK: ffff8803a865dac0 CPU: 3 COMMAND: "media_switch_proc"
#0 [ffffc9001148c000] __schedule at ffffffff815fd103
#1 [ffffc9001148f4c8] vsnprintf at ffffffff815f05c1
#2 [ffffc9001148f520] sprintf at ffffffff815f0891
#3 [ffffc9001148f588] __sprint_symbol at ffffffff811025ad
#4 [ffffc9001148f5d8] symbol_string at ffffffff815eed51
#5 [ffffc9001148f728] __sprint_symbol at ffffffff81102595
#6 [ffffc9001148f778] symbol_string at ffffffff815eed51
#7 [ffffc9001148f8b8] _raw_spin_lock at ffffffff81602add
#8 [ffffc9001148f8c0] vprintk_emit at ffffffff810ca91b
#9 [ffffc9001148f918] vprintk_deferred at ffffffff810cb41c
#10 [ffffc9001148f928] printk_deferred at ffffffff810cb88b
#11 [ffffc9001148f990] __printk_safe_flush at ffffffff810cbaee
#12 [ffffc9001148f9f0] printk_safe_flush at ffffffff810cbc0b
#13 [ffffc9001148fa18] crash_kexec at ffffffff81104aa0
#14 [ffffc9001148fa30] oops_end at ffffffff8102dce3
#15 [ffffc9001148fa50] general_protection at ffffffff818016ac
[exception RIP: string+33]
RIP: ffffffff815ee771 RSP: ffffc9001148fb00 RFLAGS: 00010086
RAX: ffffffff81df6813 RBX: ffffffff846ce502 RCX: ffff0a00ffffff04
RDX: 0001005280000880 RSI: ffffffff846ce8e0 RDI: ffffffff846ce502
RBP: ffffffff846ce8e0 R8: 0000000000000033 R9: ffffffffffffffff
R10: ffffffff846ce502 R11: ffff8802ea2f44bc R12: ffffc9001148fbc8
R13: 00000000000003e0 R14: ffffffff81e518fc R15: ffffffff81e518fc
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#16 [ffffc9001148fb00] vsnprintf at ffffffff815f037c
#17 [ffffc9001148fb58] vscnprintf at ffffffff815f0719
#18 [ffffc9001148fb68] vprintk_emit at ffffffff810ca933
#19 [ffffc9001148fbc0] printk at ffffffff810cb7bf
#20 [ffffc9001148fc28] netdev_rx_csum_fault at ffffffff814d5520
#21 [ffffc9001148fc38] skb_copy_and_csum_datagram_msg at ffffffff814cbbab
#22 [ffffc9001148fc80] udp_recvmsg at ffffffff81552f2d
#23 [ffffc9001148fd08] inet_recvmsg at ffffffff8155fdaa
#24 [ffffc9001148fd28] ___sys_recvmsg at ffffffff814ba88e
#25 [ffffc9001148fea0] __sys_recvmsg at ffffffff814bbf5e
#26 [ffffc9001148ff28] do_syscall_64 at ffffffff81003949
RIP: 00007f66c2f088ad RSP: 00007ffccb146240 RFLAGS: 00000293
RAX: ffffffffffffffda RBX: 00007ffccb146290 RCX: 00007f66c2f088ad
RDX: 0000000000000000 RSI: 00007ffccb146290 RDI: 0000000000000054
RBP: 000000000000001c R8: 0000000002137c28 R9: 0000000000000001
R10: 00000000ffffff00 R11: 0000000000000293 R12: 00007ffccb146290
R13: 00007f66c43b36a0 R14: 0000000000c90c10 R15: 0000000020a084a0
ORIG_RAX: 000000000000002f CS: 0033 SS: 002b
crash>
The above stack trace indicates that a bogus pointer was handed to the printk in netdev_rx_csum_fault: formatting it caused a general protection fault while vprintk_emit still held logbuf_lock, and the oops path (crash_kexec → printk_safe_flush → vprintk_deferred → vprintk_emit) then tries to take the same lock on the same CPU and spins forever, so printk hangs without ever releasing logbuf_lock.
Now, bad things happen all the time in production, but I do not think the kernel should hang on such an error message.
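For context, netdev_rx_csum_fault is the message printed when the NIC claimed a valid hardware receive checksum but the software recomputation disagreed. A hedged experiment (the interface name is illustrative) is to disable receive checksum offload on the vmxnet3 NIC and see whether the message, and with it the deadlocking printk, stops appearing:

sudo ethtool -K eth0 rx off    # turn off rx checksum offload; eth0 is a placeholder

This does not fix the underlying printk/logbuf_lock problem, but it may avoid triggering it.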
@bgilbert @euank @dm0- or anyone, Can someone please tell me
Please let me know if you have any questions.
Issue Report
Bug
We have been seeing this issue since 1520.6.0. All of a sudden, our VM stops responding, and a hard reset from the hypervisor is the only way to get it back online. When we check the logs after the hard reset, surprisingly, there are no system logs at all; the logs just cut off while the VM was unresponsive. Below is a snippet of such a scenario:
sudo journalctl
There are no logs just before the -- Reboot -- text in journalctl. Could someone please tell me what exactly is happening at the time of reboot? Why are timedated and timesyncd being killed? Under what circumstances will SIGKILL and SIGTERM fail in Linux? What does "Watchdog timeout" mean? Does this mean the kernel is trying to protect the CPU?
Container Linux Version
Environment
We are running our VM on an ESXi 6.0.0 server, and this is the only VM on that server.
Expected Behavior
There have to be system logs showing what is happening.
Actual Behavior
Logs got cut off, and timedated and timesyncd were being killed.
Reproduction Steps
Unknown. It happens randomly on some VMs.