coreos / bugs

Issue tracker for CoreOS Container Linux
https://coreos.com/os/eol/

Kernel Panic on GCE with IPsec #2498

Open rafaelgaspar opened 6 years ago

rafaelgaspar commented 6 years ago

Issue Report

Kernel Panic on GCE with IPsec

Container Linux Version

$ cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1800.7.0
VERSION_ID=1800.7.0
BUILD_ID=2018-08-15-2254
PRETTY_NAME="Container Linux by CoreOS 1800.7.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Environment

Multiple machine types, from 2 to 16 cores, on Google Compute Engine (GCE), with and without local NVMe SSDs, all running Rancher with IPsec.

Expected Behavior

To respect Douglas Adams and not panic.

Actual Behavior

Multiple kernel panics all across the cluster.

Other Information

The panic log follows:

[682278.171925] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[682278.180027] IP: xfrm_lookup+0x2f/0x900
[682278.183984] PGD 0 P4D 0 
[682278.186723] Oops: 0000 [#1] SMP PTI
[682278.190415] Modules linked in: seqiv xt_set ip_set_hash_net ip_set iptable_raw vxlan ip6_udp_tunnel udp_tunnel xt_nat xt_mark xfrm6_mode_tunnel xfrm4_mode_tunnel esp4 drbg veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c br_netfilter bridge stp llc overlay nls_ascii nls_cp437 vfat fat mousedev psmouse i2c_piix4 sb_edac i2c_core edac_core evdev button pvpanic sch_fq_codel ext4 crc32c_generic crc16 mbcache jbd2 fscrypto btrfs xor dm_verity zstd_decompress dm_bufio zstd_compress xxhash lzo_compress raid6_pq sd_mod crc32c_intel aesni_intel aes_x86_64 virtio_scsi crypto_simd cryptd glue_helper scsi_mod nvme virtio_net nvme_core dm_mirror
[682278.261582]  dm_region_hash dm_log dm_mod dax
[682278.266138] CPU: 5 PID: 4796 Comm: redis-server Not tainted 4.14.59-coreos-r2 #1
[682278.273723] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[682278.283133] task: ffff9991aff99e00 task.stack: ffffabf8c3ec0000
[682278.289282] RIP: 0010:xfrm_lookup+0x2f/0x900
[682278.293759] RSP: 0018:ffff99923fd43c88 EFLAGS: 00010246
[682278.299180] RAX: 0000000000000000 RBX: ffff99923fd43d00 RCX: 0000000000000000
[682278.306506] RDX: ffff99923fd43d00 RSI: 0000000000000000 RDI: ffff9991b71f8000
[682278.313829] RBP: ffff9991b71f8000 R08: 0000000000000002 R09: 0000000000000018
[682278.321154] R10: 0000000000000001 R11: 00000000093ef1fa R12: 0000000000000002
[682278.328479] R13: ffff998f78b5e89c R14: ffff9991b71f8000 R15: 0000000000000000
[682278.335805] FS:  00007f1900709f00(0000) GS:ffff99923fd40000(0000) knlGS:0000000000000000
[682278.344088] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[682278.350024] CR2: 0000000000000020 CR3: 00000006dfac2000 CR4: 00000000001406e0
[682278.357349] Call Trace:
[682278.359987]  <IRQ>
[682278.362203]  __xfrm_route_forward+0x8d/0xf0
[682278.366635]  ip_forward+0x246/0x450
[682278.370322]  ? ip_rcv_finish+0xac/0x400
[682278.374361]  ip_rcv+0x287/0x3a0
[682278.377750]  ? inet_del_offload+0x40/0x40
[682278.381965]  __netif_receive_skb_core+0x432/0xb50
[682278.386863]  ? process_backlog+0x97/0x150
[682278.391066]  process_backlog+0x97/0x150
[682278.395101]  net_rx_action+0x149/0x3d0
[682278.399048]  __do_softirq+0xe7/0x2ca
[682278.402815]  do_softirq_own_stack+0x2a/0x40
[682278.407190]  </IRQ>
[682278.409481]  do_softirq.part.14+0x49/0x50
[682278.413741]  __local_bh_enable_ip+0x55/0x60
[682278.418227]  ip_finish_output2+0x189/0x3c0
[682278.422517]  ? ip_output+0x6c/0xe0
[682278.426112]  ip_output+0x6c/0xe0
[682278.429529]  ? ip_fragment.constprop.45+0x80/0x80
[682278.434427]  __tcp_transmit_skb+0x52e/0x9f0
[682278.438800]  tcp_write_xmit+0x1af/0xed0
[682278.442829]  ? _copy_from_iter_full+0x9c/0x240
[682278.447465]  __tcp_push_pending_frames+0x31/0xd0
[682278.452273]  tcp_sendmsg_locked+0xb06/0xe60
[682278.456654]  tcp_sendmsg+0x27/0x40
[682278.460251]  sock_sendmsg+0x36/0x40
[682278.463934]  sock_write_iter+0x8f/0xf0
[682278.467877]  __vfs_write+0x101/0x160
[682278.471642]  vfs_write+0xad/0x1a0
[682278.475148]  SyS_write+0x52/0xc0
[682278.478568]  do_syscall_64+0x67/0x120
[682278.482422]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[682278.487666] RIP: 0033:0x7f18ffdd81ad
[682278.491433] RSP: 002b:00007fff5448ca50 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[682278.499195] RAX: ffffffffffffffda RBX: 00007f18cbbb4e80 RCX: 00007f18ffdd81ad
[682278.506522] RDX: 0000000000000004 RSI: 00007f18cbbb5054 RDI: 00000000000000c8
[682278.513851] RBP: 0000000000000000 R08: 000000000000000f R09: 00007f18ff60e220
[682278.521177] R10: 00007f18ff8000c0 R11: 0000000000000293 R12: 000055b71460fc6c
[682278.528616] R13: 00000000000000c8 R14: 00007f18cbbb5054 R15: 0000000000000000
[682278.535945] Code: 00 41 57 41 56 49 89 f7 41 55 41 54 49 89 fe 55 53 45 89 c4 48 89 d3 48 83 ec 40 65 48 8b 04 25 28 00 00 00 48 89 44 24 38 31 c0 <48> 8b 46 20 48 85 c9 44 0f b7 10 c7 44 24 0c 00 00 00 00 0f 84 
[682278.555021] RIP: xfrm_lookup+0x2f/0x900 RSP: ffff99923fd43c88
[682278.560983] CR2: 0000000000000020
[682278.564526] ---[ end trace afbd074a4e6cec11 ]---
[682278.569339] Kernel panic - not syncing: Fatal exception in interrupt
[682278.576989] Kernel Offset: 0x3000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[682278.587887] Rebooting in 60 seconds..
[682338.711434] ACPI MEMORY or I/O RESET_REG.

lucab commented 6 years ago

@rafaelgaspar thanks for the report. Are the panic log and the os-release output coming from the same machine? The kernel trace shows 4.14.59-coreos-r2, but the os-release mentions 1800.7.0, which comes with a 4.14.63 kernel. Can you please confirm that you are seeing this on both versions?
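
For anyone comparing several traces, the kernel release can be read from the `CPU:` line of each oops. The snippet below is only a minimal sketch, not something from the original report: the function name `kernel_release_from_oops` is hypothetical, and the regular expression assumes the line format seen in the 4.14 traces in this thread.

```python
import re
from typing import Optional

# Assumed format (taken from the traces in this thread):
#   "CPU: 5 PID: 4796 Comm: redis-server Not tainted 4.14.59-coreos-r2 #1"
_CPU_LINE = re.compile(
    r"Comm: (?P<comm>.+?) (?:Not tainted|Tainted: .*?)\s+(?P<release>\S+) #\d+"
)


def kernel_release_from_oops(log: str) -> Optional[str]:
    """Return the kernel release named in the first oops CPU line, or None."""
    match = _CPU_LINE.search(log)
    return match.group("release") if match else None


if __name__ == "__main__":
    line = ("[682278.266138] CPU: 5 PID: 4796 Comm: redis-server "
            "Not tainted 4.14.59-coreos-r2 #1")
    print(kernel_release_from_oops(line))
```

Running it over each saved console log shows which nodes panicked on which kernel, independent of what `/etc/os-release` reports after the reboot.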

rafaelgaspar commented 6 years ago

@lucab Yes, it's from the same node; it probably updated when it crashed. Searching for more panic logs, I saw that other nodes also crashed on 4.14.59-coreos-r2 and rebooted into 4.14.63-coreos.

But some are still crashing on the newer kernel. Here is a panic log from another node that was already running 4.14.63-coreos, after it updated during a previous kernel panic.

[41352.088481] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[41352.096722] IP: xfrm_lookup+0x2f/0x900
[41352.100590] PGD 800000192438a067 P4D 800000192438a067 PUD 1924389067 PMD 0 
[41352.107671] Oops: 0000 [#1] SMP PTI
[41352.111267] Modules linked in: seqiv xt_set ip_set_hash_net ip_set iptable_raw vxlan ip6_udp_tunnel udp_tunnel xt_nat xt_mark xfrm6_mode_tunnel xfrm4_mode_tunnel esp4 drbg veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c br_netfilter bridge stp llc overlay nls_ascii nls_cp437 vfat fat mousedev psmouse sb_edac i2c_piix4 edac_core i2c_core evdev pvpanic button sch_fq_codel ext4 crc32c_generic crc16 mbcache jbd2 fscrypto dm_verity dm_bufio sd_mod btrfs xor zstd_decompress zstd_compress xxhash lzo_compress raid6_pq crc32c_intel aesni_intel aes_x86_64 virtio_scsi crypto_simd cryptd glue_helper scsi_mod nvme virtio_net nvme_core dm_mirror
[41352.182545]  dm_region_hash dm_log dm_mod dax
[41352.187015] CPU: 9 PID: 6020 Comm: healthcheck Not tainted 4.14.63-coreos #1
[41352.194235] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[41352.203567] task: ffff934723cf3c00 task.stack: ffffb0ef07df4000
[41352.209607] RIP: 0010:xfrm_lookup+0x2f/0x900
[41352.213984] RSP: 0000:ffff9347d7443c48 EFLAGS: 00010246
[41352.219460] RAX: 0000000000000000 RBX: ffff9347d7443cc0 RCX: 0000000000000000
[41352.226785] RDX: ffff9347d7443cc0 RSI: 0000000000000000 RDI: ffff9347241dd600
[41352.234031] RBP: ffff9347241dd600 R08: 0000000000000002 R09: 0000000000000018
[41352.241271] R10: 0000000000000001 R11: 0000000079495078 R12: 0000000000000002
[41352.248512] R13: ffff9335ff11e89c R14: ffff9347241dd600 R15: 0000000000000000
[41352.255763] FS:  00007f49ba7fc700(0000) GS:ffff9347d7440000(0000) knlGS:0000000000000000
[41352.263958] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[41352.269808] CR2: 0000000000000020 CR3: 000000192423e006 CR4: 00000000001606e0
[41352.277056] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[41352.284393] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[41352.291633] Call Trace:
[41352.294233]  <IRQ>
[41352.296367]  __xfrm_route_forward+0x8d/0xf0
[41352.300663]  ip_forward+0x246/0x450
[41352.304260]  ? ip_rcv_finish+0xac/0x400
[41352.308202]  ip_rcv+0x287/0x3a0
[41352.311487]  ? inet_del_offload+0x40/0x40
[41352.315612]  __netif_receive_skb_core+0x432/0xb50
[41352.320421]  ? process_backlog+0x97/0x150
[41352.324546]  process_backlog+0x97/0x150
[41352.328489]  net_rx_action+0x149/0x3d0
[41352.332347]  __do_softirq+0xe8/0x2cf
[41352.336030]  irq_exit+0xc2/0xd0
[41352.339276]  do_IRQ+0x89/0xd0
[41352.342350]  common_interrupt+0x7d/0x7d
[41352.346290]  </IRQ>
[41352.348498] RIP: 0033:0x53a116
[41352.351656] RSP: 002b:000000c4203b5898 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffc8
[41352.359333] RAX: 000000c42005e120 RBX: 0000000000000034 RCX: 000000c420226000
[41352.366606] RDX: 0000000000944448 RSI: 0000000000100eda RDI: 0000000000000000
[41352.373850] RBP: 000000c4203b58b0 R08: 0000000000000000 R09: 0000000000000000
[41352.381087] R10: 0000000000000003 R11: 0000000000000199 R12: 0055555555555555
[41352.388329] R13: 00000000000000f5 R14: 0000000000000075 R15: 0000000000000000
[41352.395623] Code: 00 41 57 41 56 49 89 f7 41 55 41 54 49 89 fe 55 53 45 89 c4 48 89 d3 48 83 ec 40 65 48 8b 04 25 28 00 00 00 48 89 44 24 38 31 c0 <48> 8b 46 20 48 85 c9 44 0f b7 10 c7 44 24 0c 00 00 00 00 0f 84 
[41352.414742] RIP: xfrm_lookup+0x2f/0x900 RSP: ffff9347d7443c48
[41352.420602] CR2: 0000000000000020
[41352.424036] ---[ end trace 5acadd3621d299c6 ]---
[41352.428768] Kernel panic - not syncing: Fatal exception in interrupt
[41352.436420] Kernel Offset: 0x37000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[41352.447318] Rebooting in 60 seconds..
[41412.558921] ACPI MEMORY or I/O RESET_REG.
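
Both traces fault at the same instruction, so it may help to decode the byte marked `<48>` on the `Code:` line. The following is a rough sketch rather than a general disassembler: the helper names are hypothetical, and it handles only the single encoding that appears here (`REX.W 8B /r` with a register base and an 8-bit displacement).

```python
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def faulting_bytes(code_line: str) -> bytes:
    """Return the instruction bytes starting at the <..>-marked byte."""
    tail = code_line.split("<", 1)[1].replace(">", "")
    return bytes.fromhex("".join(tail.split()))


def decode_mov_load_disp8(insn: bytes):
    """Decode only `mov r64, [base + disp8]`; raise ValueError otherwise."""
    if len(insn) < 4 or insn[0] != 0x48 or insn[1] != 0x8B:
        raise ValueError("not a REX.W 8B load")
    modrm = insn[2]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm == 4:  # rm == 4 would mean a SIB byte follows
        raise ValueError("unsupported addressing mode")
    disp = int.from_bytes(insn[3:4], "little", signed=True)
    return REG64[reg], REG64[rm], disp


if __name__ == "__main__":
    code = "Code: 48 89 44 24 38 31 c0 <48> 8b 46 20 48 85 c9 44 0f b7 10"
    dest, base, disp = decode_mov_load_disp8(faulting_bytes(code))
    print(f"mov {dest}, [{base}+{disp:#x}]")
```

The marked bytes `48 8b 46 20` decode to a load from `[rsi+0x20]`. With `RSI: 0000000000000000` in the register dump, that address is 0x20, which matches `CR2: 0000000000000020`. Under the x86-64 System V calling convention RSI holds the second function argument, so the traces are consistent with `xfrm_lookup` being called from `__xfrm_route_forward` with a NULL second argument.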

bgilbert commented 5 years ago

@rafaelgaspar Are you still seeing this on current versions of Container Linux?