Open rafaelgaspar opened 6 years ago
@rafaelgaspar thanks for the report. Are the panic log and the os-release coming from the same machine? The kernel trace shows 4.14.59-coreos-r2, but the os-release mentions 1800.7.0, which comes with a 4.14.63 kernel. Can you please confirm that you are seeing this on both versions?
@lucab yes, it's from the same node; it probably updated during the crash. Searching for more panic logs, I saw that other nodes also crashed on 4.14.59-coreos-r2 and rebooted on 4.14.63-coreos.
But some are still crashing on the newer kernel. Here is a panic log from another node that was already running 4.14.63-coreos, having updated during a previous kernel panic:
[41352.088481] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[41352.096722] IP: xfrm_lookup+0x2f/0x900
[41352.100590] PGD 800000192438a067 P4D 800000192438a067 PUD 1924389067 PMD 0
[41352.107671] Oops: 0000 [#1] SMP PTI
[41352.111267] Modules linked in: seqiv xt_set ip_set_hash_net ip_set iptable_raw vxlan ip6_udp_tunnel udp_tunnel xt_nat xt_mark xfrm6_mode_tunnel xfrm4_mode_tunnel esp4 drbg veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c br_netfilter bridge stp llc overlay nls_ascii nls_cp437 vfat fat mousedev psmouse sb_edac i2c_piix4 edac_core i2c_core evdev pvpanic button sch_fq_codel ext4 crc32c_generic crc16 mbcache jbd2 fscrypto dm_verity dm_bufio sd_mod btrfs xor zstd_decompress zstd_compress xxhash lzo_compress raid6_pq crc32c_intel aesni_intel aes_x86_64 virtio_scsi crypto_simd cryptd glue_helper scsi_mod nvme virtio_net nvme_core dm_mirror
[41352.182545] dm_region_hash dm_log dm_mod dax
[41352.187015] CPU: 9 PID: 6020 Comm: healthcheck Not tainted 4.14.63-coreos #1
[41352.194235] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[41352.203567] task: ffff934723cf3c00 task.stack: ffffb0ef07df4000
[41352.209607] RIP: 0010:xfrm_lookup+0x2f/0x900
[41352.213984] RSP: 0000:ffff9347d7443c48 EFLAGS: 00010246
[41352.219460] RAX: 0000000000000000 RBX: ffff9347d7443cc0 RCX: 0000000000000000
[41352.226785] RDX: ffff9347d7443cc0 RSI: 0000000000000000 RDI: ffff9347241dd600
[41352.234031] RBP: ffff9347241dd600 R08: 0000000000000002 R09: 0000000000000018
[41352.241271] R10: 0000000000000001 R11: 0000000079495078 R12: 0000000000000002
[41352.248512] R13: ffff9335ff11e89c R14: ffff9347241dd600 R15: 0000000000000000
[41352.255763] FS: 00007f49ba7fc700(0000) GS:ffff9347d7440000(0000) knlGS:0000000000000000
[41352.263958] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[41352.269808] CR2: 0000000000000020 CR3: 000000192423e006 CR4: 00000000001606e0
[41352.277056] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[41352.284393] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[41352.291633] Call Trace:
[41352.294233] <IRQ>
[41352.296367] __xfrm_route_forward+0x8d/0xf0
[41352.300663] ip_forward+0x246/0x450
[41352.304260] ? ip_rcv_finish+0xac/0x400
[41352.308202] ip_rcv+0x287/0x3a0
[41352.311487] ? inet_del_offload+0x40/0x40
[41352.315612] __netif_receive_skb_core+0x432/0xb50
[41352.320421] ? process_backlog+0x97/0x150
[41352.324546] process_backlog+0x97/0x150
[41352.328489] net_rx_action+0x149/0x3d0
[41352.332347] __do_softirq+0xe8/0x2cf
[41352.336030] irq_exit+0xc2/0xd0
[41352.339276] do_IRQ+0x89/0xd0
[41352.342350] common_interrupt+0x7d/0x7d
[41352.346290] </IRQ>
[41352.348498] RIP: 0033:0x53a116
[41352.351656] RSP: 002b:000000c4203b5898 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffc8
[41352.359333] RAX: 000000c42005e120 RBX: 0000000000000034 RCX: 000000c420226000
[41352.366606] RDX: 0000000000944448 RSI: 0000000000100eda RDI: 0000000000000000
[41352.373850] RBP: 000000c4203b58b0 R08: 0000000000000000 R09: 0000000000000000
[41352.381087] R10: 0000000000000003 R11: 0000000000000199 R12: 0055555555555555
[41352.388329] R13: 00000000000000f5 R14: 0000000000000075 R15: 0000000000000000
[41352.395623] Code: 00 41 57 41 56 49 89 f7 41 55 41 54 49 89 fe 55 53 45 89 c4 48 89 d3 48 83 ec 40 65 48 8b 04 25 28 00 00 00 48 89 44 24 38 31 c0 <48> 8b 46 20 48 85 c9 44 0f b7 10 c7 44 24 0c 00 00 00 00 0f 84
[41352.414742] RIP: xfrm_lookup+0x2f/0x900 RSP: ffff9347d7443c48
[41352.420602] CR2: 0000000000000020
[41352.424036] ---[ end trace 5acadd3621d299c6 ]---
[41352.428768] Kernel panic - not syncing: Fatal exception in interrupt
[41352.436420] Kernel Offset: 0x37000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[41352.447318] Rebooting in 60 seconds..
[41412.558921] ACPI MEMORY or I/O RESET_REG.
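For anyone comparing crashes across several nodes, here is a minimal sketch of how the kernel version, faulting symbol, and fault address could be pulled out of each saved panic log and counted per kernel. It assumes the logs are available as plain text in the format shown above; the function names (`summarize`, `tally`) and the regular expressions are illustrative assumptions, not part of any Container Linux tooling.

```python
import re
from collections import Counter

# Assumed patterns, derived from the oops format in the log above.
KERNEL_RE = re.compile(r"Comm: .*? (\d+\.\d+\.\d+\S*) #\d+")
SYMBOL_RE = re.compile(r"RIP: 0010:(\w+)\+")
CR2_RE = re.compile(r"CR2: ([0-9a-f]+)")


def summarize(log_text):
    """Return (kernel version, faulting symbol, fault address) for one oops.

    Fields that cannot be found are returned as None.
    """
    kernel = KERNEL_RE.search(log_text)
    symbol = SYMBOL_RE.search(log_text)
    cr2 = CR2_RE.search(log_text)
    return (
        kernel.group(1) if kernel else None,
        symbol.group(1) if symbol else None,
        int(cr2.group(1), 16) if cr2 else None,
    )


def tally(logs):
    """Count crashes per (kernel version, faulting symbol) pair."""
    return Counter(summarize(text)[:2] for text in logs)


# Lines copied from the panic log above.
SAMPLE = """\
[41352.088481] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[41352.187015] CPU: 9 PID: 6020 Comm: healthcheck Not tainted 4.14.63-coreos #1
[41352.209607] RIP: 0010:xfrm_lookup+0x2f/0x900
[41352.269808] CR2: 0000000000000020 CR3: 000000192423e006 CR4: 00000000001606e0
"""

if __name__ == "__main__":
    kernel, symbol, address = summarize(SAMPLE)
    print(kernel, symbol, hex(address))
    print(tally([SAMPLE]))
```

Passing one string per node to `tally` gives a quick view of whether the same symbol faults on both 4.14.59-coreos-r2 and 4.14.63-coreos.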
@rafaelgaspar Are you still seeing this on current versions of Container Linux?
Issue Report
Kernel Panic on GCE with IPsec
Container Linux Version
Environment
Multiple machine types, from 2 to 16 cores, on Google Compute Engine (GCE), with and without local NVMe SSDs, all running Rancher with IPsec.
Expected Behavior
To respect Douglas Adams and not panic.
Actual Behavior
Multiple kernel panics all across the cluster.
Other Information
The panic log follows: