awslabs / amazon-eks-ami

Packer configuration for building a custom EKS AMI
https://awslabs.github.io/amazon-eks-ami/
MIT No Attribution

Kernel panic #812

Closed: aweeks closed this issue 2 years ago

aweeks commented 3 years ago

What happened:

Kernel panic while running 5.4.110-54.189.amzn2.x86_64:

[558161.617047] general protection fault: 0000 [#1] SMP PTI
[558161.620684] CPU: 46 PID: 13911 Comm: kubelet Not tainted 5.4.110-54.189.amzn2.x86_64 #1
[558161.626782] Hardware name: Amazon EC2 r5.12xlarge/, BIOS 1.0 10/16/2017
[558161.630944] RIP: 0010:string_nocheck+0xf/0x60
[558161.634169] Code: 00 48 89 ef e8 52 90 00 00 4c 01 e3 e9 7a ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 48 89 c8 49 89 f1 48 c1 f8 30 66 85 c0 74 42 <44> 0f b6 02 45 84 c0 74 39 83 e8 01 4c 8d 54 07 01 b8 01 00 00 00
[558161.645783] RSP: 0018:ffffc900013a7c80 EFLAGS: 00010086
[558161.649442] RAX: ffffffffffffffff RBX: ffff88d713610000 RCX: ffff0a00ffffff04
[558161.655181] RDX: 7430f4fc1055a52b RSI: ffff88d713610000 RDI: ffff88d71360f054
[558161.660982] RBP: 7430f4fc1055a52b R08: 0000000000000fac R09: ffff88d713610000
[558161.666795] R10: ffffc900013a7d80 R11: ffff88d71360f053 R12: ffff0a00ffffff04
[558161.672575] R13: 0000000000000fac R14: ffffc900013a7d18 R15: ffffffff81e8bfa5
[558161.678335] FS:  00007fa097fff700(0000) GS:ffff88ddbbd80000(0000) knlGS:0000000000000000
[558161.684445] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[558161.688290] CR2: 00007fc4ad665fb8 CR3: 0000005d87bba004 CR4: 00000000007606e0
[558161.694156] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[558161.699960] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[558161.705780] PKRU: 55555554
[558161.708424] Call Trace:
[558161.711005]  string+0x40/0x50
[558161.713724]  vsnprintf+0x410/0x4d0
[558161.716599]  seq_vprintf+0x30/0x50
[558161.719523]  seq_printf+0x4e/0x70
[558161.722389]  __blkg_prfill_rwstat+0x5b/0xb0
[558161.725591]  blkg_prfill_rwstat_field+0x96/0xc0
[558161.728942]  ? blkg_prfill_rwstat+0xc0/0xc0
[558161.732197]  blkcg_print_blkgs+0x92/0xd0
[558161.735307]  blkg_print_stat_bytes+0x3f/0x50
[558161.738564]  seq_read+0xd8/0x400
[558161.741393]  vfs_read+0x89/0x130
[558161.744208]  ksys_read+0xa1/0xe0
[558161.747060]  do_syscall_64+0x4e/0x100
[558161.848177]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[558161.851946] RIP: 0033:0x49369b
[558161.855171] Code: fe ff eb bd e8 a6 e8 fd ff e9 61 ff ff ff cc e8 3b b2 fd ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
[558161.870468] RSP: 002b:000000c00605d960 EFLAGS: 00000202 ORIG_RAX: 0000000000000000
[558161.877852] RAX: ffffffffffffffda RBX: 000000c00005f000 RCX: 000000000049369b
[558161.883559] RDX: 0000000000001000 RSI: 000000c006110000 RDI: 000000000000003c
[558161.889498] RBP: 000000c00605d9b0 R08: 0000000000000001 R09: 0000000000000002
[558161.895291] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
[558161.901126] R13: 0000000000000002 R14: 0000000000000002 R15: 0000000000000002
[558161.906953] Modules linked in: ext4 crc16 mbcache jbd2 xt_REDIRECT xt_owner iptable_raw xt_CT rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xt_multiport veth xt_connmark nf_conntrack_netlink nfnetlink xt_nat xt_statistic ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs dummy iptable_mangle xt_MASQUERADE xt_conntrack xt_comment xt_mark xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter br_netfilter bridge stp llc overlay sunrpc crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd mousedev cryptd ena glue_helper button psmouse evdev ip_tables x_tables xfs libcrc32c nvme crc32c_intel nvme_core ipv6 crc_ccitt autofs4
[558161.945351] ---[ end trace 03f3e7a18186276e ]---
[558161.948766] RIP: 0010:string_nocheck+0xf/0x60
[558161.952064] Code: 00 48 89 ef e8 52 90 00 00 4c 01 e3 e9 7a ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 48 89 c8 49 89 f1 48 c1 f8 30 66 85 c0 74 42 <44> 0f b6 02 45 84 c0 74 39 83 e8 01 4c 8d 54 07 01 b8 01 00 00 00
[558161.963654] RSP: 0018:ffffc900013a7c80 EFLAGS: 00010086
[558161.967272] RAX: ffffffffffffffff RBX: ffff88d713610000 RCX: ffff0a00ffffff04
[558161.973072] RDX: 7430f4fc1055a52b RSI: ffff88d713610000 RDI: ffff88d71360f054
[558161.978950] RBP: 7430f4fc1055a52b R08: 0000000000000fac R09: ffff88d713610000
[558161.984721] R10: ffffc900013a7d80 R11: ffff88d71360f053 R12: ffff0a00ffffff04
[558161.990598] R13: 0000000000000fac R14: ffffc900013a7d18 R15: ffffffff81e8bfa5
[558161.996394] FS:  00007fa097fff700(0000) GS:ffff88ddbbd80000(0000) knlGS:0000000000000000
[558162.002560] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[558162.006382] CR2: 00007fc4ad665fb8 CR3: 0000005d87bba004 CR4: 00000000007606e0
[558162.012248] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[558162.018036] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[558162.023806] PKRU: 55555554
[558162.026464] Kernel panic - not syncing: Fatal exception
[558162.030582] Kernel Offset: disabled
[558162.033482] Rebooting in 10 seconds..

Per the kernel oops above, it appears that a syscall from kubelet led to the panic, but beyond that I don't have much more insight.
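To make that reading of the oops concrete, here is a minimal Python sketch that pulls the process name and the reliable call-trace frames out of an excerpt of the log above. The function name `summarize_oops` and the regex are illustrative assumptions, not part of any kernel tooling; the excerpt is copied from the trace in this issue.

```python
import re

# Excerpt copied from the oops above (innermost frame first).
OOPS_EXCERPT = """\
[558161.620684] CPU: 46 PID: 13911 Comm: kubelet Not tainted 5.4.110-54.189.amzn2.x86_64 #1
[558161.708424] Call Trace:
[558161.711005]  string+0x40/0x50
[558161.713724]  vsnprintf+0x410/0x4d0
[558161.716599]  seq_vprintf+0x30/0x50
[558161.719523]  seq_printf+0x4e/0x70
[558161.722389]  __blkg_prfill_rwstat+0x5b/0xb0
[558161.725591]  blkg_prfill_rwstat_field+0x96/0xc0
[558161.728942]  ? blkg_prfill_rwstat+0xc0/0xc0
[558161.732197]  blkcg_print_blkgs+0x92/0xd0
[558161.735307]  blkg_print_stat_bytes+0x3f/0x50
[558161.738564]  seq_read+0xd8/0x400
[558161.741393]  vfs_read+0x89/0x130
[558161.744208]  ksys_read+0xa1/0xe0
[558161.747060]  do_syscall_64+0x4e/0x100
[558161.848177]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[558161.851946] RIP: 0033:0x49369b
"""

# Matches "[ts]  symbol+0xoff/0xsize", with an optional "? " marker.
FRAME = re.compile(r"^\[\s*\d+\.\d+\]\s+(\? )?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_oops(text):
    """Return (process name, list of call-trace symbols, innermost first)."""
    comm = re.search(r"Comm: (\S+)", text).group(1)
    frames = []
    in_trace = False
    for line in text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            m = FRAME.match(line)
            if not m:
                break  # first non-frame line ends the trace
            if m.group(1) is None:  # skip "?" frames; the unwinder marks them unreliable
                frames.append(m.group(2))
    return comm, frames


if __name__ == "__main__":
    comm, frames = summarize_oops(OOPS_EXCERPT)
    print(comm)
    print(" <- ".join(frames))
```

Read outermost to innermost, the frames show a `read` syscall from kubelet going through `seq_read` into the blkcg stat printing code, and faulting while formatting a string.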

After digging into the stack a little more, one possibility is that dname (returned by blkg_dev_name()) was a bad pointer: link. When later dereferenced as part of the %s formatting in seq_printf, it could have triggered the general protection fault.

Interestingly, it would not have been a null pointer, as that is explicitly checked here.
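The register dump seems consistent with the bad-pointer theory. By my reading of the `Code:` bytes, the faulting instruction (`<44> 0f b6 02`) is a byte load from the address in RDX, and RDX holds `7430f4fc1055a52b`. Assuming 48-bit virtual addresses (4-level paging), that value is not a canonical x86-64 address, and a load from a non-canonical address raises a general protection fault rather than a page fault. A small Python sketch of the check, where `is_canonical` is a hypothetical helper written for this comment:

```python
def is_canonical(addr, va_bits=48):
    """Return True if bits 63..(va_bits - 1) of a 64-bit address are all equal.

    va_bits=48 is an assumption (4-level paging); 5-level paging would use 57.
    """
    top = addr >> (va_bits - 1)                # bits 63..47 when va_bits is 48
    all_ones = (1 << (64 - va_bits + 1)) - 1   # 17 one-bits when va_bits is 48
    return top == 0 or top == all_ones


# Values copied from the register dump in the oops above.
RDX = 0x7430F4FC1055A52B  # address being loaded from at the faulting instruction
RSI = 0xFFFF88D713610000  # a kernel pointer from the same dump, for comparison

if __name__ == "__main__":
    print(hex(RDX), is_canonical(RDX))
    print(hex(RSI), is_canonical(RSI))
```

A non-null garbage value like this would pass the null check and only fault once it is dereferenced. It does not say where the value came from, for example a freed or overwritten structure.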

I looked through the kernel bug tracker and was not able to find any bugs that seemed related.

What you expected to happen:

No kernel panic.

How to reproduce it (as minimally and precisely as possible):

Unfortunately, I do not have a repro; this has only occurred once in our clusters.

Anything else we need to know?:

Environment:

cartermckinnon commented 2 years ago

@aweeks have you observed this on any newer kernel version, or more than once on 5.4.110-54.189.amzn2.x86_64?

cartermckinnon commented 2 years ago

Without steps to reproduce, there's probably not much we can do here. There have been many revisions to the kernel and several to the kubelet since this occurred, so please update to the latest AMI and create a new issue (referencing this one) if you observe this again.