amzn / amzn-drivers

Official AWS drivers repository for Elastic Network Adapter (ENA) and Elastic Fabric Adapter (EFA)

[Bug]: Unexpected kernel dump on 6.9.9 - g5.4xlarge #315

Open aledbf opened 1 month ago

aledbf commented 1 month ago

Driver Type

Linux kernel driver for Elastic Network Adapter (ENA)

Driver Tag/Commit

ena_linux_2.12.2

Custom Code

No

OS Platform and Distribution

Ubuntu Linux 24.04, kernel 6.9.9

Bug description


invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 15 PID: 396 Comm: systemd-network Tainted: G           O       6.9.9 #1
Hardware name: Amazon EC2 g5.4xlarge/, BIOS 1.0 10/16/2017
RIP: 0010:ena_lpc_get_page+0xd7/0xe0 [ena]
Code: 24 83 43 04 01 48 83 87 f0 01 00 00 01 48 8b 43 10 48 83 c4 08 5b 5d e9 47 97 a0 c1 83 c0 01 31 d2 f7 f6 89 13 e9 70 ff ff ff <0f> 0b 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 0018:ffffc900005c7308 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffffc9000052d000 RCX: 0000000000000800
RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff888105a129e0
RBP: ffffc900005c7330 R08: ffffc9000052d010 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
R13: 0000000000000040 R14: ffff888105a129e0 R15: 0000000000000001
FS:  00007f994e220440(0000) GS:ffff888fd3b80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f994e4e3320 CR3: 00000001020ea000 CR4: 00000000003506f0
Call Trace:
 <TASK>
 ? die+0x31/0x80
 ? do_trap+0xd8/0x100
 ? ena_lpc_get_page+0xd7/0xe0 [ena]
 ? do_error_trap+0x65/0x90
 ? ena_lpc_get_page+0xd7/0xe0 [ena]
 ? exc_invalid_op+0x51/0x70
 ? ena_lpc_get_page+0xd7/0xe0 [ena]
 ? asm_exc_invalid_op+0x1a/0x20
 ? ena_lpc_get_page+0xd7/0xe0 [ena]
 ? ena_lpc_get_page+0x92/0xe0 [ena]
 ena_refill_rx_bufs+0x84/0x240 [ena]
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 2 PID: 395 Comm: systemd-network Tainted: G           O       6.9.9 #1
Hardware name: Amazon EC2 g5.4xlarge/, BIOS 1.0 10/16/2017
RIP: 0010:ena_lpc_get_page+0xd7/0xe0 [ena]
Code: 24 83 43 04 01 48 83 87 f0 01 00 00 01 48 8b 43 10 48 83 c4 08 5b 5d e9 47 c7 90 c1 83 c0 01 31 d2 f7 f6 89 13 e9 70 ff ff ff <0f> 0b 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90
nvidia: module license 'NVIDIA' taints kernel.
RSP: 0018:ffffc90000757310 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffffc900067ed000 RCX: 0000000000000800
RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff8881029d29e0
RBP: ffffc90000757338 R08: ffffc900067ed010 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
nvidia: module license taints kernel.
R13: 0000000000000040 R14: ffff8881029d29e0 R15: 0000000000000001
FS:  00007ff6b8599440(0000) GS:ffff888fd3500000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff6b82e3320 CR3: 00000001065f0000 CR4: 00000000003506f0
Call Trace:
 <TASK>
 ? die+0x31/0x80
 ? do_trap+0xd8/0x100
 ? ena_lpc_get_page+0xd7/0xe0 [ena]
 ? do_error_trap+0x65/0x90
 ? ena_lpc_get_page+0xd7/0xe0 [ena]
 ? exc_invalid_op+0x51/0x70
 ? ena_lpc_get_page+0xd7/0xe0 [ena]
 ? asm_exc_invalid_op+0x1a/0x20
 ? ena_lpc_get_page+0xd7/0xe0 [ena]
 ena_refill_rx_bufs+0x84/0x240 [ena]
 ena_up+0xaaf/0xf20 [ena]
 __dev_open+0xf8/0x1b0
 __dev_change_flags+0x1be/0x240
 ? srso_return_thunk+0x5/0x5f
 ? __alloc_pages+0x21c/0x10a0
 dev_change_flags+0x21/0x60
 do_setlink+0xdf9/0x11d0
 ? __nla_reserve+0x3c/0x50
 ? srso_return_thunk+0x5/0x5f
 ? __nla_put+0x10/0x30
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 ? __nla_reserve+0x3c/0x50
 ? srso_return_thunk+0x5/0x5f
 ? __nla_validate_parse+0x61/0xd50
 ? rt6_fill_node.isra.0+0x30c/0x7f0 [ipv6]
 ? srso_return_thunk+0x5/0x5f
 rtnl_setlink+0xe2/0x170
 ? srso_return_thunk+0x5/0x5f
 ? generic_permission+0x34/0x220
 rtnetlink_rcv_msg+0x166/0x400
 ? srso_return_thunk+0x5/0x5f
 ? generic_permission+0x34/0x220
 ? inode_permission+0x38/0x170
 ? __pfx_rtnetlink_rcv_msg+0x10/0x10
 netlink_rcv_skb+0x56/0x100
 netlink_unicast+0x197/0x280
 netlink_sendmsg+0x1ee/0x450
 __sys_sendto+0x176/0x180
 ? srso_return_thunk+0x5/0x5f
 __x64_sys_sendto+0x1f/0x30
 do_syscall_64+0xc0/0x1a0
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 ? set_pte_range+0xea/0x2c0
 ? srso_return_thunk+0x5/0x5f
 ? next_uptodate_folio+0x94/0x260
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 ? filemap_map_pages+0x43c/0x520
 ? clear_inode+0x2a/0x80
 ? shmem_evict_inode+0x85/0x290
 ? srso_return_thunk+0x5/0x5f
 ? do_fault+0x239/0x450
 ? srso_return_thunk+0x5/0x5f
 ? __handle_mm_fault+0x704/0x890
 ? srso_return_thunk+0x5/0x5f
 ? __count_memcg_events+0x5b/0xd0
 ? srso_return_thunk+0x5/0x5f
 ? count_memcg_events.constprop.0+0x1a/0x30
 ? srso_return_thunk+0x5/0x5f
 ? handle_mm_fault+0xaa/0x2d0
 ? srso_return_thunk+0x5/0x5f
 ? do_user_addr_fault+0x311/0x630
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 ? irqentry_exit_to_user_mode+0x5a/0x110
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff6b7f2c0a7
Code: c7 c0 ff ff ff ff eb be 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 80 3d a5 ef 0d 00 00 41 89 ca 74 10 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 69 c3 55 48 89 e5 53 48 83 ec 38 44 89 4d d0
RSP: 002b:00007ffd7105d328 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 000055d26fc35930 RCX: 00007ff6b7f2c0a7
RDX: 0000000000000020 RSI: 000055d26fc4f8e0 RDI: 0000000000000003
RBP: 00007ffd7105d3c0 R08: 00007ffd7105d330 R09: 0000000000000080
R10: 0000000000000000 R11: 0000000000000202 R12: 000055d26fc4a660
R13: 000055d26fc4aad8 R14: 0000000000000000 R15: 000055d26fc4aa90
 </TASK>
Modules linked in: intel_agp intel_gtt drm_kms_helper ena(O) agpgart nf_conntrack libcrc32c nf_defrag_ipv6 nf_defrag_ipv4 overlay br_netfilter bridge stp llc drm drm_panel_orientation_quirks ipv6
---[ end trace 0000000000000000 ]---
RIP: 0010:ena_lpc_get_page+0xd7/0xe0 [ena]
Code: 24 83 43 04 01 48 83 87 f0 01 00 00 01 48 8b 43 10 48 83 c4 08 5b 5d e9 47 c7 90 c1 83 c0 01 31 d2 f7 f6 89 13 e9 70 ff ff ff <0f> 0b 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 0018:ffffc90000757310 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffffc900067ed000 RCX: 0000000000000800
RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff8881029d29e0
RBP: ffffc90000757338 R08: ffffc900067ed010 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
R13: 0000000000000040 R14: ffff8881029d29e0 R15: 0000000000000001
FS:  00007ff6b8599440(0000) GS:ffff888fd3500000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff6b82e3320 CR3: 00000001065f0000 CR4: 00000000003506f0
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
Rebooting in 10 seconds..
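
For triage, the key fields of the oops above can be pulled out mechanically. The following is a minimal sketch, not part of any existing tool: the function name `summarize_oops` and the shortened `sample` text are hypothetical. It extracts the trap type, the kernel-mode RIP symbol, and the two bytes starting at the `<..>` marker in the `Code:` line. On x86-64 the bytes `0f 0b` encode the `ud2` instruction, which raises the "invalid opcode" exception; the kernel uses it for `BUG()`-style assertions and compilers emit it for inserted traps, so this oops most likely reflects a deliberate check firing inside `ena_lpc_get_page`.

```python
import re

def summarize_oops(text):
    """Summarize the first oops found in a kernel log excerpt."""
    # Trap header, e.g. "invalid opcode: 0000 [#1] PREEMPT SMP NOPTI"
    trap = re.search(r"^(.*?): \d{4} \[#\d+\]", text, re.M)
    # Kernel-mode RIP lines use code segment 0010 (user-mode ones use 0033)
    rip = re.search(r"^RIP: 0010:(\S+)", text, re.M)
    # The byte wrapped in <..> is the one RIP points at; also take the next byte
    code = re.search(r"^Code: .*?<([0-9a-f]{2})>\s+([0-9a-f]{2})", text, re.M)
    faulting = f"{code.group(1)} {code.group(2)}" if code else None
    return {
        "trap": trap.group(1) if trap else None,
        "rip": rip.group(1) if rip else None,
        "faulting_bytes": faulting,
        "is_ud2": faulting == "0f 0b",  # 0f 0b is the x86 ud2 encoding
    }

# Shortened excerpt of the log above, used only for illustration
sample = """invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
RIP: 0010:ena_lpc_get_page+0xd7/0xe0 [ena]
Code: e9 70 ff ff ff <0f> 0b 0f 1f 80 00
"""
print(summarize_oops(sample))
```

Feeding the full console log to `summarize_oops` instead of `sample` reports the first oops only; later, interleaved oopses from other CPUs are ignored.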

Reproduction steps

1. Boot an EC2 g5.4xlarge instance running kernel 6.9.9 with the ENA driver ena_linux_2.12.2.

Expected Behavior

Normal boot

Actual Behavior

Kernel panic at boot: invalid opcode in ena_lpc_get_page

Additional Data

No response

Relevant log output

No response

Contact Details

aledbf@gmail.com

akiyano commented 1 month ago

Hi @aledbf,

Could you please give a bit more detail on the reproduction steps? The current default Ubuntu AMI comes with kernel 6.8.0-1009-aws.

  1. How did you update the kernel? Did you build it yourself from source code?
  2. You say that you see this call stack at boot and that the driver version is ena_linux_2.12.2. Did you install the driver using DKMS, as instructed in https://github.com/amzn/amzn-drivers/blob/master/kernel/linux/ena/README.rst?

The more detail you can give on how you arrived at your setup, from start to finish, the easier it will be for us to reproduce the issue and provide support.

Thanks! Arthur
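
The version details asked about above can be collected with a short script. This is a minimal sketch under stated assumptions: `collect_env` is a hypothetical helper name, and it assumes the loaded driver exposes its version at `/sys/module/ena/version`, which is only present when the module declares a version string. If the file is missing, the function reports `None` rather than guessing.

```python
import platform
from pathlib import Path

def collect_env(sys_module_dir="/sys/module/ena"):
    """Return the running kernel release and the loaded ena module version, if exposed."""
    # Assumption: the module publishes its version string in this sysfs file
    version_file = Path(sys_module_dir) / "version"
    ena_version = version_file.read_text().strip() if version_file.is_file() else None
    return {"kernel": platform.release(), "ena_version": ena_version}

print(collect_env())
```

Pasting the printed dictionary into the issue, together with a note on whether the kernel was self-built and how the driver was installed, covers both questions.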

akiyano commented 1 month ago

Hi @aledbf, please try the following patch and let us know whether it solves your issue: 0001-workaround-lpc-flexible-array.patch

aledbf commented 1 month ago

@akiyano Thank you for the quick response. Yes, the patch fixes the issue. 🙏

akiyano commented 1 month ago

Great. I'm leaving the issue open until we release the fix in an upcoming driver version.

davidarinzon commented 2 days ago

Hi @aledbf, the issue is expected to be resolved in the latest version, https://github.com/amzn/amzn-drivers/releases/tag/ena_linux_2.13.0. Please let us know if you still experience it.