amazonlinux / amazon-linux-2023

Amazon Linux 2023
https://aws.amazon.com/linux/amazon-linux-2023/

[Bug] - Kernel 6.1.41-63.114.amzn2023.x86_64 NULL pointer dereference #537

Closed: gamedebug closed this issue 8 months ago

gamedebug commented 8 months ago

Describe the bug: The instance restarted after a kernel NULL pointer dereference.

Info:

Instance type: g5.2xlarge
AMI: Amazon Linux 2023 AMI 2023.1.20230825.0 x86_64 HVM kernel-6.1
Current Kernel version: 6.1.41-63.114.amzn2023.x86_64

Instance console:

Amazon Linux 2023
Kernel 6.1.41-63.114.amzn2023.x86_64 on an x86_64 (-)

ip-172-31-78-53 login: [69963.965912] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[69964.000859] nvidia-uvm: Loaded the UVM driver, major device number 244.
[474945.793180] systemd-journald[1183]: Data hash table of /var/log/journal/ec28b77d9053d8c84fe096ba454ef598/system.journal has a fill level at 75.0 (174764 of 233016 items, 50331648 file size, 287 bytes per hash table item), suggesting rotation.
[474945.795401] systemd-journald[1183]: /var/log/journal/ec28b77d9053d8c84fe096ba454ef598/system.journal: Journal header limits reached or header out-of-date, rotating.
[912218.275502] systemd-journald[1183]: Data hash table of /var/log/journal/ec28b77d9053d8c84fe096ba454ef598/system.journal has a fill level at 75.0 (174764 of 233016 items, 50331648 file size, 287 bytes per hash table item), suggesting rotation.
[912218.277881] systemd-journald[1183]: /var/log/journal/ec28b77d9053d8c84fe096ba454ef598/system.journal: Journal header limits reached or header out-of-date, rotating.
[1387769.774140] systemd-journald[1183]: Data hash table of /var/log/journal/ec28b77d9053d8c84fe096ba454ef598/system.journal has a fill level at 75.0 (174765 of 233016 items, 50331648 file size, 287 bytes per hash table item), suggesting rotation.
[1387769.776531] systemd-journald[1183]: /var/log/journal/ec28b77d9053d8c84fe096ba454ef598/system.journal: Journal header limits reached or header out-of-date, rotating.
[1744943.756442] systemd-journald[1183]: Data hash table of /var/log/journal/ec28b77d9053d8c84fe096ba454ef598/system.journal has a fill level at 75.0 (174764 of 233016 items, 58720256 file size, 335 bytes per hash table item), suggesting rotation.
[1744943.758771] systemd-journald[1183]: /var/log/journal/ec28b77d9053d8c84fe096ba454ef598/system.journal: Journal header limits reached or header out-of-date, rotating.
[1814490.635745] BUG: kernel NULL pointer dereference, address: 00000000000000f6
[1814490.636578] #PF: supervisor read access in kernel mode
[1814490.637194] #PF: error_code(0x0000) - not-present page
[1814490.637806] PGD 0 P4D 0 
[1814490.638133] Oops: 0000 [#1] PREEMPT SMP NOPTI
[1814490.638657] CPU: 2 PID: 932906 Comm: systemd-userwor Tainted: P           OE      6.1.41-63.114.amzn2023.x86_64 #1
[1814490.639822] Hardware name: Amazon EC2 g5.2xlarge/, BIOS 1.0 10/16/2017
[1814490.640573] RIP: 0010:__filemap_get_folio+0xab/0x330
[1814490.641171] Code: 10 e8 e9 59 75 00 49 89 c4 48 3d 06 04 00 00 74 e2 48 3d 02 04 00 00 74 da 48 85 c0 0f 84 2e 02 00 00 a8 01 0f 85 e3 00 00 00 <8b> 40 34 85 c0 74 c2 8d 50 01 4d 8d 7c 24 34 f0 41 0f b1 54 24 34
[1814490.643220] RSP: 0000:ffff9eec8fd07c50 EFLAGS: 00010246
[1814490.643827] RAX: 00000000000000c2 RBX: 0000000000000000 RCX: 0000000000000002
[1814490.644645] RDX: 000000000000003c RSI: ffff88b007d2eb58 RDI: ffff9eec8fd07c60
[1814490.645471] RBP: 0000000000000000 R08: 0000000000000402 R09: 000000000000003f
[1814490.646285] R10: 0000000000000040 R11: 000000000000003c R12: 00000000000000c2
[1814490.647124] R13: ffff88b007d10eb0 R14: 000000000000003e R15: ffff88b007d10eb0
[1814490.647941] FS:  00007fa7c9421b40(0000) GS:ffff88b708a80000(0000) knlGS:0000000000000000
[1814490.648859] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1814490.649530] CR2: 00000000000000f6 CR3: 00000006f01a4000 CR4: 00000000003506e0
[1814490.650352] Call Trace:
[1814490.650669]  <TASK>
[1814490.650947]  ? show_trace_log_lvl+0x1c4/0x2d2
[1814490.651470]  ? show_trace_log_lvl+0x1c4/0x2d2
[1814490.651993]  ? filemap_fault+0x64/0x7a0
[1814490.652457]  ? __die_body.cold+0x8/0xd
[1814490.652912]  ? page_fault_oops+0xac/0x150
[1814490.653390]  ? do_user_addr_fault+0x61/0x5a0
[1814490.653903]  ? exc_page_fault+0x62/0x140
[1814490.654376]  ? asm_exc_page_fault+0x22/0x30
[1814490.654895]  ? __filemap_get_folio+0xab/0x330
[1814490.655426]  filemap_fault+0x64/0x7a0
[1814490.655873]  ? preempt_count_add+0x4d/0xa0
[1814490.656374]  __do_fault+0x33/0x190
[1814490.656800]  do_read_fault+0xea/0x190
[1814490.657269]  do_fault+0xbe/0x4a0
[1814490.657671]  __handle_mm_fault+0x513/0x5e0
[1814490.658172]  handle_mm_fault+0xc5/0x2b0
[1814490.658638]  do_user_addr_fault+0x1af/0x5a0
[1814490.659155]  exc_page_fault+0x62/0x140
[1814490.659614]  asm_exc_page_fault+0x22/0x30
[1814490.660101] RIP: 0033:0x7fa7c9ae13a9
[1814490.660544] Code: 7c 24 38 e8 f9 5b ff ff 85 c0 0f 85 b1 00 00 00 49 8b 84 24 48 03 00 00 4d 85 ed 74 54 48 85 c0 0f 84 eb 00 00 00 8b 54 24 30 <44> 0f b7 34 50 4c 89 f0 25 ff 7f 00 00 48 8d 14 40 49 8b 84 24 e8
[1814490.662619] RSP: 002b:00007ffedf878560 EFLAGS: 00010206
[1814490.663237] RAX: 00007fa7c963cd42 RBX: 00007fa7c9618078 RCX: 0000000000000002
[1814490.664060] RDX: 00000000000009d4 RSI: 00007fa7c9618078 RDI: 00007fa7c9625fa9
[1814490.664910] RBP: 0000000000000009 R08: 0000000000000000 R09: 00007fa7c9618078
[1814490.665748] R10: 00007fa7c9625fa9 R11: 00007fa7c9609498 R12: 00007fa7c9ad10b0
[1814490.666574] R13: 00007fa7c942e8c0 R14: 00007fa7c9607230 R15: 00007fa7c942e8c0
[1814490.667407]  </TASK>
[1814490.667703] Modules linked in: nvidia_uvm(POE) sunrpc nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) ghash_clmulni_intel video wmi drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea aesni_intel fb crypto_simd fbdev cryptd button ena sch_fq_codel drm i2c_core fuse drm_panel_orientation_quirks configfs backlight dmi_sysfs crc32_pclmul crc32c_intel efivarfs
[1814490.671595] CR2: 00000000000000f6
[1814490.672005] ---[ end trace 0000000000000000 ]---
[1814490.771019] RIP: 0010:__filemap_get_folio+0xab/0x330
[1814490.771872] Code: 10 e8 e9 59 75 00 49 89 c4 48 3d 06 04 00 00 74 e2 48 3d 02 04 00 00 74 da 48 85 c0 0f 84 2e 02 00 00 a8 01 0f 85 e3 00 00 00 <8b> 40 34 85 c0 74 c2 8d 50 01 4d 8d 7c 24 34 f0 41 0f b1 54 24 34
[1814490.773946] RSP: 0000:ffff9eec8fd07c50 EFLAGS: 00010246
[1814490.774550] RAX: 00000000000000c2 RBX: 0000000000000000 RCX: 0000000000000002
[1814490.775355] RDX: 000000000000003c RSI: ffff88b007d2eb58 RDI: ffff9eec8fd07c60
[1814490.776162] RBP: 0000000000000000 R08: 0000000000000402 R09: 000000000000003f
[1814490.776982] R10: 0000000000000040 R11: 000000000000003c R12: 00000000000000c2
[1814490.777800] R13: ffff88b007d10eb0 R14: 000000000000003e R15: ffff88b007d10eb0
[1814490.778629] FS:  00007fa7c9421b40(0000) GS:ffff88b708a80000(0000) knlGS:0000000000000000
[1814490.779543] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1814490.780237] CR2: 00000000000000f6 CR3: 00000006f01a4000 CR4: 00000000003506e0
[1814490.781058] Kernel panic - not syncing: Fatal exception
[1814490.782605] Kernel Offset: 0x14000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[1814490.886153] Rebooting in 5 seconds..

stewartsmith commented 8 months ago

Since you have the NVIDIA driver loaded, you'll need to contact NVIDIA about this issue. The NVIDIA driver is not open source software, so we are unable to debug it.
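
The `Tainted: P           OE` field on the `CPU:` line of the Oops records that out-of-tree code was loaded. As a rough illustration, here is a minimal sketch that pulls those flags out of such a line. The function name `taint_flags` and the meanings table are illustrative assumptions, not part of any kernel tooling, and only the three letters seen in this report are listed.

```python
# Hypothetical helper: extract the taint letters from a kernel Oops "CPU:" line.
# Only the flags that appear in this report are described here.
TAINT_MEANINGS = {
    "P": "proprietary module loaded",
    "O": "out-of-tree module loaded",
    "E": "unsigned module loaded",
}


def taint_flags(oops_line):
    """Return the taint letters after 'Tainted:', or '' if the field is absent."""
    _, found, rest = oops_line.partition("Tainted:")
    if not found:
        return ""
    flags = []
    for token in rest.split():
        # The flag field ends where the kernel version string begins.
        if not token.isalpha():
            break
        flags.append(token)
    return "".join(flags)


line = ("CPU: 2 PID: 932906 Comm: systemd-userwor Tainted: P           OE      "
        "6.1.41-63.114.amzn2023.x86_64 #1")
for letter in taint_flags(line):
    print(letter, "=", TAINT_MEANINGS.get(letter, "not described in this sketch"))
```

A kernel that prints `Not tainted` on that line yields an empty string from this helper.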

If you're able to reproduce without the NVIDIA driver loaded, we can take a look.
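
To confirm that a later crash was reproduced without the driver, one option is to check the `Modules linked in:` line of the new Oops for modules carrying the proprietary (`P`) flag. The following is a minimal sketch under that assumption. `flagged_modules` is a hypothetical helper name, and the sample string is a shortened copy of the line from the log above.

```python
import re


def flagged_modules(modules_line, flag="P"):
    """Return the names of modules whose parenthesised taint flags include `flag`."""
    return [
        name
        for name, flags in re.findall(r"(\w+)\(([A-Z]+)\)", modules_line)
        if flag in flags
    ]


# Shortened copy of the "Modules linked in:" line from the report.
sample = ("Modules linked in: nvidia_uvm(POE) sunrpc nvidia_drm(POE) "
          "nvidia_modeset(POE) nvidia(POE) ena fuse")
print(flagged_modules(sample))
```

An empty list for a fresh Oops would suggest that no proprietary module was loaded at the time of the crash.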