Open anh0516 opened 4 months ago
If it does not happen on 6.7 but it does with 6.8, are you able to bisect to see what change introduced this? The address it is faulting at seems rather suspect (00000000ffffffff), what is your CONFIG_INIT_STACK_* value?
Instruction pointer is in strcmp, called from snd_byt_rt5640_mc_probe. Perhaps there's a bug somewhere near there?
FWIW:
./scripts/decodecode < /tmp/x
Code: 66 2e 0f 1f 84 00 00 00 00 00 0f b6 14 0e 88 14 08 48 ff c1 84 d2 75 f2 c3 cc 31 c0 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 <0f> b6 0c 07 0f b6 14 06 38 d1 75 0a 48 ff c0 84 c9 75 ed 31 c0 c3
All code
========
0: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
7: 00 00 00
a: 0f b6 14 0e movzbl (%rsi,%rcx,1),%edx
e: 88 14 08 mov %dl,(%rax,%rcx,1)
11: 48 ff c1 inc %rcx
14: 84 d2 test %dl,%dl
16: 75 f2 jne 0xa
18: c3 ret
19: cc int3
1a: 31 c0 xor %eax,%eax
1c: 66 66 66 66 66 2e 0f data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
23: 1f 84 00 00 00 00 00
2a:* 0f b6 0c 07 movzbl (%rdi,%rax,1),%ecx <-- trapping instruction
2e: 0f b6 14 06 movzbl (%rsi,%rax,1),%edx
32: 38 d1 cmp %dl,%cl
34: 75 0a jne 0x40
36: 48 ff c0 inc %rax
39: 84 c9 test %cl,%cl
3b: 75 ed jne 0x2a
3d: 31 c0 xor %eax,%eax
3f: c3 ret
Code starting with the faulting instruction
===========================================
0: 0f b6 0c 07 movzbl (%rdi,%rax,1),%ecx
4: 0f b6 14 06 movzbl (%rsi,%rax,1),%edx
8: 38 d1 cmp %dl,%cl
a: 75 0a jne 0x16
c: 48 ff c0 inc %rax
f: 84 c9 test %cl,%cl
11: 75 ed jne 0x0
13: 31 c0 xor %eax,%eax
15: c3 ret
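To make the disassembly easier to follow, here is a rough C equivalent of the loop, annotated with the instructions it corresponds to. This is only a sketch: byte_compare is a hypothetical name, and the non-equal return path (the jne 0x40 target) lies outside the dumped bytes, so that part is an assumption modeled on a conventional strcmp.

```c
#include <stddef.h>

/*
 * Sketch of the byte-wise loop seen in the disassembly above.
 * byte_compare is a hypothetical name, not the kernel's strcmp().
 */
static int byte_compare(const char *a, const char *b)
{
	size_t i = 0;	/* xor %eax,%eax: shared index in %rax */

	for (;;) {
		unsigned char ca = (unsigned char)a[i];	/* movzbl (%rdi,%rax,1),%ecx  <-- traps here */
		unsigned char cb = (unsigned char)b[i];	/* movzbl (%rsi,%rax,1),%edx */

		if (ca != cb)	/* cmp %dl,%cl ; jne 0x40 */
			return ca < cb ? -1 : 1;	/* assumed: target not in the dump */
		if (ca == 0)	/* test %cl,%cl */
			return 0;	/* xor %eax,%eax ; ret */
		i++;	/* inc %rax */
	}
}
```

The trapping instruction is the very first load, which reads through %rdi, the first argument. In other words, the first pointer handed to the comparison is already bad before any byte has been compared.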
strcmp assumes the parameters are not NULL. Are we sure that byt_rt5640_dais[i].codecs->name is never NULL?
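For illustration, a caller that cannot rule out a NULL name could guard the comparison along these lines. name_equals is a hypothetical helper, not something taken from the driver, and the choice to treat a missing name as "no match" is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: a missing name is treated as "no match". */
static bool name_equals(const char *name, const char *expected)
{
	/* Check both pointers before strcmp() dereferences them. */
	if (name == NULL || expected == NULL)
		return false;
	return strcmp(name, expected) == 0;
}
```

A guard like this only catches NULL. A pointer that is non-NULL but still invalid, such as the 00000000ffffffff seen in this report, would pass the check and fault in strcmp anyway.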
I'm using CONFIG_INIT_STACK_NONE, as part of getting what little performance I can out of the Atom CPU of the tablet in question. I'll try INIT_STACK_ALL_ZERO when I get the chance. I attached the kernel config in the original post.
On Wed, Apr 17, 2024, 1:36 PM Nathan Chancellor @.***> wrote:
If it does not happen on 6.7 but it does with 6.8, are you able to bisect to see what change introduced this? The address it is faulting at seems rather suspect (00000000ffffffff), what is your CONFIG_INIT_STACK_* value?
@nathanchance same failure with INIT_STACK_ALL_ZERO:
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 204 Comm: (udev-worker) Tainted: G C 6.8.6-llvm #3
Hardware name: Dell Inc. Venue 8 Pro 5830/09RP78, BIOS A16 02/27/2018
RIP: 0010:strcmp+0x10/0x30
Code: 66 2e 0f 1f 84 00 00 00 00 00 0f b6 14 0e 88 14 08 48 ff c1 84 d2 75 f2 c3 cc 31 c0 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 <0f> b6 0c 07 0f b6 14 06 38 d1 75 0a 48 ff c0 84 c9 75 ed 31 c0 c3
RSP: 0018:ffffc9000056b748 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8880058a1a28 RCX: ffff88800592d2a8
RDX: ffff88800592d2a8 RSI: ffffffffa09c2388 RDI: 00000000ffffffff
RBP: ffffc9000056b898 R08: 0000000000000000 R09: 0000000000000000
R10: ffff8880058a1a00 R11: ffffffffa085003f R12: ffff88800a9b6300
R13: ffff88800592d000 R14: ffffffffa08d10a8 R15: ffff88800592d010
FS: 00007fa0c9f79540(0000) GS:ffff888076880000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000ffffffff CR3: 0000000005bc7000 CR4: 00000000001006f0
Call Trace:
<TASK>
? __die_body+0x6d/0x120
? page_fault_oops+0x3ce/0x4c0
? kernelmode_fixup_or_oops+0xf4/0x130
? exc_page_fault+0x47/0x70
? asm_exc_page_fault+0x22/0x30
? snd_byt_rt5640_mc_probe+0x3f/0xcb0 [snd_soc_sst_bytcr_rt5640]
? strcmp+0x10/0x30
snd_byt_rt5640_mc_probe+0x70/0xcb0 [snd_soc_sst_bytcr_rt5640]
? rwsem_down_write_slowpath+0x9d/0x600
? try_to_wake_up+0x308/0x370
? rwsem_wake+0x92/0xf0
? kernfs_activate+0x1e8/0x200
? kernfs_add_one+0x2c0/0x360
? 0xffffffffa0850000
platform_probe+0x48/0xb0
really_probe+0x1ce/0x400
__driver_probe_device+0x146/0x250
driver_probe_device+0x1e/0x240
__driver_attach+0x13f/0x2f0
? driver_attach+0x20/0x20
bus_for_each_dev+0x141/0x1e0
bus_add_driver+0x1e5/0x2f0
driver_register+0x71/0x170
? 0xffffffffa08bf000
do_one_initcall+0x130/0x3a0
? try_to_wake_up+0x308/0x370
? rwsem_wake+0x92/0xf0
? kernfs_activate+0x1e8/0x200
? sched_clock+0xc/0x20
? rwsem_down_write_slowpath+0xc6/0x600
? try_to_wake_up+0x308/0x370
? rwsem_wake+0x92/0xf0
? kernfs_activate+0x1e8/0x200
? kernfs_add_one+0x2c0/0x360
? __kernfs_create_file+0xa9/0xe0
? sysfs_create_bin_file+0xc4/0x100
? kobject_create_and_add+0x72/0xd0
? add_notes_attrs+0x190/0x200
? __slab_free+0x7b/0x2e0
? load_module+0x18d3/0x1b30
do_init_module+0x65/0x480
__se_sys_finit_module+0x332/0x4a0
do_syscall_64+0x65/0x130
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7fa0cab4915d
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a3 ab 0d 00 f7 d8 64 89 01 48
RSP: 002b:00007ffce60b7818 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
RAX: ffffffffffffffda RBX: 0000563062474100 RCX: 00007fa0cab4915d
RDX: 0000000000000000 RSI: 00007fa0cac682f0 RDI: 000000000000001e
RBP: 00007fa0cac682f0 R08: 0000000000000001 R09: 0000563062478a40
R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000020000
R13: 0000563062468910 R14: 0000563062474d60 R15: 0000563062478c80
</TASK>
Modules linked in: snd_soc_sst_bytcr_rt5640(+) intel_soc_dts_thermal mousedev intel_soc_dts_iosf intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul gpio_keys ath6kl_sdio(+) hid_multitouch polyval_generic ath6kl_core hid_sensor_hub gf128mul ghash_clmulni_intel cfg80211 snd_sof_acpi_intel_byt snd_sof_xtensa_dsp snd_sof_acpi rfkill aesni_intel snd_sof_intel_atom crypto_simd snd_sof cryptd snd_sof_utils intel_cstate i915 snd_intel_sst_acpi snd_soc_acpi_intel_match snd_intel_sst_core snd_soc_sst_atom_hifi2_platform intel_bytcrc_pwrsrc snd_soc_rt5640 snd_intel_dspcfg snd_soc_rl6231 snd_soc_acpi int3401_thermal snd_soc_core i2c_algo_bit processor_thermal_device snd_compress processor_thermal_power_floor processor_thermal_wt_hint processor_thermal_wt_req ov5693 drm_display_helper atomisp_mt9m114(C) snd_pcm processor_thermal_rfim v4l2_fwnode atomisp_gmin_platform(C) int3406_thermal v4l2_async drm_buddy video processor_thermal_mbox snd_timer wmi processor_thermal_rapl intel_gtt snd soc_button_array backlight
dptf_power ttm intel_rapl_common videodev vfat int3400_thermal fat int3403_thermal int340x_thermal_zone acpi_thermal_rel cec mc v4l2_cci soundcore intel_int0002_vgpio i2c_hid_acpi i2c_hid 8250_dw pwm_lpss_platform pwm_lpss crypto_user mmc_block xhci_pci xhci_pci_renesas xhci_hcd usbcore usb_common sdhci_acpi sdhci mmc_core spi_pxa2xx_platform ext4 crc32c_generic crc32c_intel mbcache crc16 jbd2
CR2: 00000000ffffffff
---[ end trace 0000000000000000 ]---
RIP: 0010:strcmp+0x10/0x30
Code: 66 2e 0f 1f 84 00 00 00 00 00 0f b6 14 0e 88 14 08 48 ff c1 84 d2 75 f2 c3 cc 31 c0 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 <0f> b6 0c 07 0f b6 14 06 38 d1 75 0a 48 ff c0 84 c9 75 ed 31 c0 c3
RSP: 0018:ffffc9000056b748 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8880058a1a28 RCX: ffff88800592d2a8
RDX: ffff88800592d2a8 RSI: ffffffffa09c2388 RDI: 00000000ffffffff
RBP: ffffc9000056b898 R08: 0000000000000000 R09: 0000000000000000
R10: ffff8880058a1a00 R11: ffffffffa085003f R12: ffff88800a9b6300
R13: ffff88800592d000 R14: ffffffffa08d10a8 R15: ffff88800592d010
FS: 00007fa0c9f79540(0000) GS:ffff888076880000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000ffffffff CR3: 0000000005bc7000 CR4: 00000000001006f0
I was not necessarily expecting the INIT_STACK configuration to really matter but thank you for checking!
I noticed https://git.kernel.org/linus/7d99a70b65951108d82e1618c67abe69c3ed7720 in the list of changes from 6.7 to 6.8, which seems potentially relevant here since it mentions fixing a strcmp() crash due to a 6.8 change. It just seems like name is an invalid value, not NULL. How does LTO affect that, since this presumably does not happen without LTO?
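One thing that can be said about the value itself: 00000000ffffffff is neither NULL nor shaped like the valid kernel pointers in the register dump, which all begin with ffff. It is the bit pattern of a 32-bit all-ones value zero-extended to 64 bits. The sketch below only illustrates that arithmetic with hypothetical helper names; whether any such widening is actually involved in this crash is an assumption, not something established in this thread.

```c
#include <stdint.h>

/* Hypothetical illustration: widen a 32-bit value to 64 bits. */
static uint64_t widen_unsigned(uint32_t v)
{
	return (uint64_t)v;	/* zero-extends: upper 32 bits become 0 */
}

static uint64_t widen_signed(int32_t v)
{
	return (uint64_t)(int64_t)v;	/* sign-extends: upper bits copy the sign */
}
```

With an input of -1, the zero-extending path yields 0x00000000ffffffff, matching the faulting address, while the sign-extending path yields 0xffffffffffffffff.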
@nathanchance reverting that change did not help, sadly. The failed machine code reported from the BUG() is exactly the same, too, so the failure must be elsewhere. Where did you get the list of changes? kernelnewbies.org hasn't been updated for 6.8 yet and I can't find it anywhere else. (I'm new to kernel debugging.)
I noticed https://git.kernel.org/linus/7d99a70b65951108d82e1618c67abe69c3ed7720 It just seems like name is an invalid value, not NULL.
The name field may have a NULL value. snd_byt_rt5640_mc_probe is not checking for that before calling strcmp! If strcmp is inlined due to LTO, and it doesn't check for NULL (the one in lib/string.c doesn't and isn't required to) then perhaps if byt_rt5640_dais[i].codecs->name is determined to be NULL at compile time (loop unroll + cross TU inlining) then LLVM will start removing code due to UB.
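As a generic illustration of the kind of UB-driven code removal described here (not a claim about what the driver does), consider a NULL check that comes after the pointer has already been dereferenced. Both function names below are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helpers for illustration; not code from the driver. */
static int first_bytes_equal(const char *a, const char *b)
{
	return a[0] == b[0];	/* dereferences both pointers */
}

static int check_after_use(const char *name, const char *want)
{
	int eq = first_bytes_equal(name, want);	/* UB if name is NULL */

	/*
	 * Too late: name was already dereferenced, so once the helper is
	 * inlined an optimizer may assume name != NULL and delete this branch.
	 */
	if (name == NULL)
		return 0;
	return eq;
}
```

For valid inputs the function behaves as written. The point is only that the late check offers no protection: the compiler is entitled to treat it as dead code, and inlining across translation units gives it more opportunities to see the dereference and the check together.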
EDIT: Nvm my tree was out of date. Perhaps time to break out ubsan?
I enabled UBSAN but it didn't catch anything, only two probably unrelated array index out of bounds in net/wireless/nl80211.c.
I'll put them here anyways, though, just in case:
UBSAN: array-index-out-of-bounds in net/wireless/nl80211.c:9203:29
index 47 is out of range for type 'struct ieee80211_channel *[]'
CPU: 1 PID: 307 Comm: wpa_supplicant Tainted: G D C 6.8.7-llvm #1
Hardware name: Dell Inc. Venue 8 Pro 5830/09RP78, BIOS A16 02/27/2018
Call Trace:
<TASK>
__ubsan_handle_out_of_bounds+0xdd/0x140
nl80211_trigger_scan+0xaf2/0xc00 [cfg80211]
? genl_family_rcv_msg_attrs_parse+0x9d/0xc0
genl_family_rcv_msg_doit+0xb4/0xf0
genl_rcv_msg+0x226/0x240
? genlmsg_multicast_netns+0x40/0x40 [cfg80211]
? nl80211_update_mesh_config+0xf0/0xf0 [cfg80211]
? nl80211_pre_doit+0x320/0x320 [cfg80211]
? genl_release+0x260/0x260
netlink_rcv_skb+0x74/0x100
genl_rcv+0x1f/0x80
netlink_unicast+0x2cf/0x520
netlink_sendmsg+0x492/0x5c0
____sys_sendmsg+0x1a9/0x260
__sys_sendmsg+0x2c9/0x330
do_syscall_64+0x65/0x130
? exc_page_fault+0x47/0x70
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7fbfb78c1c84
Code: 89 02 b8 ff ff ff ff eb bb 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 80 3d c5 e3 0d 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 48 83 ec 28 89 54 24 1c 48 89
RSP: 002b:00007fffc6d09bc8 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 000055c1408e79d0 RCX: 00007fbfb78c1c84
RDX: 0000000000000000 RSI: 00007fffc6d09c00 RDI: 0000000000000006
RBP: 000055c14095de80 R08: 0000000000000004 R09: 00000000000000f0
R10: 00007fffc6d09cd4 R11: 0000000000000202 R12: 000055c1408e7cb0
R13: 00007fffc6d09c00 R14: 0000000000000000 R15: 00007fffc6d09cd4
</TASK>
---[ end trace ]---
------------[ cut here ]------------
UBSAN: array-index-out-of-bounds in net/wireless/nl80211.c:9252:5
index 0 is out of range for type 'struct ieee80211_channel *[]'
CPU: 1 PID: 307 Comm: wpa_supplicant Tainted: G D C 6.8.7-llvm #1
Hardware name: Dell Inc. Venue 8 Pro 5830/09RP78, BIOS A16 02/27/2018
Call Trace:
<TASK>
__ubsan_handle_out_of_bounds+0xdd/0x140
nl80211_trigger_scan+0x3a6/0xc00 [cfg80211]
? genl_family_rcv_msg_attrs_parse+0x9d/0xc0
genl_family_rcv_msg_doit+0xb4/0xf0
genl_rcv_msg+0x226/0x240
? genlmsg_multicast_netns+0x40/0x40 [cfg80211]
? nl80211_update_mesh_config+0xf0/0xf0 [cfg80211]
? nl80211_pre_doit+0x320/0x320 [cfg80211]
? genl_release+0x260/0x260
netlink_rcv_skb+0x74/0x100
genl_rcv+0x1f/0x80
netlink_unicast+0x2cf/0x520
netlink_sendmsg+0x492/0x5c0
____sys_sendmsg+0x1a9/0x260
__sys_sendmsg+0x2c9/0x330
do_syscall_64+0x65/0x130
? exc_page_fault+0x47/0x70
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7fbfb78c1c84
Code: 89 02 b8 ff ff ff ff eb bb 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 80 3d c5 e3 0d 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 48 83 ec 28 89 54 24 1c 48 89
RSP: 002b:00007fffc6d09bc8 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 000055c1408e79d0 RCX: 00007fbfb78c1c84
RDX: 0000000000000000 RSI: 00007fffc6d09c00 RDI: 0000000000000006
RBP: 000055c14095de80 R08: 0000000000000004 R09: 00000000000000f0
R10: 00007fffc6d09cd4 R11: 0000000000000202 R12: 000055c1408e7cb0
R13: 00007fffc6d09c00 R14: 0000000000000000 R15: 00007fffc6d09cd4
</TASK>
---[ end trace ]---
systemd-journald[159]: /var/log/journal/ec011dcc411f49bf97b2f78b4fbfd5a6/user-1000.journal: Journal file uses a different sequence number ID, rotating.
------------[ cut here ]------------
UBSAN: array-index-out-of-bounds in net/wireless/nl80211.c:9232:4
index 0 is out of range for type 'struct ieee80211_channel *[]'
CPU: 3 PID: 307 Comm: wpa_supplicant Tainted: G D C 6.8.7-llvm #1
Hardware name: Dell Inc. Venue 8 Pro 5830/09RP78, BIOS A16 02/27/2018
Call Trace:
<TASK>
__ubsan_handle_out_of_bounds+0xdd/0x140
nl80211_trigger_scan+0x498/0xc00 [cfg80211]
? genl_family_rcv_msg_attrs_parse+0x9d/0xc0
genl_family_rcv_msg_doit+0xb4/0xf0
genl_rcv_msg+0x226/0x240
? genlmsg_multicast_netns+0x40/0x40 [cfg80211]
? nl80211_update_mesh_config+0xf0/0xf0 [cfg80211]
? nl80211_pre_doit+0x320/0x320 [cfg80211]
? genl_release+0x260/0x260
netlink_rcv_skb+0x74/0x100
genl_rcv+0x1f/0x80
netlink_unicast+0x2cf/0x520
netlink_sendmsg+0x492/0x5c0
____sys_sendmsg+0x1a9/0x260
__sys_sendmsg+0x2c9/0x330
do_syscall_64+0x65/0x130
? exc_page_fault+0x47/0x70
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7fbfb78c1c84
Code: 89 02 b8 ff ff ff ff eb bb 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 80 3d c5 e3 0d 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 48 83 ec 28 89 54 24 1c 48 89
RSP: 002b:00007fffc6d09bc8 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 000055c1408e79d0 RCX: 00007fbfb78c1c84
RDX: 0000000000000000 RSI: 00007fffc6d09c00 RDI: 0000000000000006
RBP: 000055c1409621c0 R08: 0000000000000004 R09: 00000000000000f0
R10: 00007fffc6d09cd4 R11: 0000000000000202 R12: 000055c1408e7cb0
R13: 00007fffc6d09c00 R14: 0000000000000000 R15: 00007fffc6d09cd4
</TASK>
---[ end trace ]---
It probably has to do with this which is output right before: ath6kl: Firmware lacks RSN-CAP-OVERRIDE, so HT (802.11n) is disabled.
I did not enable any other debugging features and left UBSAN at its defaults. What else would you recommend I turn on?
Probably worth notifying the maintainers of those drivers, but that sounds orthogonal to the issue being tracked here. (If you use triple backticks in GitHub markdown to open and close your trace, it will retain the original line wrapping.) Perhaps worth testing ASAN, too. Though IIRC ASAN is incompatible with LTO. Did you verify you don't observe this without LTO?
Perhaps worth testing ASAN, too. Though IIRC ASAN is incompatible with LTO.
I think KASAN is now allowed with LTO: https://git.kernel.org/linus/349fde599db65d4827820ef6553e3f9ee75b8c7c
Something that occurred to me is LTO may have inlined some other function that calls strcmp() into snd_byt_rt5640_mc_probe(), which won't be entirely obvious from the stack trace.
Can you try running your stack trace through scripts/decode_stacktrace.sh? Assuming the stack trace is saved in crash.log:
$ LLVM=1 scripts/decode_stacktrace.sh vmlinux <crash.log
and see if that gives us any other idea what is going on here?
On a Dell Venue 8 Pro 5830, Linux 6.8 built with Clang and LTO triggers a BUG() and causes the rt5640 audio driver to fail. This happens with the older driver and with the newer SOF driver forced with snd-intel-dspcfg.dsp_driver=3 on the kernel command line. It was fine on 6.7, but there were changes to the rt5640 driver in 6.8 that seem to have broken things. This happens with both clang 17 and 18, with builds made on Arch Linux and on my Gentoo box, so that eliminates quite a few variables.
Additionally, the system hangs in the final stages of reboot or poweroff. I don't know whether that is related to this or a separate issue.
Here is my kernel configuration: config-6.8.6-lto.txt
Being that this is a pretty uncommon piece of hardware, I am happy to test patches.