Closed thor2002ro closed 3 years ago
There may be a reason why this change is happening in the kernel; see: https://elixir.bootlin.com/linux/v5.14-rc6/source/include/linux/sched.h#L669
The fix is to replace it with:
(short int) thread->__state;
It sounds like 5.14 is preparing to remove that field!
@thor2002ro: please switch to the develop branch to get the fix.
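For readers following along, the rename can be handled with a version guard. The block below is a minimal userspace sketch, not the code from the develop branch: `struct mock_task`, `TASK_STATE`, `task_state_of` and `task_state_selfcheck` are hypothetical names, and `KERNEL_VERSION` / `LINUX_VERSION_CODE` are stand-ins for what `<linux/version.h>` provides in a real module build. It assumes the field is called `state` before 5.14 and `__state` from 5.14 on, as the linked sched.h shows.

```c
#include <assert.h>

/*
 * Userspace sketch only. In a real module, KERNEL_VERSION and
 * LINUX_VERSION_CODE come from <linux/version.h>, and the field lives in
 * struct task_struct from <linux/sched.h>.
 */
#ifndef KERNEL_VERSION
#define KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))
#endif
#ifndef LINUX_VERSION_CODE
#define LINUX_VERSION_CODE KERNEL_VERSION(5, 14, 0) /* assumed target */
#endif

/* Hypothetical stand-in for the single task_struct field discussed here. */
struct mock_task {
#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 14, 0)
	unsigned int __state;	/* field name from 5.14 on */
#else
	volatile long state;	/* field name before 5.14 */
#endif
};

/* Hypothetical helper: choose the field name once, use it everywhere. */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 14, 0)
#define TASK_STATE(t) ((short int) (t)->__state)
#else
#define TASK_STATE(t) ((short int) (t)->state)
#endif

/* Reads the task state through the version-guarded macro. */
static short int task_state_of(const struct mock_task *t)
{
	return TASK_STATE(t);
}

/* Tiny self-check: writes the field directly, reads it via the helper. */
static void task_state_selfcheck(void)
{
	struct mock_task t;

#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 14, 0)
	t.__state = 2;
#else
	t.state = 2;
#endif
	assert(task_state_of(&t) == 2);
}
```

Keeping the guard in one macro means only one place needs touching if the field changes again in a later kernel.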
I don't think it's enough. :)
EDIT:
Aug 24 20:42:28 Tower kernel: BUG: unable to handle page fault for address: ffffffffa0d41728
Aug 24 20:42:28 Tower kernel: #PF: supervisor write access in kernel mode
Aug 24 20:42:28 Tower kernel: #PF: error_code(0x0003) - permissions violation
Aug 24 20:42:28 Tower kernel: PGD 280e067 P4D 280e067 PUD 280f063 PMD 13dcca067 PTE 144cf5161
Aug 24 20:42:28 Tower kernel: Oops: 0003 [#1] SMP NOPTI
Aug 24 20:42:28 Tower kernel: CPU: 1 PID: 32589 Comm: modprobe Tainted: G W O 5.14.0-rc6-thor-Unraid #10
Aug 24 20:42:28 Tower kernel: Hardware name: Gigabyte Technology Co., Ltd. X470 AORUS GAMING 7 WIFI/X470 AORUS GAMING 7 WIFI-CF, BIOS F61b 04/01/2021
Aug 24 20:42:28 Tower kernel: RIP: 0010:Compute_AMD_Zen_Boost+0x33/0x8d0 [corefreqk]
Aug 24 20:42:28 Tower kernel: Code: 56 41 55 41 54 53 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 10 41 89 ff 48 8b 05 cd 62 02 00 48 8b 40 20 4a 8b 04 f8 80 f8 08 00 00 00 00 00 00 48 8b 05 b4 62 02 00 48 8b 40 20 4a
Aug 24 20:42:28 Tower kernel: RSP: 0018:ffffc90000bcb9c0 EFLAGS: 00010286
Aug 24 20:42:28 Tower kernel: RAX: ffffffffa0d40e30 RBX: 0000001000000000 RCX: 00000000436f7265
Aug 24 20:42:28 Tower kernel: RDX: 0000000000000000 RSI: ffff8885e7d27100 RDI: 0000000000000000
Aug 24 20:42:28 Tower kernel: RBP: ffffc90000bcba80 R08: 0000000000100002 R09: ffff888102f42000
Aug 24 20:42:28 Tower kernel: R10: ffffffffa0d40e30 R11: ffffffffa0d42a60 R12: ffff888102f4219c
Aug 24 20:42:28 Tower kernel: R13: ffff8881cbc53c00 R14: 0000000000000000 R15: 0000000000000000
Aug 24 20:42:28 Tower kernel: FS: 0000153549d66740(0000) GS:ffff88882e840000(0000) knlGS:0000000000000000
Aug 24 20:42:28 Tower kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 24 20:42:28 Tower kernel: CR2: ffffffffa0d41728 CR3: 00000007455ec000 CR4: 0000000000750ee0
Aug 24 20:42:28 Tower kernel: PKRU: 55555554
Aug 24 20:42:28 Tower kernel: Call Trace:
Aug 24 20:42:28 Tower kernel: Query_AMD_Family_17h+0x2c9/0x850 [corefreqk]
Aug 24 20:42:28 Tower kernel: Controller_Init+0xb6/0x620 [corefreqk]
Aug 24 20:42:28 Tower kernel: CoreFreqK_Ignition_Level_Up+0x3b5/0xb20 [corefreqk]
Aug 24 20:42:28 Tower kernel: init_module+0x44/0x1000 [corefreqk]
Aug 24 20:42:28 Tower kernel: ? 0xffffffffa0cc8000
Aug 24 20:42:28 Tower kernel: do_one_initcall+0xbe/0x230
Aug 24 20:42:28 Tower kernel: ? idr_alloc_cyclic+0x17b/0x220
Aug 24 20:42:28 Tower kernel: ? __vunmap+0x208/0x4d0
Aug 24 20:42:28 Tower kernel: ? kfree+0x1aa/0x310
Aug 24 20:42:28 Tower kernel: ? free_unref_page+0xca/0x1f0
Aug 24 20:42:28 Tower kernel: ? do_init_module+0x25/0x370
Aug 24 20:42:28 Tower kernel: ? kmem_cache_alloc_trace+0x19e/0x430
Aug 24 20:42:28 Tower kernel: do_init_module+0x5b/0x370
Aug 24 20:42:28 Tower kernel: load_module+0x4599/0x50e0
Aug 24 20:42:28 Tower kernel: __se_sys_init_module+0x216/0x260
Aug 24 20:42:28 Tower kernel: do_syscall_64+0x41/0xc0
Aug 24 20:42:28 Tower kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
Aug 24 20:42:28 Tower kernel: RIP: 0033:0x153549ea909a
Aug 24 20:42:28 Tower kernel: Code: 48 8b 0d f9 7d 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c6 7d 0c 00 f7 d8 64 89 01 48
Aug 24 20:42:28 Tower kernel: RSP: 002b:00007fff0c036838 EFLAGS: 00000202 ORIG_RAX: 00000000000000af
Aug 24 20:42:28 Tower kernel: RAX: ffffffffffffffda RBX: 0000000000428080 RCX: 0000153549ea909a
Aug 24 20:42:28 Tower kernel: RDX: 000000000041c368 RSI: 00000000000b1ff8 RDI: 00001535493c9010
Aug 24 20:42:28 Tower kernel: RBP: 00001535493c9010 R08: 000000000042701a R09: 0000000000000001
Aug 24 20:42:28 Tower kernel: R10: 0000000000427010 R11: 0000000000000202 R12: 000000000041c368
Aug 24 20:42:28 Tower kernel: R13: 0000000000000000 R14: 00000000004281b0 R15: 0000000000428080
Aug 24 20:42:28 Tower kernel: Modules linked in: corefreqk(O+) xt_nat xt_tcpudp nf_tables macvlan vhost_net tun vhost vhost_iotlb tap xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter md_mod vendor_reset(O) amdgpu mfd_core drm_ttm_helper ttm gpu_sched drm_kms_helper drm backlight agpgart sysimgblt syscopyarea sysfillrect fb_sys_fops it87 hwmon_vid ip6table_filter ip6_tables iptable_filter ip_tables x_tables edac_mce_amd kvm_amd kvm wmi_bmof gigabyte_wmi mxm_wmi btusb btrtl btbcm btintel ghash_clmulni_intel aesni_intel bluetooth igb crypto_simd cryptd corsair_psu input_leds ahci ecdh_generic i2c_piix4 i2c_algo_bit i2c_core led_class rapl ecc wmi ccp libahci button zenpower(O) acpi_cpufreq ryzen_smu(O)
Aug 24 20:42:28 Tower kernel: CR2: ffffffffa0d41728
Aug 24 20:42:28 Tower kernel: ---[ end trace 9ec0ff72c42472fe ]---
Aug 24 20:42:28 Tower kernel: RIP: 0010:Compute_AMD_Zen_Boost+0x33/0x8d0 [corefreqk]
Aug 24 20:42:28 Tower kernel: Code: 56 41 55 41 54 53 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 10 41 89 ff 48 8b 05 cd 62 02 00 48 8b 40 20 4a 8b 04 f8 80 f8 08 00 00 00 00 00 00 48 8b 05 b4 62 02 00 48 8b 40 20 4a
Aug 24 20:42:28 Tower kernel: RSP: 0018:ffffc90000bcb9c0 EFLAGS: 00010286
Aug 24 20:42:28 Tower kernel: RAX: ffffffffa0d40e30 RBX: 0000001000000000 RCX: 00000000436f7265
Aug 24 20:42:28 Tower kernel: RDX: 0000000000000000 RSI: ffff8885e7d27100 RDI: 0000000000000000
Aug 24 20:42:28 Tower kernel: RBP: ffffc90000bcba80 R08: 0000000000100002 R09: ffff888102f42000
Aug 24 20:42:28 Tower kernel: R10: ffffffffa0d40e30 R11: ffffffffa0d42a60 R12: ffff888102f4219c
Aug 24 20:42:28 Tower kernel: R13: ffff8881cbc53c00 R14: 0000000000000000 R15: 0000000000000000
Aug 24 20:42:28 Tower kernel: FS: 0000153549d66740(0000) GS:ffff88882e840000(0000) knlGS:0000000000000000
Aug 24 20:42:28 Tower kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 24 20:42:28 Tower kernel: CR2: ffffffffa0d41728 CR3: 00000007455ec000 CR4: 0000000000750ee0
Aug 24 20:42:28 Tower kernel: PKRU: 55555554
@thor2002ro: you haven't told me enough. Which processor? Which Linux distribution?
Was CoreFreq running OK with kernel 5.13? Master and/or develop branch?
EDIT: The issue could be linked to PTI or other kernel mitigation mechanisms. Can you disable all of them at boot?
The CPU is a 5900X. Distro? Does it matter? The module fails to load, so it's a pure kernel thing (I think I cut the dmesg too early; the error occurs when loading the module, without running the app). But OK, it's Unraid. I build my own kernels, so there should be no interference from the distro. It runs fine on 5.13, and the branch is develop. Mitigations are disabled by default for me.
I think 5.14 final will drop this week; we'll see if it still breaks.
Just built 5.14.0-rc7. Except for a few C fall-through warnings to fix up, CoreFreq is apparently running fine. FYI: ArchLinux.
EDIT: KFENCE is showing up when unloading the driver, with 5.14 only.
corefreqk: loading out-of-tree module taints kernel.
corefreqk: module verification failed: signature and/or required key missing - tainting kernel
CoreFreq(14:30): Processor [ 8F_71] Architecture [Zen2/Matisse] SMT [32/32]
==================================================================
BUG: KFENCE: use-after-free read in __static_call_text_end+0x466/0x4af
Use-after-free read at 0x0000000031c62872 (in kfence-#151):
__static_call_text_end+0x466/0x4af
d_lookup+0x29/0x40
lookup_dcache+0x18/0x60
__lookup_hash+0x20/0xa0
kern_path_locked+0x9b/0x110
handle_remove+0x76/0x2e0
devtmpfs_work_loop.cold+0xc/0x13
devtmpfsd+0x25/0x34
kthread+0x12f/0x160
ret_from_fork+0x1f/0x30
kfence-#151: 0x00000000353de047-0x0000000026a95dc9, size=4096, cache=names_cache
allocated by task 239 on cpu 24 at 398.583578s:
getname_kernel+0x25/0x110
kern_path_locked+0x3f/0x110
handle_remove+0x76/0x2e0
devtmpfs_work_loop.cold+0xc/0x13
devtmpfsd+0x25/0x34
kthread+0x12f/0x160
ret_from_fork+0x1f/0x30
freed by task 239 on cpu 24 at 398.583584s:
kern_path_locked+0x68/0x110
handle_remove+0x76/0x2e0
devtmpfs_work_loop.cold+0xc/0x13
devtmpfsd+0x25/0x34
kthread+0x12f/0x160
ret_from_fork+0x1f/0x30
CPU: 24 PID: 239 Comm: kdevtmpfs Tainted: G OE 5.14.0-rc7-next-20210827-1-next-git #1
Hardware name: ASUS System Product Name/ROG CROSSHAIR VIII HERO (WI-FI), BIOS 3801 07/30/2021
==================================================================
CoreFreq: Unload
Interesting, maybe it was an issue in rc6. I'll wait for the final 5.14, rebuild it and report back here. Should be any day now.
Yes, let's wait and see, but that thread->__state should be renamed by its author.
Kernel source browsers, such as Bootlin, exist for good reasons ...
Marking original issue as bugfix
Develop branch appears to work fine with 5.14
Thank you for your confirmation.
OK, I got some testing time today with 5.14 and figured out what the problem was. I was building my kernels with clang and full LTO optimization. It seems the corefreqk module breaks when built with full LTO (main binary and modules get optimized); building the kernel with clang and thin LTO (only the main binary gets LTO) works with no problem.
Edit: I do seem to get some messages though, and it keeps working.
I'm interested in your build! Can you show me how you managed to compile the kernel module with clang?
About the error: if it is not due to an incompatible request, be aware that some keystrokes can be sent and misinterpreted by the UI. For example:
I build it like this:
make CC=clang LLVM=1 LLVM_IAS=1 DELAY_TSC=1 OPTIM_LVL=3 WARNING="-Wall -Wfatal-errors -static -pthread" KERNELDIR=$KERNEL_LOCATION all
Thanks, but I still have this:
clang-12: fatal error: unknown argument: '-fplugin-arg-structleak_plugin-byref-all'
and these without -Wfatal-errors:
clang-12: error: unknown argument: '-fplugin-arg-structleak_plugin-byref-all'
clang-12: error: unsupported option '-mrecord-mcount' for target 'x86_64-pc-linux-gnu'
Now it depends on the distro. I remember you saying you're using Arch: remove "-static", since Arch doesn't have static libs. I think you can remove the whole WARNING section that replaces the default one. On that front, I always use Ubuntu build VMs to build stuff; I've found over the years that they are less of a headache when building for any distro, and they have static libs :)
Also make sure ld.bfd is supplying ld; ld.gold doesn't play well with kernel stuff.
Edit: you also need to rebuild the kernel with clang. Building the module with clang against a kernel built with gcc won't work. I think the error is related to this.
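The compiler mismatch described above could in principle be caught at compile time. This is a hedged sketch, not something CoreFreq is known to do: it assumes the kernel's Kconfig symbols `CONFIG_CC_IS_GCC` and `CONFIG_CC_IS_CLANG` are visible to the module build (in a plain userspace compile they are undefined, so the guard stays silent). `BUILD_CC`, `KERNEL_CC`, `compiler_mismatch` and `compiler_selfcheck` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <string.h>

/* Refuse to build when the kernel's recorded compiler and ours differ. */
#if defined(CONFIG_CC_IS_GCC) && defined(__clang__)
#error "kernel was configured with gcc, but this module is being built with clang"
#endif
#if defined(CONFIG_CC_IS_CLANG) && !defined(__clang__)
#error "kernel was configured with clang, but this module is not being built with clang"
#endif

/* Compiler that is building this translation unit. */
#if defined(__clang__)
#define BUILD_CC "clang"
#elif defined(__GNUC__)
#define BUILD_CC "gcc"
#else
#define BUILD_CC "unknown"
#endif

/* Compiler recorded in the kernel configuration, when available. */
#if defined(CONFIG_CC_IS_CLANG)
#define KERNEL_CC "clang"
#elif defined(CONFIG_CC_IS_GCC)
#define KERNEL_CC "gcc"
#else
#define KERNEL_CC "unknown"
#endif

/* Returns 1 only when both compilers are known and they differ. */
static int compiler_mismatch(const char *kernel_cc, const char *build_cc)
{
	if (strcmp(kernel_cc, "unknown") == 0 || strcmp(build_cc, "unknown") == 0)
		return 0;
	return strcmp(kernel_cc, build_cc) != 0;
}

/* Self-check of the runtime helper with the two cases from the thread. */
static void compiler_selfcheck(void)
{
	assert(compiler_mismatch("gcc", "clang") == 1);
	assert(compiler_mismatch("clang", "clang") == 0);
}
```

A guard like this would turn the "unknown argument" failures into a single, readable message, since those flags come from a kernel configuration generated for a different compiler.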
Using the 5.14.1-arch1-1 release of ArchLinux, the error Use-after-free read has disappeared.
Later I will try to find time to build a full clang-ed kernel
So far, behavior of CoreFreq is nominal
Probably part of the issue: https://www.theregister.com/2021/09/08/compromise_linux_kernel_compiler_warnings/
That's 5.15, or it can be tested in Linux next. Should be fun, they forced -Werror on everything 🤣