ClangBuiltLinux / linux


ThinLTO breaks kernel with ZFS built-in #1731

Open yshui opened 2 years ago

yshui commented 2 years ago

I reported this in #1440, but after further investigation this looks like a different issue.

The symptom is the same as #1440:

[    0.853948][    T1] jump_label: Fatal kernel bug, unexpected op at swap_writepage+0x1c/0x60 [(____ptrval____)] (eb 1c 48 89 df != 66 90 0f 1f 00)) size:2 type:1
[    0.854952][    T1] ------------[ cut here ]------------
[    0.855258][    T1] kernel BUG at arch/x86/kernel/jump_label.c:73!
[    0.855617][    T1] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[    0.855951][    T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.19.0-local+ #25
[    0.856363][    T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS d55cb5a 04/01/2014
[    0.856871][    T1] RIP: 0010:__jump_label_patch+0x190/0x1a0
[    0.857196][    T1] Code: 5e 41 5f 5d e9 d1 66 f7 ff 48 c7 c7 22 63 cf b9 4c 89 fe 4c 89 fa 4c 89 f9 49 89 d8 45 89 e1 41 56 e8 d9 a3 0d 00 48 83 c4 08 <0f> 0b 0f 0b 0f 0b 0f 0b 00 00 cc cc 00 00 cc cc 48 c7 c7 18 e1 25
[    0.858282][    T1] RSP: 0018:ffff8ad701173c18 EFLAGS: 00010286
[    0.858624][    T1] RAX: 000000000000008c RBX: ffffffffb9d62f21 RCX: ffffffffba265fc0
[    0.859063][    T1] RDX: 0000000000000000 RSI: c000000100010ae5 RDI: 0000000000000002
[    0.859508][    T1] RBP: ffffffffba6fe214 R08: 0000000000000000 R09: ffffffffba27e250
[    0.859949][    T1] R10: 00000000ffffffff R11: 0000000100010ae5 R12: 0000000000000002
[    0.860386][    T1] R13: ffffffffb9d62f21 R14: 0000000000000001 R15: ffffffffb7f77fac
[    0.860831][    T1] FS:  0000000000000000(0000) GS:ffff8ad704a00000(0000) knlGS:0000000000000000
[    0.861323][    T1] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.861690][    T1] CR2: ffff8ad707a01000 CR3: 0000000007210000 CR4: 0000000000350ff0
[    0.862130][    T1] Call Trace:
[    0.862311][    T1]  <TASK>
[    0.862472][    T1]  ? swap_writepage+0x1c/0x60
[    0.862734][    T1]  ? swap_writepage+0x2b/0x60
[    0.862990][    T1]  ? swap_writepage+0x1e/0x60
[    0.863247][    T1]  ? arch_jump_label_transform_queue+0x26/0x60
[    0.863592][    T1]  ? __jump_label_update+0x99/0x150
[    0.863880][    T1]  ? static_key_slow_inc_cpuslocked+0x4c/0x80
[    0.864214][    T1]  ? frontswap_register_ops+0x2c/0x40
[    0.864517][    T1]  ? init_zswap+0x19b/0x233
[    0.864770][    T1]  ? init_frontswap+0x9b/0x9b
[    0.865027][    T1]  ? do_one_initcall+0x120/0x2b0
[    0.865320][    T1]  ? do_initcall_level+0x7a/0xd8
[    0.865594][    T1]  ? do_initcalls+0x44/0x6b
[    0.865841][    T1]  ? kernel_init_freeable+0xd8/0x122
[    0.866131][    T1]  ? rest_init+0xc0/0xc0
[    0.866369][    T1]  ? kernel_init+0x11/0x1a0
[    0.866621][    T1]  ? ret_from_fork+0x22/0x30
[    0.866876][    T1]  </TASK>
[    0.867041][    T1] Modules linked in:
[    0.867290][    T1] ---[ end trace 0000000000000000 ]---

However, this only happens if I build the ZFS module into the kernel. (Of course, disabling CONFIG_ZSWAP helps, since it works around that jump label.)

Interesting how ZFS can trigger a codegen change in seemingly completely unrelated code through LTO.

This is probably related to openzfs/zfs#13549
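
For anyone comparing their own boot log against this one, here is a minimal sketch that pulls apart the `jump_label: Fatal kernel bug` line. It assumes the format seen in the log above, with the bytes found in the kernel text on the left of `!=` and the bytes the patcher expected on the right, and it only looks at the first `size` bytes of each. The names `PATTERN`, `describe` and `decode` are made up for this example, and only the two 2-byte encodings relevant to this report are recognized.

```python
import re

# Assumed layout, taken from the log line in this report:
#   ... unexpected op at <symbol> [...] (<found> != <expected>)) size:<n> type:<t>
PATTERN = re.compile(
    r"unexpected op at (?P<where>\S+) .*?"
    r"\((?P<found>(?:[0-9a-f]{2} ?)+) != (?P<expect>(?:[0-9a-f]{2} ?)+)\)\) "
    r"size:(?P<size>\d+) type:(?P<type>\d+)"
)


def describe(insn):
    """Name the two 2-byte x86 encodings that matter here, nothing more."""
    if insn == b"\x66\x90":
        return "2-byte NOP"
    if len(insn) == 2 and insn[0] == 0xEB:
        # 0xEB is JMP rel8; the second byte is a signed displacement.
        rel = int.from_bytes(insn[1:], "little", signed=True)
        return f"short JMP, rel8={rel:+#x}"
    return "unrecognized"


def decode(line):
    """Return the parsed fields of a jump_label BUG line, or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    size = int(m.group("size"))
    # Only the first `size` bytes belong to the patched instruction.
    found = bytes.fromhex(m.group("found"))[:size]
    expect = bytes.fromhex(m.group("expect"))[:size]
    return {
        "where": m.group("where"),
        "type": int(m.group("type")),
        "found": found,
        "expect": expect,
        "found_desc": describe(found),
        "expect_desc": describe(expect),
    }


if __name__ == "__main__":
    line = ("jump_label: Fatal kernel bug, unexpected op at "
            "swap_writepage+0x1c/0x60 [(____ptrval____)] "
            "(eb 1c 48 89 df != 66 90 0f 1f 00)) size:2 type:1")
    print(decode(line))
```

Under those assumptions, the line from this report decodes to a short JMP (`eb 1c`) sitting where a 2-byte NOP (`66 90`) was expected, i.e. the site already looked "enabled" before the static key was flipped.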

yshui commented 2 years ago

Hmm, there is another problem. I worked around this issue by disabling ZSWAP, but the resulting kernel does not boot on real hardware, although it does boot in QEMU.

0n-s commented 1 year ago

Chiming in as I've also been able to reproduce this exact issue earlier, even with an LLVM toolchain built from Git sources that were only a business week old (some of the toolchains from that time had other problems, like miscompiling sha512_ssse3, but that's not really relevant here & is already fixed in newer revisions).

This seems to no longer be reproducible. I've built a kernel with CONFIG_ZFS=y, CONFIG_JUMP_LABEL=y, & CONFIG_ZSWAP=y (let me know if you would like the exact dotconfig) with fairly fresh LLVM & Clang main, specifically bfc812a2f32698ef383d486c25fa6abc001d6466, with both full & thin LTO (I had previously reproduced the same issue with full LTO as well). The kernel boots just fine on QEMU, Cloud Hypervisor, & several pieces of real HW (a smorgasbord of x86_64 stuff).
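
To check whether a given dotconfig matches the combination described here, something like the following sketch could be used. It assumes the usual `.config` text format (`CONFIG_FOO=y` lines and `# CONFIG_FOO is not set` comments). `CONFIG_LTO_CLANG_THIN` and `CONFIG_LTO_CLANG_FULL` are not named in this thread; they are assumed to be the symbols selecting thin and full Clang LTO. The helper names `parse_config` and `matches_report` are made up for this example.

```python
def parse_config(text):
    """Parse kernel .config text into a {symbol: value} dict.

    Symbols that appear as '# CONFIG_FOO is not set' are recorded as 'n'.
    """
    suffix = " is not set"
    opts = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            opts[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            opts[name] = value
    return opts


def matches_report(opts):
    """True if the config has Clang LTO plus the three built-in options above."""
    # Assumption: these two symbols select thin and full Clang LTO.
    lto = (opts.get("CONFIG_LTO_CLANG_THIN") == "y"
           or opts.get("CONFIG_LTO_CLANG_FULL") == "y")
    builtin = all(opts.get(name) == "y"
                  for name in ("CONFIG_ZFS", "CONFIG_JUMP_LABEL", "CONFIG_ZSWAP"))
    return lto and builtin


if __name__ == "__main__":
    # Run from the top of a configured kernel tree.
    with open(".config") as f:
        print(matches_report(parse_config(f.read())))
```

Note that `CONFIG_ZFS=m` deliberately does not count, since the original report only triggers with ZFS built in.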

I'm not really sure which commit fixed it, but the commit I used is no longer affected by this bug, at least from my (admittedly somewhat minimal) testing.

Versions of other things:

yshui commented 1 year ago

@0n-s thanks, I will try to repro again later.

KyunLFA commented 1 year ago

Hi. I seem to be hitting this same issue, or an extremely similar one. Everything is the same except that, instead of ZFS, I'm using another out-of-tree CoW filesystem, Bcachefs (testing branch).

It is built-in as well, with ZSWAP=y and JUMP_LABEL=y, on 6.6 git master.

I will try using Clang+LLVM 18 main to see if it's still reproducible there.

The main problem for me is that it does not produce any kernel output on boot, but compiling with CFI and LTO off reveals very similar kernel errors to the ones listed above, especially on non-clean unmounts.

FYI, there are parts of ZFS and Bcachefs that overlap, to the extent of having this many issues with the keyword "bcachefs" (wild guess: might this have something to do with kmem_cache_alloc?) (https://github.com/openzfs/zfs/pull/15143)

Kernel commit: b9ddbb0cde2adcedda26045cc58f31316a492215
LLVM/Clang version: 17.0.2 stable
Bcachefs commit: e1aae900a671cad3ed51c252a0dda0c7e8a89362
OS: Chimera Linux rolling