Open tsautereau-anssi opened 2 years ago
I've been having my own issue for quite some time now, and I finally decided to track it down, which led me here. OP seems correct, but there is more to the story, as the verify function is bugged during this sequence. My computer works perfectly fine, except that every 3-4 weeks one of these crashes takes down the machine. I have captured it with pstore and netconsole, and it is always the same code path: prep_new_page calls verify_zero_highpage and hits BUG_ON(memchr_inv(kaddr, 0, PAGE_SIZE)). It seems almost as if the page it just got is invalid. Please advise. I don't have a reproducer, but it eventually triggers. I'm willing to recompile with some debug code to pin it down.
Oops#1 Part2
<7>[932535.670960] RAX: ffffa053ed0f70c9 RBX: ffffd59017b43dc0 RCX: ffffa053ed0f70c9
<7>[932535.670963] RDX: 0000000000001000 RSI: 0000000000000000 RDI: 0000000000000000
<7>[932535.670965] RBP: ffffd59017b43dc0 R08: 0101010101010101 R09: 0000000000000080
<7>[932535.670967] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
<7>[932535.670969] R13: 0000000000000001 R14: ffffa04f706f5a00 R15: ffffd59017b43e00
<7>[932535.670972] FS: 00006fc01d878bc0(0000) GS:ffffa0559fb40000(0000) knlGS:0000000000000000
<7>[932535.670975] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<7>[932535.670977] CR2: 00006fbfe932a000 CR3: 0000000366b2a000 CR4: 0000000000750ee0
<7>[932535.670979] PKRU: 55555554
<7>[932535.670981] Call Trace:
<7>[932535.670987] ? __die_body.cold+0x1a/0x1f
<7>[932535.670991] ? die+0x2a/0x50
<7>[932535.670995] ? do_trap+0x83/0x100
<7>[932535.670998] ? do_error_trap+0x65/0x80
<7>[932535.671000] ? prep_new_page+0xf6/0x150
<7>[932535.671005] ? exc_invalid_op+0x49/0x60
<7>[932535.671007] ? prep_new_page+0xf6/0x150
<7>[932535.671010] ? asm_exc_invalid_op+0x12/0x20
<7>[932535.671014] ? prep_new_page+0xf6/0x150
<7>[932535.671017] get_page_from_freelist+0xa45/0x1970
<7>[932535.671022] __alloc_pages_nodemask+0x156/0x2f0
<7>[932535.671026] handle_mm_fault+0x57b/0x14e0
<7>[932535.671031] do_user_addr_fault+0x166/0x3a0
<7>[932535.671034] exc_page_fault+0x78/0x160
<7>[932535.671038] ? asm_exc_page_fault+0x8/0x30
<7>[932535.671097] asm_exc_page_fault+0x1e/0x30
<7>[932535.671100] RIP: 0033:0x6fc01daf46a4
<7>[932535.671169] Code: 00 0f 1f 44 00 00 c5 fe 6f 4e 20 f7 c1 00 0e 00 00 75 65 49 89 c9 48 8d 4c 16 ff 48 83 ce 3f 4a 8d 7c 0e 01 48 29 f1 48 ff c6 <f3> a4 c4 c1 7e 7f 00 c4 c1 7e 7f 48 20 c5 f8 77 c3 66 66 2e 0f 1f
Oops#1 Part3
<7>[932535.670915] ------------[ cut here ]------------
<2>[932535.670929] kernel BUG at include/linux/highmem.h:290!
<7>[932535.670937] invalid opcode: 0000 [#1] SMP NOPTI
<7>[932535.670941] CPU: 5 PID: 6065 Comm: Isolated Web Co Tainted: G O T 5.10.202-gentoo-hardened1-ZEN3iGPU-REV10 #213
<7>[932535.670944] Hardware name: ASRock X570 Phantom Gaming 4, BIOS P4.30 02/23/2022
<7>[932535.670953] Code: 48 89 df 48 2b 3d 7a d5 d3 00 31 f6 ba 00 10 00 00 48 c1 ff 06 48 c1 e7 0c 48 03 3d 74 d5 d3 00 e8 bf f9 2a 00 48 85 c0 74 bf <0f> 0b e9 23 00 00 00 e9 3c 00 00 00 f7 44 24 04 00 01 00 00 0f 84
<7>[932535.670957] RSP: 0000:ffffb1d803853c50 EFLAGS: 00010286
PAGE_SANITIZE_VERIFY as currently implemented in linux-hardened does not handle KASAN correctly. It misses a call to kasan_disable_current() (resp. kasan_enable_current()) right before (resp. after) the call to memchr_inv() in verify_zero_highpage() because we are reading memory that is still poisoned by KASAN. In addition and for the same reason, the virtual kernel address passed to memchr_inv() must be untagged beforehand via a call to kasan_reset_tag().

I noticed this as I was rebasing linux-hardened onto v5.18, after reviewing several changes made to KASAN-related code in post_alloc_hook(), which made me wonder why KASAN wasn't complaining about our use of memchr_inv() on poisoned pages. Long story short, it turns out that KASAN instrumentation of lib/string.c (where memchr_inv() is defined) is simply turned off when CONFIG_AMD_MEM_ENCRYPT is enabled, which is actually the case for Arch Linux's linux-hardened package as well as for the numerous test builds that my colleague @nbouchinet-anssi was generously doing to help us verify our various hypotheses.

Note that people trying to run a linux-hardened kernel on AArch64 with the default config plus hardware tag-based KASAN (which requires ARMv8.5 MTE), if any, would have encountered an error regardless of their CONFIG_AMD_MEM_ENCRYPT setting since this KASAN mode does not use compiler instrumentation to insert validity checks.
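
To make the proposed call ordering concrete, here is a minimal user-space sketch, not the actual linux-hardened patch. The four helpers below are simplified stand-ins that only borrow the names of the kernel functions mentioned above; kasan_depth, PAGE_SIZE as defined here, and page_is_zeroed() are hypothetical names introduced for illustration, and the sketch returns a bool where the kernel code would hit BUG_ON().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096u /* assumption: 4 KiB pages, as on x86-64 */

/* User-space stand-ins for the kernel's KASAN helpers (not the real ones). */
static int kasan_depth; /* counts nested disable calls in this mock */

static void kasan_disable_current(void) { kasan_depth++; }
static void kasan_enable_current(void) { kasan_depth--; }

/* In this mock there are no pointer tags, so the address is returned as-is. */
static void *kasan_reset_tag(const void *addr) { return (void *)addr; }

/* Simplified memchr_inv(): first byte that differs from c, or NULL if none. */
static void *memchr_inv(const void *start, int c, size_t bytes)
{
	const unsigned char *p = start;

	for (size_t i = 0; i < bytes; i++)
		if (p[i] != (unsigned char)c)
			return (void *)(p + i);
	return NULL;
}

/*
 * Ordering described in the report: suppress KASAN reports, scan the
 * untagged address, then re-enable reports. Returns true if the whole
 * page is zero.
 */
static bool page_is_zeroed(const void *kaddr)
{
	void *bad;

	kasan_disable_current();
	bad = memchr_inv(kasan_reset_tag(kaddr), 0, PAGE_SIZE);
	kasan_enable_current();

	return bad == NULL;
}
```

Calling page_is_zeroed() on a zero-filled PAGE_SIZE buffer returns true, and setting any single byte to a non-zero value makes it return false. In both cases the disable and enable calls stay balanced, which is the property the real change would need to preserve.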