Open Wowfunhappy opened 2 years ago
@krackers A race condition is the worst-case scenario I had imagined...😔 OK, I'll embed debug code in the remaining candidates.
@Wowfunhappy Ah, got it, there's a predecessor MessagePump mechanism. I'll try to revert to the POSIX one and see what happens.
I tried to revert to MessagePumpLibevent, but ChannelMac, CurrentIOThread and app_shim are too deeply dependent on MessagePumpKqueue...
MessagePumpForIO cannot be changed to MessagePumpLibevent independently, so the current IPC mechanism needs to be nearly overhauled.
I think this method is a bit harsh at the moment...:pensive:
Oh well, thanks for trying!
I have discovered that I can consistently trigger the problem by idling on this awful page: https://gamebanana.com/games/5866. Make sure you are not using an adblocker. There should be an autoplaying video in the lower right of the screen.
The problem usually appears within 90 minutes. I do not know whether the tab needs to be in front, but the crash can occur when the window is entirely minimized.
When Chromium is launched via the Terminal, these messages are always logged at the time of the crash:
[22670:14595:0403/121124.951621:FATAL:message_pump_kqueue.cc(442)] Check failed: . kevent64: Bad file descriptor (9)
[1965:1287:0403/121126.049808:ERROR:network_service_instance_impl.cc(978)] Network service crashed, restarting service.
I assume the first message is printed by this PCHECK. I don't think this tells us anything we didn't already know; Chromium got a bad file descriptor in MessagePumpKqueue::DoInternalWork.
However, I find the second message somewhat more notable. It looks like Network service may be the responsible party?
Note that I have observed instances of the crash in which only the first message is printed, and not the second. However, when the crash occurs on the specific page I linked above, the second message is always present.
@krackers Does this look notable to you at all? https://github.com/blueboxd/chromium-legacy/commit/6a57f2bc3f617c441be7e8e8891919e983101c28
Annoyingly, bug 640281 is private, but it has something to do with guarding against invalid file descriptors, as well as the use of kqueue message pump.
I spotted this because Google recently changed the code for modern macOS compatibility, presumably causing https://github.com/blueboxd/chromium-legacy/issues/63. I don't have a complete theory as to why it would only have been failing on Chromium Legacy.
That change is from 2016 though? And adding the guards should only help things, not hurt us.
Well, I was thinking the guards might not be working for some reason. Or, we could potentially add more.
You could trace all close syscalls I guess (maybe using dtrace) and see what fd is being used
Okay @krackers, if you're still willing to help (I would need a lot of hand holding), I'd like to try making a kext patch. The more I look into the crash at the Chromium level, the more unfeasible fixing it there seems to be. I think there are actually a lot of different code paths within Chromium which can trigger the panic, because I keep finding command line switches that will remove the crash on one page, but not somewhere else.
You suggested starting with a kext to find the address of _kqueue_scan_continue and dumping the bytes to confirm the output is correct, using kaslr_slide from https://github.com/lilang-wu/EasyBP/blob/d2d9591417df52b94d21945efb8cea393dc46a9b/makedebugpoint/solve_kernel_symbols/kernel_info.c. However, this relies on a kernel function, vm_kernel_unslide_or_perm_external, which AIUI doesn't exist prior to El Capitan. Any idea how to get that address?
I also found https://github.com/leiless/ksymresolver which I think may be better; it's said to be able to resolve non-exported symbols. But it too relies on vm_kernel_unslide_or_perm_external.
(Note that I may be away from my computer this weekend, but I'm done with finals so I should have more time in general!)
@wowfunhappy
doesn't exist prior to El Capitan
Interesting, thanks for pointing that out. Based on [1] as well as digging around other kexts, seems like without that function there are only 2 options for discovering the kaslr slide: either "back-read memory to find 0xfeedface" as mentioned in [1] (which I understand to be basically scanning the kernel space for the mach-o header to find the start, which seems to be what Lilu does?), or getting the runtime address of a known exported symbol (e.g. vnode_close) and subtracting from it the address of the symbol in the on-disk binary, as done in [2]. Interestingly prior to 10.11 there was a kas_info() syscall to get this info from userspace [3], although I tried searching and couldn't find a way to read vm_kernel_slide from kernel-space directly (or maybe it exists and I'm not doing it right).
I think the easiest will be to use the approach used in [2] where we basically set
#define VNODE_CLOSE_BINARY_ADDR 0xf00 // Open up kernel in hopper and put the address of the exported vnode_close symbol here
kaslr_offset = (mach_vm_address_t) &vnode_close - VNODE_CLOSE_BINARY_ADDR;
Then once we have the kaslr offset, we can do
#define KQUEUE_SCAN_CONTINUE_BINARY_ADDR 0x00f // Address of symbol in text segment as found in Hopper
kqueue_scan_continue = kaslr_offset + KQUEUE_SCAN_CONTINUE_BINARY_ADDR;
uint8_t *kqueue_scan_continue_bytes = (uint8_t *) kqueue_scan_continue;
Then hopefully do an IOLog("Dumping bytes at kqueue_scan_continue %02x %02x %02x", kqueue_scan_continue_bytes[0], kqueue_scan_continue_bytes[1], kqueue_scan_continue_bytes[2]) and check if the hexdump matches what you see in the disassembler for the function
[1] https://www.zdziarski.com/blog/?p=6901 [2] https://github.com/cocoahuke/shrink_trackpad [3] https://github.com/gdbinit/kextstat_aslr/
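The slide arithmetic from approach [2] can be sketched in plain userspace C. This is only an illustration of the subtraction and addition involved; the function names are made up for this example, and in a real kext the runtime address would come from taking the address of an exported symbol such as vnode_close.

```c
#include <assert.h>
#include <stdint.h>

/* Slide = where a known symbol actually lives at runtime, minus where the
 * on-disk binary says it lives. (Hypothetical helper name.) */
uint64_t kaslr_slide_from_symbol(uint64_t runtime_addr, uint64_t binary_addr)
{
    assert(runtime_addr >= binary_addr); /* the kernel is only slid upwards */
    return runtime_addr - binary_addr;
}

/* Apply the slide to any other on-disk address to get its runtime address. */
uint64_t slid_address(uint64_t slide, uint64_t binary_addr)
{
    return binary_addr + slide;
}
```

With the slide in hand, any address read out of Hopper can be converted to a runtime address by passing it to slid_address.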
This blog post also shows a more programmatic way to do approach (2):
http://ho.ax/posts/2012/02/resolving-kernel-symbols/
Basically manually scanning the LC_SYMTAB section of the binary to find the address of the symbol in the binary instead of just hardcoding after finding via Hopper as I suggested. There's an attached github project you can just use. Although I personally think just hardcoding the address of the symbol in the binary is cleaner since there's no need to waste cpu cycles scanning through the list when we know which one it's going to be.
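For reference, the scan itself is just a linear walk over symbol entries, comparing names from the string table. The sketch below uses a simplified, hypothetical sym_entry struct rather than the real Mach-O nlist_64 layout, and leaves out locating LC_SYMTAB in the load commands.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for a symbol table entry: an offset into the string
 * table plus the symbol's on-disk address. Not the real nlist_64 layout. */
struct sym_entry {
    uint32_t name_offset;
    uint64_t value;
};

/* Linear scan; returns the symbol's address, or 0 when it is absent. */
uint64_t find_symbol(const struct sym_entry *syms, size_t nsyms,
                     const char *strtab, const char *name)
{
    assert(syms != NULL && strtab != NULL && name != NULL);
    for (size_t i = 0; i < nsyms; i++) {
        if (strcmp(strtab + syms[i].name_offset, name) == 0)
            return syms[i].value;
    }
    return 0;
}
```

A symbol that was inlined away simply never matches, so the caller has to treat a 0 result as "fall back to a hardcoded address".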
Thank you! I got it working!
https://gist.github.com/Wowfunhappy/8212f5bea4c601ac9a6297789f232321
This outputs 6b 71 75, which matches what is in Hopper.
I initially used your method of getting the slide (which was invaluable, as it gave me a known-good result), but I managed to replace it with code adapted from Lilu, so I could hardcode only one address instead of two. Long term, I'd eventually like to make this work across different builds of XNU. (Long term being the key word, the address of kqueue_scan_continue is still hardcoded...)
What is the next step?
Edit: Eek, we posted at almost the exact same time!
Wow that was fast, nice job! Fyi you can also remove the hard-coded KQUEUE_SCAN_CONTINUE_BINARY_ADDR and discover it on-the-fly by scanning LC_SYMTAB as I mentioned in the previous post.
One thing though, the address of kqueue_scan_continue you used seems to be in the strings section, not in the text segment. I loaded it in hopper, and it seems like there's no separate kqueue_scan_continue function, I guess it was inlined during compilation.
You should use ffffff80005c79e0 instead, which is the function that contains the call to _panic we want to avoid. (Since it won't be in LC_SYMTAB either, I guess you can't avoid hardcoding it after all).
It contains the line we want to avoid:
_panic("\"%s: - invalid wait_result (%d)\"@/SourceCache/xnu/xnu-2422.115.15/bsd/kern/kern_event.c:2167", "kqueue_scan_continue", LODWORD(r15), rcx, r8, r9, STK-1);
Now the next step is to find which bytes we need to patch. Recall the original switch statement was
switch (wait_result) {
case THREAD_AWAKENED:
    kqlock(kq);
    error = kqueue_process(kq, cont_args->call, cont_args, &count,
        current_proc());
    if (error == 0 && count == 0) {
        wait_queue_assert_wait((wait_queue_t)kq->kq_wqs,
            KQ_EVENT, THREAD_ABORTSAFE, cont_args->deadline);
        kq->kq_state |= KQ_SLEEP;
        kqunlock(kq);
        thread_block_parameter(kqueue_scan_continue, kq);
        /* NOTREACHED */
    }
    kqunlock(kq);
    break;
case THREAD_TIMED_OUT:
    error = EWOULDBLOCK;
    break;
case THREAD_INTERRUPTED:
    error = EINTR;
    break;
default:
    panic("%s: - invalid wait_result (%d)", __func__,
        wait_result);
    error = 0;
}
and the decompiled-assembly is
if (LODWORD(r15) != 0x2) {
    LODWORD(r12) = 0x23;
    if (LODWORD(r15) != 0x1) {
        if (LODWORD(r15) == 0x0) {
            r15 = r14 + 0x8;
            _lck_spin_lock(r15);
            r12 = *(rbx + 0x68);
            _current_proc();
            LODWORD(r12) = LODWORD(sub_ffffff80005c7390(r14, r12, rbx + 0x68, var_2C));
            if (LODWORD(LODWORD(var_2C) | LODWORD(r12)) == 0x0) {
                rcx = *(r13 + 0x78);
                _wait_queue_assert_wait(*r14, 0x0, 0x2, rcx, r8, r9, STK-1);
                *(int8_t *)(r14 + 0x58) = *(int8_t *)(r14 + 0x58) | 0x2;
                _lck_spin_unlock(r15);
                _thread_block_parameter(sub_ffffff80005c79e0, r14, 0x2, rcx, r8, r9);
            }
            _lck_spin_unlock(r15);
        }
        else {
            _panic("\"%s: - invalid wait_result (%d)\"@/SourceCache/xnu/xnu-2422.115.15/bsd/kern/kern_event.c:2167", "kqueue_scan_continue", LODWORD(r15), rcx, r8, r9, STK-1);
            LODWORD(r12) = 0x0;
        }
    }
}
else {
    LODWORD(r12) = 0x4;
}
I was originally thinking we could try inserting a trampoline to our own code in the else of if (LODWORD(r15) == 0x0) { in order to add the additional check for EBADF, but we can also do the lazy thing and just assume that any unhandled case must be EBADF, noop the call to panic, and set the code to EBADF instead of 0x0. (Basically we'd be betting that there's no way there's another unhandled wait_result type. There are 2 remaining wait_result_t types, THREAD_WAITING and THREAD_NOT_WAITING, but given that even the latest xnu source doesn't add anything else, I guess it's a safe bet that we won't see those two types here.)
If we do the former approach, we use one of the trampolining libraries I think I mentioned in one of the previous posts. Let me know if you want to do this approach, and we can go in more detail.
If we do the latter, we basically need to rewrite
ffffff80005c7a9c lea rdi, qword [ds:0xffffff8000766e5e] ; "\\\"%s: - invalid wait_result (%d)\\\"@/SourceCache/xnu/xnu-2422.115.15/bsd/kern/kern_event.c:2167", argument #1 for method _panic, XREF=sub_ffffff80005c79e0+68
ffffff80005c7aa3 lea rsi, qword [ds:0xffffff8000766ebb] ; "kqueue_scan_continue", argument #2 for method _panic
ffffff80005c7aaa mov edx, r15d ; argument #3 for method _panic
ffffff80005c7aad xor al, al
ffffff80005c7aaf call _panic
ffffff80005c7ab4 xor r12d, r12d
to be mov r12d, (whatever the numeric value of EBADF is, I think it's 0x9 from errno.h but please double check) followed by a bunch of nops to fill out the remaining space.
If we do this approach then it's easier (just a bit tedious though). We can do something like
uint8_t replacement_bytes[] = {0x41, 0xBC, 0x09, 0x00, 0x00, 0x00}; // assembled mov r12d, 0x9 with padding to match existing
memcpy((void *)(kaslr_base + 0xffffff80005c7a9c), replacement_bytes, 6);
memset((void *)(kaslr_base + 0xffffff80005c7aa3), 0x90 /*noop*/, 0xffffff80005c7ab4 - 0xffffff80005c7aa3);
And then maybe dump out the bytes after replacing to make sure it's what you want.
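To sanity-check the byte arithmetic before touching kernel memory, the same overwrite can be rehearsed on an ordinary buffer in userspace. The sketch below is an assumption-laden illustration, not the kext code: patch_block is a hypothetical helper, and it NOP-fills from the end of the replacement up to the end of the block, so no stale byte of the old first instruction is left behind.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define X86_NOP 0x90

/* Overwrite block[0..block_len) with `repl`, then NOP-fill the remainder.
 * Returns 0 on success, -1 if the replacement does not fit. */
int patch_block(uint8_t *block, size_t block_len,
                const uint8_t *repl, size_t repl_len)
{
    assert(block != NULL && repl != NULL);
    if (repl_len > block_len)
        return -1;
    memcpy(block, repl, repl_len);
    memset(block + repl_len, X86_NOP, block_len - repl_len);
    return 0;
}
```

Going by the addresses in the disassembly above, the span from ffffff80005c7a9c up to the end of the 3-byte xor r12d, r12d at ffffff80005c7ab4 is 27 bytes. Whether to include that trailing xor in the block is a judgment call; if it is left in place it will zero r12d again right after the new mov.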
I think we also may need to disable interrupts before doing patching. I found this code in lilu
bool MachInfo::setInterrupts(bool enable) {
    unsigned long flags;
    if (enable)
        asm volatile("pushf; pop %0; sti" : "=r"(flags));
    else
        asm volatile("pushf; pop %0; cli" : "=r"(flags));
    return static_cast<bool>(flags & EFL_IF) != enable;
}
which probably does what we want, but actually seems like apple already provides us ml_set_interrupts_enabled which is the exact same code (probably cleaner to use apple's one if it's already there): https://developer.apple.com/documentation/kernel/1593365-ml_set_interrupts_enabled
I also don't know if there is any sort of W^X in kernel land we need to bypass. I guess you can try it and if it doesn't allow us to write there'll be a kernel panic or something when we try to do so.
Maybe take a look at Lilu Kernel patching to see if there's any other steps we need:
there's a MachInfo::setKernelWriting in there, looks like in addition to disabling interrupts we need to unset a write protect bit. So I guess there is indeed W^X in kernel space (makes sense). Still not sure why they decide to flip CR0 instead of using whatever kernel-space equivalent of vm_protect there is, but I guess the former is easier.
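The bit in question is WP, bit 16 of CR0. Reading and writing CR0 itself needs ring 0, so this sketch only shows the mask arithmetic on a plain integer; cr0_with_wp and CR0_WP_BIT are names made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CR0_WP_BIT (UINT64_C(1) << 16) /* write-protect flag in CR0 */

/* Return the CR0 value with the WP bit set or cleared, all other bits kept. */
uint64_t cr0_with_wp(uint64_t cr0, bool enable)
{
    uint64_t result = enable ? (cr0 | CR0_WP_BIT) : (cr0 & ~CR0_WP_BIT);
    assert(((result & CR0_WP_BIT) != 0) == enable);
    return result;
}
```

In the kext, the returned value would be what gets written back to CR0, with interrupts disabled around the whole clear/patch/restore sequence.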
Thank you! Should be back at my computer tomorrow...
One thing though, the address of kqueue_scan_continue you used seems to be in the strings section, not in the text segment. I loaded it in hopper, and it seems like there's no separate kqueue_scan_continue function, I guess it was inlined during compilation.
Thanks, I was actually thinking that seemed off...
Fyi you can also remove the hard-coded KQUEUE_SCAN_CONTINUE_BINARY_ADDR and discover it on-the-fly by scanning LC_SYMTAB as I mentioned in the previous post.
To confirm, this would basically be the same as how I found the slide value, right? Start with a normal exported symbol and then scan forwards until I find what I'm looking for. Since the actual kqueue_scan_continue function was inlined and the symbol doesn't exist, I guess I could just look for the hex sequence of the panic message?
but we can also do the lazy thing and just assume that any unhandled case must be EBADF, noop the call to panic, and set the code to EBADF instead of 0x0. (Basically we'd be betting that there's no way there's another unhandled wait_result type. There are 2 remaining wait_result_t types, THREAD_WAITING and THREAD_NOT_WAITING, but given that even the latest xnu source doesn't add anything else, I guess it's a safe bet that we won't see those two types here.)
Totally on board with this approach!
I suppose it is worth considering... in the scenario where the kernel did send either THREAD_WAITING or THREAD_NOT_WAITING, how inappropriate would it be to send an EBADF (bad file descriptor)? Is there a more generic error we should consider returning instead? We have some options in https://www.freebsd.org/cgi/man.cgi?query=errno&sektion=2&manpath=freebsd-release-ports, assuming of course that XNU is the same.
(I do find it interesting that there are other unhandled types. Maybe that indicates this bug wasn't quite as stupid as it initially appeared? E.g., maybe Apple had a reason to think this shouldn't ever happen, and didn't just blatantly forget to handle a case.)
Start with a normal exported symbol and then scan forwards until I find what I'm looking for.
I guess I could just look for the hex sequence of the panic message?
The panic message is unfortunately also in the strings segment. So I'm not really sure what's the best way to discover the offset we want to patch at runtime. The best option just seems to be hardcoding the offset (discovered via xref using hopper). Alternatively you could try scanning for the specific instructions we want to replace, but I don't know if that's really any better than hardcoding since there's no guarantee the instructions would be the same between kernel versions (e.g. in another version compiler might have reordered the switch statement differently, or used some other register instead of r12d for return code).
in the scenario where the kernel did send either THREAD_WAITING or THREAD_NOT_WAITING
I think EBADF should probably be fine there, since the userspace code can handle it as if the syscall failed. But anyway from what I can tell it's not something that should ever happen anyway for 2 reasons:
1) The empirical evidence that we've never seen a kernel panic implicating that type, and Apple also hasn't updated recent kernels to add it as a case
2) Looking at the code flow, it seems like if we enter kqueue_scan_continue the thread must have been put into a waiting state and now it's not waiting anymore due to some interrupt, context switch, etc., which means that neither THREAD_WAITING nor THREAD_NOT_WAITING would be valid values
maybe Apple had a reason to think this shouldn't ever happen, and didn't just blatantly forget to handle a case
Yes it's probably that. Maybe the invariant was true at some point but later it became violated. That's why I think doing an exhaustive switch statement is usually better, since it forces you to confront all possibilities rather than sweeping them under the rug with default. And it forces you to document all the implicit assumptions for "unreachable" code.
@krackers So I'm here at the moment:
https://gist.github.com/Wowfunhappy/8212f5bea4c601ac9a6297789f232321
When I comment out lines 55 and 56 (to dump the bytes again and see if they changed), the kext loads fine. However, with those lines—to see if it actually did anything—it KP's.
I haven't done the W^X thing yet because I figured I'd try without it first—but, why would the kernel not panic unless I try to actually print the bytes I modified? If you think write protection is the problem, I'll do that next, as long as the rest of the code so far looks correct to you.
I'm still trying to get rid of hard coded addresses...
memset'ing 0xffffff80005c7ab4 - 0xffffff80005c7aa3?
I care about this because I eventually want to integrate the kext into my PrefPane. If I can't find the offset dynamically, my backup plan is to grab them from hopper for a bunch of different kernels (at minimum, the final releases of Lion, Mountain Lion, and Mavericks plus Bronya's Ryzen builds), and select the correct one at runtime by seeing which address has the bytes I expect.
You want to do
memcpy(kqueue_scan_continue_panic_location, replacement_bytes, 6);
memset(kqueue_scan_continue_panic_location + 7, 0x90 /*noop*/, 0xffffff80005c7ab4 - 0xffffff80005c7aa3);
Reason is that kqueue_scan_continue_panic_location is already a ptr (vm_offset_t is typedef'd to a ptr type). If you do memcpy to &kqueue_scan_continue_panic_location you are actually overwriting the value of kqueue_scan_continue_panic_location itself, not writing to the location pointed to by it. Or in other words, you are changing a value on the stack instead of actually modifying the text segment (possibly smashing your stack as well if the number of bytes written is more than allocated for this frame). That is why you get a segfault when you try printing subsequently, because you've now rewritten kqueue_scan_continue_panic_location to the value 0x90909090 or something like that, and it's probably pointing to garbage so dereferencing via *kqueue_scan_continue_panic_location creates a segfault.
You also need to flip the CR0 write protect bit via the previously mentioned function.
Also btw you probably want to print out the entire basic block after modifying, not just the first few values.
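The difference between the two memcpy targets is easy to reproduce in userspace. A small sketch with hypothetical helper names, where uintptr_t stands in for vm_offset_t: write_to_variable shows what passing the address of the variable does, and write_through shows the intended behaviour.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Intended: treat the integer as an address and write to what it points at. */
void write_through(uintptr_t location, const uint8_t *src, size_t len)
{
    assert(location != 0);
    memcpy((void *)location, src, len);
}

/* The mistake: the bytes land in the variable that holds the address.
 * `src` must provide at least sizeof(uintptr_t) bytes. Returns the
 * clobbered address value; the pointed-to memory is untouched. */
uintptr_t write_to_variable(uintptr_t location, const uint8_t *src)
{
    memcpy(&location, src, sizeof location); /* overwrites the address itself */
    return location;
}
```

After write_to_variable, the returned "address" is just a run of 0x90 bytes, which is why dereferencing it afterwards faults.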
Can we dynamically find the size of the procedure
Yes, in principle. In general what we care about is the size of the basic block, and I suppose if you pulled in some disassembler (there are simple x64 opcode disassemblers) you can scan forward until you find a BB edge (jmp, jne, je, etc.). I think for this specific case it should suffice to scan forward until you see an xor after a call to panic (still need to pull in an x64 disassembler though)
but how difficult would it be to find the function address the same way I can in Hopper
Simple in principle, but maybe tedious in practice. You basically just do it the same way Hopper (or any static analysis tool) does it, by scanning instructions and finding a lea with an address operand that references the string. But to do this we need to pull in a disassembler, and maybe a bit of mach-o header parsing (since x64 is all relative addressing so we need to know the offset between text and data segment.. it's probably fixed, but I don't know for sure). There are some simple library disassemblers (e.g. Zydis) which you can use for this.
Maybe it's fun to play around with, but I personally am lazy and think it's easier to just have the user disassemble their kernel and recompile the kext with the new constants. Reason is that even if you are able to get away with discovering the basic block size and offset at runtime, it's still not guaranteed that return value will be in r12d. So you also need to discover that at runtime, and basically pattern match for something like
lea _, <panic string addr>
...
call _panic
xor retVal, retVal
I suppose technically you could try just pattern matching the assembled opcodes directly, which means you don't need to bring in a disassembler. But then you need to check how the instructions are assembled (e.g. if we want to match xor r12d, r12d or more generically xor-ing some register to zero, figure out which bits in the assembled instruction 45 31 e4 represent the registers and which is the xor opcode). x64 is irregular enough (to me at least) though that this seems even more brittle.
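For the specific bytes 45 31 e4, the breakdown is: 45 is a REX prefix with the R and B bits set, 31 is the opcode for xor r/m32, r32, and e4 is a ModRM byte with mod=3 (register-direct), reg=4 and rm=4. REX.R and REX.B each add 8 to their field, giving register 12 on both sides. A minimal matcher for that one encoding might look like the sketch below; match_xor_self is a hypothetical name, and it deliberately ignores the alternative 0x33 opcode form and the 64-bit REX.W form.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* If `code` starts with `xor r32, r32` (opcode 0x31) where both operands are
 * the same register, store the register number (0-15) in *reg, return true. */
bool match_xor_self(const uint8_t *code, size_t len, unsigned *reg)
{
    assert(code != NULL && reg != NULL);
    size_t i = 0;
    unsigned rex_r = 0, rex_b = 0;

    if (i < len && (code[i] & 0xF0) == 0x40) { /* optional REX prefix */
        if (code[i] & 0x08)
            return false;                      /* REX.W: 64-bit form, not matched */
        rex_r = (code[i] >> 2) & 1;
        rex_b = code[i] & 1;
        i++;
    }
    if (len - i < 2 || code[i] != 0x31)        /* 0x31 = xor r/m32, r32 */
        return false;

    uint8_t modrm = code[i + 1];
    if ((modrm >> 6) != 3)                     /* register-direct only */
        return false;
    unsigned src = ((modrm >> 3) & 7) | (rex_r << 3);
    unsigned dst = (modrm & 7) | (rex_b << 3);
    if (src != dst)
        return false;
    *reg = dst;
    return true;
}
```

Even this narrow case needs prefix handling and two register fields, which gives a feel for how quickly hand-rolled opcode matching grows.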
Oh hah apparently vm_offset_t is actually typedef'd to uintptr_t which isn't really a pointer type, just an integer type. So I guess you need to cast it (to void*), to let the compiler know you actually want to use it as a pointer.
Alright, no finding offsets dynamically! :)
Aaaaand... I think I've done it! @krackers can you look over this?
https://gist.github.com/Wowfunhappy/8212f5bea4c601ac9a6297789f232321
Calling out a couple of things in particular:
I added an additional 0x00 to the end of the replacement_bytes you gave me, it looked right based on hopper and was needed for the numbers to line up. Does this seem okay?
panic(), in places where I didn't want to risk leaving the kernel in an unknown state (e.g. without write protection), do these seem reasonable to you? How is the level of defensiveness in general?
This definitely builds and is replacing the memory, so I'm going to give it a try on my own (non-virtual) machine. I've also attached the kext, but use at your own risk!
I added an additional 0x00 to the end of the replacement_bytes you gave me
I don't think we should do that? The address for the individual instruction isn't supposed to line up because lea rdi, qword is a different instruction size than mov r12d, 0x9. If you throw mov r12d, 0x9 in an assembler the instruction should only be 6 bytes. We add the noops afterwards to fill out the rest of the basic block though, so nothing else gets affected (i.e. the address alignment is restored for everything after the basic block we're patching).
If you added the 00, then the instruction stream is now parsed as
| 0x41, 0xBC, 0x09, 0x00, 0x00, 0x00 | 0x00 | 0x90 | 0x90 |
but then the extra 0x00 is left over after decoding, and that doesn't seem to be valid x64 opcode. I'm quite surprised it didn't panic when you did that...
I guess my comment "assembled mov r12d, 0x9 with padding" was misleading, there's no padding in there, that is the actual assembled form of mov r12d, 0x9. (Maybe I had mistakenly thought r12d was a 64-bit register since the fact that dword is actually only 32-bits always trips me up. The actual version with padding using an imm64 instead of imm32 would be mov r12, 0x9 which is actually 7 bytes long, so maybe you can use that if you really want instruction-level alignment.)
Other than that, nice job! Level of defensiveness seems good, and checking for a match on the expected bytes before replacing is a good idea. Fyi you don't need to define a uint8_t *kscpb2 for the second print bytes, the kscpb pointer from before is still valid and you can just use that.
I don't think your strategy of looking for search_bytes will work in other versions though since the lea operand references the string in DS relative to RIP, and that's pretty much guaranteed to not be the same between versions.
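To make the RIP-relative point concrete: the 32-bit displacement stored in the lea is the target address minus the address of the next instruction, so it changes whenever either the code or the string moves. A small sketch (rip_relative_disp is a hypothetical helper) using the addresses from the Hopper listing earlier in the thread:

```c
#include <assert.h>
#include <stdint.h>

/* disp32 stored in a RIP-relative instruction: the target minus the address
 * of the *next* instruction (instruction address plus its length). */
int32_t rip_relative_disp(uint64_t insn_addr, unsigned insn_len, uint64_t target)
{
    int64_t disp = (int64_t)(target - (insn_addr + insn_len));
    assert(disp >= INT32_MIN && disp <= INT32_MAX); /* must fit in 32 bits */
    return (int32_t)disp;
}
```

For the first lea (7 bytes at ffffff80005c7a9c, targeting the string at ffffff8000766e5e) this works out to 0x19f3bb. A kernel build where the function or the string sits even one byte elsewhere would encode a different displacement, so the same search_bytes would no longer match.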
Also while the replacement logic seems fine to me, probably better to double check by dumping memory from a few bytes before kqueue_scan_continue_panic_start_location to a little after kqueue_scan_continue_panic_end_location, then throw those bytes in a disassembler. You should expect that the bytes right before kqueue_scan_continue_panic_start_location are untouched, the bytes right after kqueue_scan_continue_panic_end_location are untouched, and the middle bytes are as expected.
Ah the reason why it didn't panic when you did that is probably because it hadn't hit it yet. If we did begin executing that codepath, willing to bet that you'd have seen a kpanic.
Once you fix that it should work though, but probably better to do the memory dump mentioned above to make sure there's no off-by-one issues anywhere. You probably have to wait to see if it works or not, checking the chromium crash logs (since you mentioned that with patched kernel returning EBADF, no kpanic but chromium process still crashes).
Oh one more crucial defensive check you can do is add the two following asserts:
kqueue_scan_continue_panic_end_location == kqueue_scan_continue_panic_start_location + sizeof(search_bytes)
kqueue_scan_continue_panic_start_location + sizeof(replacement_bytes) + extra_space_to_fill == kqueue_scan_continue_panic_end_location
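One way to make those checks do real work is to compute the fill size through a helper that refuses inconsistent inputs, because the subtraction is unsigned and would otherwise wrap around to a huge memset length. A sketch, with compute_nop_fill as a hypothetical name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Number of NOP bytes needed after the replacement, or false when the
 * region is inverted or too small to hold the replacement. */
bool compute_nop_fill(uintptr_t start, uintptr_t end, size_t repl_len,
                      size_t *fill_out)
{
    assert(fill_out != NULL);
    if (end < start || (size_t)(end - start) < repl_len)
        return false;
    *fill_out = (size_t)(end - start) - repl_len;
    return true;
}
```

A false return is a signal to bail out of the kext start routine before interrupts or write protection are touched.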
Ah the reason why it didn't panic when you did that is probably because it hadn't hit it yet. If we did begin executing that codepath, willing to bet that you'd have seen a kpanic.
Yep! I came here to say that I'd gotten a kernel panic. :)
For now as a very quick fix, I've made the last replace_byte 0x90 instead of 0x00. The reason I thought that instruction had to be seven bytes was because otherwise I couldn't seem to end up with the right amount for extra_space_to_fill, when I tried to calculate it with only the start, end, and replacement instructions as hardcoded. But I don't remember exactly what I was thinking at the time, and I'm too tired now... 😴
Just waiting for the crash log...
Alright, after another day of futzing with it, I can confirm definitively that this works on 10.9.5 (XNU 2422.115.15). I got the "Bad File Descriptor" crash log from Chromium.
kqueue_scan_continue-tune-2022.06.05.zip
https://gist.github.com/Wowfunhappy/8212f5bea4c601ac9a6297789f232321
Note that this kext uses the bundle identifier foo.tun. This allows it to load from /Library/Extensions despite not being codesigned.
In theory, this should also work on fully-updated copies of Lion (XNU 1699.32.7) and Mountain Lion (XNU 2050.48.19).
Nice, looks good! Fyi in terms of style, to make the main code easy to follow you could factor out all the constants and helpers into the .h file. Also you can probably take advantage of clang blocks to refactor the flow into something much more readable like:
kern_return_t kqueue_scan_continue_tune_start(kmod_info_t * ki, void *d) {
    vm_offset_t kqueue_scan_continue_panic_start_location = 0;
    vm_offset_t kqueue_scan_continue_panic_end_location = 0;
    if (!get_kqueue_scan_continue_locs(&kqueue_scan_continue_panic_start_location, &kqueue_scan_continue_panic_end_location)) {
        return KERN_FAILURE;
    }
    unsigned long extra_space_to_fill = kqueue_scan_continue_panic_end_location -
        kqueue_scan_continue_panic_start_location - sizeof(replacement_bytes);
    assert(kqueue_scan_continue_panic_start_location + sizeof(replacement_bytes) + extra_space_to_fill == kqueue_scan_continue_panic_end_location);
    bool succ = do_with_interrupt_disabled(^bool() {
        return do_with_wp_disabled(^bool() {
            memcpy((void *)kqueue_scan_continue_panic_start_location, replacement_bytes, sizeof(replacement_bytes));
            /* Cast after the addition so the arithmetic is done on the integer address, not on a void pointer. */
            memset((void *)(kqueue_scan_continue_panic_start_location + sizeof(replacement_bytes)), 0x90 /*nop*/, extra_space_to_fill);
            printf("kqueue_scan_continue-tune: Memory rewritten\n");
            return true;
        });
    });
    if (!succ) {
        return KERN_FAILURE;
    }
    printf("Post-patch bytes: ");
    dump_bytes((void*) kqueue_scan_continue_panic_start_location, 40);
    return KERN_SUCCESS;
}
where you have the following helpers defined in the .h
void dump_bytes(void *start, int num_bytes) {
    for (int i = 0; i < num_bytes; i++) {
        printf("%02x ", ((uint8_t*) start)[i]);
    }
    printf("\n");
}

bool get_kqueue_scan_continue_locs(vm_offset_t *start_addr_out, vm_offset_t *end_addr_out) {
    vm_offset_t kernel_base = get_kernel_base();
    vm_offset_t kqueue_scan_continue_panic_start_location = 0;
    vm_offset_t kqueue_scan_continue_panic_end_location = 0;
    char search_bytes[sizeof(possible_search_bytes[0])];
    char replacement_bytes[sizeof(possible_replacement_bytes[0])];
    uint8_t *kscpb = NULL;
    for (int i = 0; i < LENGTH(possible_kqueue_scan_continue_panic_start_locations); i++) {
        kqueue_scan_continue_panic_start_location = kernel_base + possible_kqueue_scan_continue_panic_start_locations[i];
        kqueue_scan_continue_panic_end_location = kernel_base + possible_kqueue_scan_continue_panic_end_locations[i];
        memcpy(search_bytes, possible_search_bytes[i], sizeof(search_bytes));
        memcpy(replacement_bytes, possible_replacement_bytes[i], sizeof(replacement_bytes));
        kscpb = (uint8_t*) kqueue_scan_continue_panic_start_location;
        if (memcmp(kscpb, search_bytes, sizeof(search_bytes)) == 0) {
            break;
        }
        if (i == LENGTH(possible_kqueue_scan_continue_panic_start_locations) - 1) {
            printf("kqueue_scan_continue-tune: Memory region not found. You are probably using an unsupported kernel, or your kernel has already been patched.\n");
            return false;
        }
    }
    printf("kqueue_scan_continue-tune: Pre-Patch: Bytes at kqueue_scan_continue panic location" );
    dump_bytes(kscpb, 40);
    *start_addr_out = kqueue_scan_continue_panic_start_location;
    *end_addr_out = kqueue_scan_continue_panic_end_location;
    return true;
}

bool checked_set_interrupt_enabled(bool newVal) {
    printf("kqueue_scan_continue-tune: Set interrupt enabled to %d\n", newVal);
    ml_set_interrupts_enabled(newVal);
    bool succ = ml_get_interrupts_enabled() == newVal;
    if (!succ) {
        printf("kqueue_scan_continue-tune: Failed to set interrupt status to %d\n", newVal);
    }
    return succ;
}

bool checked_set_wp(bool newVal) {
    printf("kqueue_scan_continue-tune: Set CR0 write protect to %d\n", newVal);
    set_cr0(newVal ? get_cr0() | CR0_WP : get_cr0() & ~CR0_WP);
    bool succ = write_protection_is_enabled() == newVal;
    if (!succ) {
        printf("kqueue_scan_continue-tune: Failed to set write protect status to %d\n", newVal);
    }
    return succ;
}

bool do_with_interrupt_disabled(bool (^func)(void)) {
    boolean_t interrupts_were_enabled = ml_get_interrupts_enabled();
    if (interrupts_were_enabled && !checked_set_interrupt_enabled(false)) {
        return false;
    }
    bool succ = func();
    if (interrupts_were_enabled && !ml_get_interrupts_enabled() && !checked_set_interrupt_enabled(true)) {
        panic("kqueue_scan_continue-tune: Failed to re-enable interrupts!\n");
    }
    return succ;
}

bool do_with_wp_disabled(bool (^func)(void)) {
    boolean_t write_protection_was_enabled = write_protection_is_enabled();
    if (write_protection_was_enabled && !checked_set_wp(false)) {
        return false;
    }
    bool succ = func();
    if (write_protection_was_enabled && !write_protection_is_enabled() && !checked_set_wp(true)) {
        panic("kqueue_scan_continue-tune: Failed to re-enable write protection!\n");
    }
    return succ;
}
Also can you explain more about
Note that this kext uses the bundle identifier foo.tun. This allows it to load from /Extra/Extensions despite not being codesigned.
I didn't even know /Extra/Extensions was a valid place that osx loaded kexts from, nor that it ignores codesigning for tun/tap driver. Do you have a link where I can read more about this?
Also I was never really sure about 10.9 kext codesign requirements. I can manually kextload an unsigned kext fine, but it gives a warning. I haven't tried installing to /Library/Extensions but I assume it would fail to load in that case? But /System/Library/Extensions will accept unsigned? Does ad-hoc signature suffice for /Library/Extensions?
Edit: are you sure /Extra/Extensions works for non-hackintosh? kextd source doesn't seem to show that as a possible location https://opensource.apple.com/source/IOKitUser/IOKitUser-1445.40.1/kext.subproj/OSKextPrivate.h
Btw kext loading is done via kextd which is open source. As described in https://reverse.put.as/2013/11/23/breaking-os-x-signed-kernel-extensions-with-a-nop/ seems like codesigning is indeed enforced only for /Library/Extensions
OSStatus sigResult = checkKextSignature(theKext, true);
if ( sigResult != 0 ) {
    if ( isInLibraryExtensionsFolder(theKext) ||
Also, it looks like ad-hoc signing won't work, since it checks for a specific certificate chain:
/* set up correct requirement string. Apple kexts are signed by B&I while
 * 3rd party kexts are signed through a special developer kext devid
 * program
 */
myCFString = OSKextGetIdentifier(aKext);
if (CFStringHasPrefix(myCFString, __kOSKextApplePrefix)) {
    requirementsString = CFSTR("anchor apple");
}
else {
    /* DevID for kexts cert
     */
    requirementsString =
        CFSTR("anchor apple generic "
              "and certificate 1[field.1.2.840.113635.100.6.2.6] "
              "and certificate leaf[field.1.2.840.113635.100.6.1.13] "
              "and certificate leaf[field.1.2.840.113635.100.6.1.18]" );
}
I didn't even know /Extra/Extensions was a valid place that osx loaded kexts from, nor that it ignores codesigning for tun/tap driver. Do you have a link where I can read more about this?
Oops! When I said /Extra/Extensions I meant /Library/Extensions, I just wrote the wrong thing. (/Extra/Extensions is indeed a dumb Hackintosh convention.)
As you noted, Mavericks will load unsigned kexts from any location except /Library/Extensions, aka the one place where third party kexts logically should live. It's frustrating!
Not only will ad-hoc codesigning not work, but even a paid developer account is not sufficient; you need special approval from Apple.
But, there's a list of bundle identifiers that are allowed to live in /Library/Extensions and be unsigned: /System/Library/Extensions/AppleKextExcludeList.kext/Contents/Info.plist
But, now I'm thinking I may let the kext live in /Library/Extensions, but have a launchdaemon that copies it to a temporary location before loading it, instead of usurping an old bundle identifier...
But, there's a list of bundle identifiers that are allowed to live in /Library/Extensions and be unsigned
I see, AppleKextExcludeList.kext contains both a list of blacklisted kexts which are prevented from loading
https://github.com/st3fan/osx-10.9/blob/master/xnu-2422.1.72/libkern/c++/OSKext.cpp#L4643
as well as a section on whitelisted kexts that are allowed to skip the codesign check:
https://github.com/st3fan/osx-10.9/blob/master/kext_tools-326.1.12/security.c#L1192
Neat, thanks for teaching me something new.
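To summarize the loading rules as they have been described in this thread, here is a minimal sketch of the decision. It is a model only, not kextd's code: `decide_kext_load`, `kext_decision` and `path_has_prefix` are hypothetical names, the two booleans stand in for the real signature and exception-list checks, and treating an exception-listed kext the same as a validly signed one is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

typedef enum {
    KEXT_LOAD,              /* loads normally */
    KEXT_LOAD_WITH_WARNING, /* loads, but the signature problem is logged */
    KEXT_REJECT             /* refused */
} kext_decision;

static bool path_has_prefix(const char *path, const char *prefix) {
    return strncmp(path, prefix, strlen(prefix)) == 0;
}

/* Model of the 10.9 behaviour described in the thread: a bad or missing
 * signature is fatal only for kexts living in /Library/Extensions, and
 * bundle identifiers on the exception list skip the check (assumption:
 * they then load like a signed kext). */
static kext_decision decide_kext_load(const char *path,
                                      bool signature_valid,
                                      bool id_in_exception_list) {
    if (signature_valid || id_in_exception_list) {
        return KEXT_LOAD;
    }
    if (path_has_prefix(path, "/Library/Extensions/")) {
        return KEXT_REJECT;
    }
    return KEXT_LOAD_WITH_WARNING;
}
```

Under this model an unsigned kext in /tmp/ or /System/Library/Extensions/ loads with a warning, while the same kext in /Library/Extensions/ is rejected unless its identifier is on the exception list.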
@wowfunhappy actually, here's a good workaround. Invalid signature is only warning for kext in /SLE, so you can inform user to manually update the whitelist plist and then place in /LE. This seems like a clean option, better than installing the kext in /SLE directly I guess.
@krackers Yeah, so I debated which option was the least hacky:
/System/Library/Extensions.
/Library/Extensions and adding a new bundle identifier to AppleKextExcludeList.kext.
/Library/Extensions and usurping a pre-existing bundle identifier inside AppleKextExcludeList.kext.
Yesterday, I chose option 4, because I don't think anything with the bundle identifier foo is in widespread use. (The list must have been automatically generated somehow, as it also whitelists a bunch of Hackintosh kexts.)
However, I'm actually going with a fifth option that didn't originally occur to me: store the kext in /Library/Extensions/, but have a launchdaemon copy it to /tmp/ before loading. It turns out that non-IOKit kexts in /Library/Extensions/ aren't autoloaded anyway, so a launchdaemon is needed regardless.
Thanks for the style notes. I think I'm going to leave it be for now; there's not a lot of code and it feels naturally procedural, so I actually prefer having it all in one place.
Anyway, unless I discover a new issue I'm going to try to close the book on this for now (partly because I need to mentally focus on other work).
@blueboxd As a quick recap, we've created a kext that fixes this problem on 10.7.5 (all security updates applied), 10.8.5 (12F2560), and 10.9.5 (13F1911) by modifying the procedure in kernel memory to return EBADF instead of panicking. (Other releases are not currently supported.)
If I don't discover any new issues in the next week or so, I'm going to add this to my Preference Pane, so it gets installed automatically for anyone who uses that. I don't know if it makes sense to incorporate this into the Chromium Legacy app directly, but you are of course welcome to it.
@Wowfunhappy Wow!! thank you for the great work!! I'll try to find a way to integrate this mitigation into Chromium.app.
Quick note that the kext doesn't work in 32-bit Lion. I briefly looked into adding support today, but from what I can tell, my method of finding the KASLR slide doesn't work, and I don't want to rewrite the entire thing.
@Wowfunhappy Wouldn't you have to rewrite (or at least recompile) anyway because in 32-bit mode the mach-o magic is different and header struct layout probably also differs? The approach of scanning memory to look for 0xfeedface should still work though...
Btw I was surprised that 64-bit chromium worked on 32-bit xnu, but then I remember I watched a presentation on how 32-bit xnu had support for running 64-bit userspace processes, so maybe less surprising. I don't remember the presentation though, I'll try to find it since I forget exactly how they supported this.
Edit: Found it, it was a CCC talk: https://www.youtube.com/watch?v=-7GMHB3Plc8 Also found this article: https://appleinsider.com/articles/08/10/28/road_to_mac_os_x_snow_leopard_64_bit_to_the_kernel
@krackers ...y'know what, that might be all it is, I forgot I was literally searching for a header called "MH_MAGIC_64" 🤦♂️. I'll take another look.
Recompiling isn't a problem, I can just lipo together two different binaries that technically have different code. (It does make the build cycle a bit more annoying unless I also take the time to automate that.)
The kext has been updated with compatibility for 32-bit Lion.
https://gist.github.com/Wowfunhappy/8212f5bea4c601ac9a6297789f232321
(The strategy for finding the kernel base address works fine in 32 bit, but is pointless because Lion lacks KASLR.)
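For reference, the magic-number scan discussed above can be sketched in user space as follows. This is not the kext's code: `find_macho_magic` is a hypothetical helper that walks an ordinary buffer rather than kernel memory, and the step size is left to the caller. The two magic values are the standard Mach-O header constants; 32-bit images start with `MH_MAGIC` and 64-bit images with `MH_MAGIC_64`, which is why searching only for the 64-bit value finds nothing on a 32-bit kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#ifndef MH_MAGIC
#define MH_MAGIC    0xfeedfaceu /* 32-bit Mach-O header magic */
#endif
#ifndef MH_MAGIC_64
#define MH_MAGIC_64 0xfeedfacfu /* 64-bit Mach-O header magic */
#endif

/* Returns the offset of the first step-aligned position in buf whose first
 * four bytes equal magic (in host byte order), or -1 if there is none. */
static ptrdiff_t find_macho_magic(const uint8_t *buf, size_t len,
                                  size_t step, uint32_t magic) {
    assert(step > 0);
    for (size_t off = 0; off + sizeof(uint32_t) <= len; off += step) {
        uint32_t word;
        memcpy(&word, buf + off, sizeof word); /* avoids unaligned reads */
        if (word == magic) {
            return (ptrdiff_t)off;
        }
    }
    return -1;
}
```

A universal build would pick the constant that matches the architecture it was compiled for, which fits the approach of combining two slightly different binaries with lipo.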
@Wowfunhappy
If I don't discover any new issues in the next week or so, I'm going to add this to my Preference Pane, so it gets installed automatically for anyone who uses that. I don't know if it makes sense to incorporate this into the Chromium Legacy app directly, but you are of course welcome to it.
Please don't. Kernel patches should never be installed without the user's express permission. Patching the kernel, either in an application, or in something which represents itself as an application updater, is a really extreme violation of the principle of least surprise.
While it's true that there's a kernel bug at work here, as a practical matter, coming up with a userspace workaround would be much more user-friendly, given that:
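Given item 3, the kernel patch on test systems could be useful in identifying the combination of syscalls that provokes the bug. And assuming that the offending code doesn't provide any new capability, there must be something in Chromium which changed from an "old way" to a "new way" of doing something. Once that difference can be figured out, the next questions are:
How much of an advantage is there in the "new way"? I.e., how important is it to use the "new way" when it doesn't cause trouble, rather than reverting unconditionally?
How wide is the scope of the difference? I.e., if there were a patch to revert to the "old way", how maintainable would such a patch be?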
It would certainly be useful to also have a kernel fix for users who want a more robust kernel and are willing to install a kernel update, but it would be best to avoid making that a requirement for avoiding this bug. And a "proper" kernel update should involve building from source, rather than applying some kludgy on-the-fly binary patch. Some work would be needed to identify the correct sources to use as a starting point, though fortunately that needs to be done only once for any given abandoned OS version.
@fhgwright
Please don't. Kernel patches should never be installed without the user's express permission. Patching the kernel, either in an application, or in something which represents itself as an application updater, is a really extreme violation of the principle of least surprise.
The preference pane does ask for permission!
The installation is "automatic" in the sense that, if the user says yes, it's done in one click.
Let me know if you think the above message isn't clear enough. I am trying to be respectful of the user. However, I don't think leaving users with kernel panics is particularly respectful either!
While it's true that there's a kernel bug at work here, as a practical matter, coming up with a userspace workaround would be much more user-friendly
I 100% agree! At the moment, however, I don't think anyone working on Chromium Legacy has the time or expertise to come up with a userspace fix. This is the best that krackers and I can do. If you have the expertise to fix the problem in userspace, please contribute, that would be wonderful!
there must be something in Chromium which changed from an "old way" to a "new way" of doing something.
My current best guess (which could definitely still be wrong) is that it's https://bugs.chromium.org/p/chromium/issues/detail?id=932175
How much of an advantage is there in the "new way"? I.e., how important is it to use the "new way" when it doesn't cause trouble, rather than reverting unconditionally?
It sounds like it's not such a huge advantage. Most users are not exhausting file descriptors. However...
How wide is the scope of the difference? I.e., if there were a patch to revert to the "old way", how maintainable would such a patch be?
Wide enough that Bluebox wasn't able to do it. https://github.com/blueboxd/chromium-legacy/issues/44#issuecomment-1086776672
And a "proper" kernel update should involve building from source, rather than applying some kludgy on-the-fly binary patch.
Already done!
https://github.com/blueboxd/chromium-legacy/issues/44#issuecomment-1019278490
This kernel is for 10.9.5, but you could easily apply the fix to whichever kernel you'd like. It's literally a three line change.
The problem is that the open source release of XNU is incomplete. Most notably, custom kernels are unable to log into iMessage. This is why we created a kext to patch the memory at runtime instead.
@fhgwright
Please don't. Kernel patches should never be installed without the user's express permission. Patching the kernel, either in an application, or in something which represents itself as an application updater, is a really extreme violation of the principle of least surprise.
The preference pane does ask for permission!
The installation is "automatic" in the sense that, if the user says yes, it's done in one click.
Let me know if you think the above message isn't clear enough. I am trying to be respectful of the user. However, I don't think leaving users with kernel panics is particularly respectful either!
I've never seen that dialog box (except here), nor is the kext present here, so I assumed that you hadn't yet implemented the preference-pane hack. But maybe the preference pane doesn't update itself.
It should let you know whether the patch is already present, and offer an uninstall option as well. But both goals could be mostly served by giving the full path of the extension in the text.
While it's true that there's a kernel bug at work here, as a practical matter, coming up with a userspace workaround would be much more user-friendly
I 100% agree! At the moment, however, I don't think anyone working on Chromium Legacy has the time or expertise to come up with a userspace fix. This is the best that krackers and I can do. If you have the expertise to fix the problem in userspace, please contribute, that would be wonderful!
there must be something in Chromium which changed from an "old way" to a "new way" of doing something.
My current best guess (which could definitely still be wrong) is that it's https://bugs.chromium.org/p/chromium/issues/detail?id=932175
How much of an advantage is there in the "new way"? I.e., how important is it to use the "new way" when it doesn't cause trouble, rather than reverting unconditionally?
It sounds like it's not such a huge advantage. Most users are not exhausting file descriptors. However...
How wide is the scope of the difference? I.e., if there were a patch to revert to the "old way", how maintainable would such a patch be?
Wide enough that Bluebox wasn't able to do it. #44 (comment)
But if it's true that the change involved moving away from using FDs for signaling due to FD exhaustion issues, then given that Chrome also runs on Linux, the "old way" must still be present (and fully supported) in the Linux build.
I doubt that I'd have trouble with FD exhaustion here anyway, since I crank up the limit for other reasons, and I'm not the sort to keep hundreds of browser tabs open simultaneously.
And a "proper" kernel update should involve building from source, rather than applying some kludgy on-the-fly binary patch.
Already done!
But that didn't match the release kernel, and the discrepancy was never investigated.
This kernel is for 10.9.5, but you could easily apply the fix to whichever kernel you'd like. It's literally a three line change.
The problem is that the open source release of XNU is incomplete. Most notably, custom kernels are unable to log into iMessage. This is why we created a kext to patch the memory at runtime instead.
Until a matching kernel can be built from source, I wouldn't blame the iMessage issue on custom kernels per se.
I've never seen that dialog box (except here), nor is the kext present here, so I assumed that you hadn't yet implemented the preference-pane hack. But maybe the preference pane doesn't update itself.
Correct, it does not update itself.
It should let you know whether the patch is already present, and offer an uninstall option as well.
While this does not currently happen, uninstalling the PrefPane (via the provided uninstall script) will also uninstall the kext. In addition, the kext is installed in /Library/Extensions/, the standard location for third-party kernel extensions.
But that didn't match the release kernel, and the discrepancy was never investigated.
We did investigate. It does not match the release kernel because the release kernel is not open source. Which is why we went with a kernel extension instead.
Until a matching kernel can be built from source, I wouldn't blame the iMessage issue on custom kernels per se.
The open source version of XNU is explicitly missing a function required by iMessage. This was actually documented by the Hackintosh community, I just wasn't aware previously.
given that Chrome also runs on Linux, the "old way" must still be present (and fully supported) in the Linux build.
Yes, but the Linux build uses Linux-specific code that won't work on XNU, so it's harder than it seems. But again, by all means, please give it a shot. I don't know if it will work, but it might, and I'd love to see this fixed within Chromium Legacy itself. It's just beyond my own ability.
Not sure if this will be valuable for you, but I reproduced this on 10.9.4, and here are the logs. I had left https://piped.mha.fi open, intending to type something but not having decided what, so there had been no interaction with the machine yet when my 2012 MacBook Pro kernel panicked.
If it's not helpful please let me know and I will remove the comment.
Kernel_2023-01-14-184339_Hoangs-MacBook-Pro.panic.log
Anonymous UUID: 58645209-3F22-8B42-01B7-0C62519750C7
Sat Jan 14 18:43:39 2023
panic(cpu 0 caller 0xffffff80071c6bb4): "kqueue_scan_continue: - invalid wait_result (3)"@/SourceCache/xnu/xnu-2422.110.17/bsd/kern/kern_event.c:2167
Backtrace (CPU 0), Frame : Return Address
0xffffff81f2053ef0 : 0xffffff8006e22f79
0xffffff81f2053f70 : 0xffffff80071c6bb4
0xffffff81f2053fb0 : 0xffffff8006ed7417
BSD process name corresponding to current thread: Chromium Helper
Boot args: kext-dev-mode=1
Mac OS version:
13E28
Kernel version:
Darwin Kernel Version 13.3.0: Tue Jun 3 21:27:35 PDT 2014; root:xnu-2422.110.17~1/RELEASE_X86_64
Kernel UUID: BBFADD17-672B-35A2-9B7F-E4B12213E4B8
Kernel slide: 0x0000000006c00000
Kernel text base: 0xffffff8006e00000
System model name: MacBookPro9,1 (Mac-4B7AC7E43945597E)
System uptime in nanoseconds: 4066641044880
last loaded kext at 3901636049533: com.apple.driver.AppleIntelMCEReporter 104 (addr 0xffffff7f8924c000, size 49152)
last unloaded kext at 3966662167467: com.apple.driver.AppleIntelMCEReporter 104 (addr 0xffffff7f8924c000, size 32768)
loaded kexts:
...
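A note for anyone reading panic logs like the one above: the addresses in the backtrace include the KASLR slide, so it has to be subtracted before they can be compared against a static copy of the kernel. A minimal sketch of the arithmetic, using the numbers from this log (`unslide` is a hypothetical helper name):

```c
#include <assert.h>
#include <stdint.h>

/* Converts a runtime kernel address from a panic log into the address it
 * would have without the KASLR slide reported in the same log. */
static uint64_t unslide(uint64_t runtime_addr, uint64_t kernel_slide) {
    assert(runtime_addr >= kernel_slide);
    return runtime_addr - kernel_slide;
}
```

With the slide 0x6c00000 from this log, the kernel text base 0xffffff8006e00000 maps to 0xffffff8000200000, and the panic caller 0xffffff80071c6bb4 maps to 0xffffff80005c6bb4.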
Just to confirm, is this with Wowfunhappy's kext installed? That should hot-patch the kernel to avoid these panics (instead it will "merely" crash the Chromium renderer rather than bringing down your entire system).
Also fyi 10.9.4 is not the latest version of mavericks, you should probably be on 10.9.5.
Just to confirm, is this with Wowfunhappy's kext installed? That should hot-patch the kernel to avoid these panics
Not if they're on 10.9.4 it won't! I'd need to add memory addresses for that kernel.
Is there a reason this system is on 10.9.4 instead of 10.9.5? If there's something "special" about 10.9.4 in particular, I could add support for it.
I was on 10.9.4 because I never realised that it wasn't the latest Mavericks 😂😂😂
I kind of skimmed through the end of the issue and didn't spot the workaround. I'll update to 10.9.5 and try it, and will report back after some testing.
I was on 10.9.4 because I never realised that it wasn't the latest Mavericks 😂😂😂 I kind of skimmed through the end of the issue and didn't spot the workaround. I'll update to 10.9.5 and try it, and will report back after some testing.
Quick heads up that Apple also released further updates after the base 10.9.5 which don't change the version number, you want 10.9.5 build 13F1911. I'm pretty sure the updater will prompt you, but if not, install: https://support.apple.com/kb/dl1886?locale=en_US
Yep, on 13F1911, installed the kext through the preference pane, will see how it works over the week.
Thanks!
I've been avoiding reporting this, because I feel like it's going to be impossible to fix. :(
Occasionally while using Chromium Legacy, OS X kernel panics with a distinctive backtrace. The backtrace implicates Chromium Helper, and I've never seen the panic occur when I'm not using Chromium. I've observed this on my desktop and laptop running 10.9, as have others on MacRumors. I think it may happen more often on Macs with less memory?
A log is attached. Notably, I have so far never managed to capture a panic while I had keepsyms enabled; I'd like to capture one at some point.
Kernel_2022-01-11-182654_Jonathans-MacBook-Air.panic.zip
Everything beyond this point is speculation, and probably useless, due to my limited understanding of C. But, I'm going to include it anyway. :)
I'm technically running XNU version 2422.115.15, which doesn't appear to be open source, but the closest version with source available is 2422.115.4. The file being referenced in the log is here: https://opensource.apple.com/source/xnu/xnu-2422.115.4/bsd/kern/kern_event.c.auto.html
The line numbers don't match up, possibly because of the tiny version mismatch, but I assume this is the referenced function:
So, the switch statement expects the wait_result parameter to be either THREAD_AWAKENED, THREAD_TIMED_OUT, or THREAD_INTERRUPTED. If it's none of these, as appears to be the case for us, it panics with our error, "invalid wait result".
So, what is being passed in as wait_result? Well, it's printing as the number "3", but I don't know enough C to understand how these integers are mapped to their human-readable names. However, note that a fourth option has been added to this switch statement in newer versions of XNU: https://opensource.apple.com/source/xnu/xnu-7195.81.3/bsd/kern/kern_event.c.auto.html
I'm going to hazard a guess that when the kernel panics, kqueue_scan_continue is getting called with a value of THREAD_RESTART in the wait_result parameter, which is understood on new versions of XNU but not the older ones we're using. Is there any way to catch and handle this?
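On the question of how the integers map to names: in XNU's osfmk/kern/kern_types.h the wait results are plain `#define`s, and `THREAD_RESTART` is 3, which matches the "(3)" in the panic string. The sketch below is a user-space model of the switch, not the kernel function. `model_wait_result_error` and `WOULD_PANIC` are hypothetical names, and the error codes for the awakened, timed-out and interrupted cases are assumptions from memory that should be checked against the linked kern_event.c. Returning `EBADF` for `THREAD_RESTART` is the behaviour the kext in this thread patches in.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Wait results, with the values used by XNU's osfmk/kern/kern_types.h. */
enum {
    THREAD_AWAKENED    = 0,
    THREAD_TIMED_OUT   = 1,
    THREAD_INTERRUPTED = 2,
    THREAD_RESTART     = 3
};

/* Hypothetical sentinel: the real kernel calls panic() at this point. */
#define WOULD_PANIC (-1)

/* Models which error the continuation ends up with for a given wait result.
 * handle_restart selects between the old behaviour (THREAD_RESTART falls
 * into the panic path) and the newer one (it is reported as EBADF). */
static int model_wait_result_error(int wait_result, bool handle_restart) {
    switch (wait_result) {
    case THREAD_AWAKENED:
        return 0;            /* assumption: the kernel goes on to process events */
    case THREAD_TIMED_OUT:
        return EWOULDBLOCK;  /* assumption: verify against kern_event.c */
    case THREAD_INTERRUPTED:
        return EINTR;        /* assumption: verify against kern_event.c */
    case THREAD_RESTART:
        return handle_restart ? EBADF : WOULD_PANIC;
    default:
        return WOULD_PANIC;
    }
}
```

With `handle_restart` false, a wait result of 3 reaches the panic path, which is the failure reported here. With it true, the caller sees "Bad file descriptor" instead.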