Wowfunhappy opened 2 years ago
When I've triggered this (by chance), I seem to always be switching windows, either through hot corners invoking Mission Control or by clicking the Chromium Dock icon to switch windows... or maybe switching tabs. FWIW, I've also noticed that sometimes when switching tabs, the pointer does not take focus of the new tab until a subsequent click in the tab being switched to. It seems to occur after the browser has been running for a good amount of time.
@Wowfunhappy Mapping is given in https://opensource.apple.com/source/xnu/xnu-4570.61.1/osfmk/kern/kern_types.h
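For reference, the relevant wait_result_t values defined in that header look roughly like this (quoted from memory, so the comments may differ slightly between XNU versions):
#define THREAD_WAITING      -1  /* thread is waiting */
#define THREAD_AWAKENED      0  /* normal wakeup */
#define THREAD_TIMED_OUT     1  /* timeout expired */
#define THREAD_INTERRUPTED   2  /* aborted / interrupted */
#define THREAD_RESTART       3  /* restart operation entirely */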
You're right in that 3 corresponds to THREAD_RESTART. Not too familiar with kqueue internals, but just based on statically analyzing the call hierarchy, it seems to be: kevent_internal -> kqueue_scan -> kqueue_scan_continue
It's really interesting that the kevent syscall was only recently updated to return EBADF though. Seems like this is returned whenever the fd becomes invalid (e.g. if it's closed), which would definitely explain why it's sporadic and hard to reproduce. I wonder if this bug was just never noticed on older versions of OS X because nothing other than Chrome seems to make such heavy use of IPC, and older versions of Chromium probably used some other method for synchronization/IPC [1]?
There are only a few places in the Chromium source that make the kevent syscall (search for "kevent" and "#include <sys/event.h>"). Might be interesting to add some debugging code around those to see if there's any pattern.
[1] https://source.chromium.org/chromium/chromium/src/+/2096391cdcc3bdf963e68977e9015aa8cdbe85c6
https://source.chromium.org/chromium/chromium/src/+/c96c632d5501d5ebba14e1130fdea8b15ac29756
Thanks! I wonder if this was the underlying project which introduced the problem. The timing lines up, 9–12 months after 10.9 support was dropped from Chrome.
Aside: I assume it's not possible to just replace kqueue_scan_continue with a version that handles THREAD_RESTART, right? Because it's an internal kernel function, or something like that?
@Wowfunhappy Excellent question. It is possible if you are willing to install a kext. Basically since all the kexts will live in the same address space as the kernel code, you can hot-patch the kernel by modifying the instructions in memory.
Now to actually do this patch you'll need to be a bit clever: find the offset of kqueue_scan_continue relative to the kernel base (by locating it in the disassembled kernel binary). This will be our magic number that is the offset to patch from base. Conceptually not too difficult, and not that different from code injection in userspace. The only difference is that the stakes are higher. If you have a VM and are willing to play around with it, I can give you some code pointers. Of course at this point it might just be easier to recompile the kernel itself since we have the source anyway...
Thanks, makes sense it would need a kext. Probably not ideal if it can be helped.
(Just for my own use, I am a bit tempted to recompile XNU. But I run into this panic so rarely anyway... Some people on MacRumors were saying it happens daily for them.)
MacBiter over on MacRumors captured a backtrace with keepsyms enabled.
panic(cpu 6 caller 0xffffff80055c7ab4): "kqueue_scan_continue: - invalid wait_result (3)"@/SourceCache/xnu/xnu-2422.115.15/bsd/kern/kern_event.c:2167
Backtrace (CPU 6), Frame : Return Address
0xffffff81f05a3ef0 : 0xffffff8005223139 mach_kernel : _panic + 0xc9
0xffffff81f05a3f70 : 0xffffff80055c7ab4 mach_kernel : _kqueue_scan + 0x8f4
0xffffff81f05a3fb0 : 0xffffff80052d7c67 mach_kernel : _call_continuation + 0x17
I assume this is _call_continuation? Still an internal kernel function, I naively thought it might give us Chromium's initial syscall.
Edit: And I think this is the definition... I wasn't expecting to find ASM...
Edit 2: I don't actually understand most of what it's saying, but pages 418–420 of Jonathan Levin's Mac OS X and iOS Internals may be relevant here. I think it's saying that wait_result is set by the thread_block_reason() function.
I'm just trying to figure out where that wait_result value could be coming from. I find it so weird that XNU is returning a value that didn't exist yet. It would make sense if Chromium was the one setting wait_result, since Google is only targeting modern kernels. But this is kernel code calling kernel code. I guess the obvious answer is that it's a bug, but it seems like something that wouldn't have gone unnoticed for years.
...well, for what it's worth, I decided to build a custom kernel. As it turns out, compiling XNU is shockingly easy and the build itself takes only a few minutes to complete. The only tricky part was that I had to install Xcode 5.0.2; later versions did not appear to work.
The only source change is that I added the THREAD_RESTART case to kqueue_scan_continue, which sets error = EBADF as in newer versions of XNU.
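For reference, the added case amounts to something like this (paraphrasing the corresponding switch in newer XNU's kqueue_scan_continue; the exact surrounding code differs between versions):
case THREAD_RESTART:
	/* the kqueue became invalid while we were blocked waiting on it */
	error = EBADF;
	break;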
I am (a bit nervously) going to try using this on my main machine. A half hour in, nothing catastrophic has happened so far.
The kernel is attached in case anyone else on Mavericks wants to try it, but I am not necessarily recommending that anyone do so. If you do use it, make sure at minimum to have a secondary partition available to restore your original kernel should the need arise. Remember that you will need to clear the kernel cache on your first boot with the new kernel.
@krackers Do you have any idea why this kernel would be only 8,257,152 bytes, when the original kernel and every custom Hackintosh kernel I've checked is 8,394,688 bytes? It's weirding me out.
@Wowfunhappy sorry for the late response.
I'm just trying to figure out where that wait_result value could be coming from. I find it so weird that XNU is returning a value that didn't exist yet
Note that even in the older XNU version, THREAD_RESTART was still a valid defined value for wait_result_t. In fact, you can see that it is returned in wait_queue_assert_wait64 of osfmk/kern/wait_queue.c when the queue becomes invalid. The difference is that in this older kernel version, kqueue_scan did not properly handle this, and just panicked instead of returning EBADF.
I agree that this seems way too obvious an omission to be a bug. But I can't think of any other explanation at the moment (that said, my knowledge here is very slim). I think the first step might be to see if we can create a scenario that can reliably reproduce this. Based on my current understanding, it seems like the conditions for triggering would look like:
- Have one process call kqueue to create a new fd for the event queue, and then call kevent to wait upon it.
- While that happens, another process closes the fd.

Based on what's described above, this would generate a THREAD_RESTART condition in the wait queue code, which would bubble up to kqueue_scan and cause the panic.
But this seems way too obvious and easy to trigger, so I'm still skeptical that this is actually what's happening.
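A minimal sketch of what such a reproduction attempt might look like (untested, and using two threads within one process rather than two processes, since a kqueue descriptor is only meaningful inside the process that created it):
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int kq;

static void *waiter(void *arg)
{
	struct kevent ev;
	/* Block indefinitely waiting for events on the kqueue. */
	int n = kevent(kq, NULL, 0, &ev, 1, NULL);
	printf("kevent returned %d, errno %d\n", n, errno);
	return NULL;
}

int main(void)
{
	pthread_t t;
	kq = kqueue();
	pthread_create(&t, NULL, waiter, NULL);
	sleep(1);   /* give the waiter time to block inside kevent */
	close(kq);  /* close the kqueue fd out from under the blocked thread */
	pthread_join(t, NULL);
	return 0;
}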
As it turns out, compiling XNU is shockingly easy and the build itself takes only a few minutes to complete
I see you're following the "Kernel Shaman's" build instructions? I'm not quite sure why there would be a difference in size assuming that both are release builds, but it's possible that Apple's internal builds have some other ifdefs set during compile, or the kernel version is different? This wouldn't explain why the Hackintosh kernels match in size though...
That's also why I'm personally partial to the "hotpatch via kext" approach. That way you introduce only the minimal difference necessary.
@krackers I'm not ignoring your suggestions of things to try, they're just mostly outside of my skill level. 🙂 I only barely understand explanations of what a file descriptor is...
Anyway, the plot thickens. I neglected to check something earlier:
That THREAD_RESTART case was only added to XNU in 3789.1.32, aka macOS 10.12 Sierra. But Google Chrome officially supports macOS 10.11. So, why don't Chrome users on El Capitan experience kernel panics?
Could this indicate the problem was introduced by Chromium Legacy? If I'm not missing anything, I think the only Chromium Legacy changes that involve kevent come from this commit: https://github.com/blueboxd/chromium-legacy/commit/e0cff58db7f41e2b0b68e058ec49a98f215265c5
Google's code review comments for this change are interesting...
On the latest canary, Version 100.0.4889.0 (Official Build) (x86_64): two-finger scrolling in a Facebook Messenger chat...
Tue Feb 15 06:40:24 2022 panic(cpu 0 caller 0x55bb8e): "kevent_scan_cont() - invalid wait_result (3)"@/SourceCache/xnu/xnu-1699.32.7/bsd/kern/kern_event.c:1982 Backtrace (CPU 0), Frame : Return Address (4 potential args on stack) 0xe2bf68 : 0x2203de (0x6b08cc 0xe2bf88 0x229fb0 0x0) 0xe2bf98 : 0x55bb8e (0x7006a4 0x3 0x6 0xa4c39d4) 0xe2bfc8 : 0x2c7d0c (0x9f03320 0x3 0x10 0xa4e9190)
BSD process name corresponding to current thread: Chromium Helper
Mac OS version: 11G63
Kernel version: Darwin Kernel Version 11.4.2: Thu Aug 23 16:26:45 PDT 2012; root:xnu-1699.32.7~1/RELEASE_I386 Kernel UUID: 859B45FB-14BB-35ED-B823-08393C63E13B System model name: MacBook4,1 (Mac-F22788A9)
System uptime in nanoseconds: 71096824466362 last loaded kext at 27108605406: com.apple.driver.AudioAUUC 1.59 (addr 0x146b000, size 24576) last unloaded kext at 115835844270: com.apple.driver.USBCameraFirmwareLoader 1.1.0 (addr 0xd35000, size 24576) loaded kexts: com.apple.driver.AudioAUUC 1.59 com.apple.filesystems.autofs 3.0 com.apple.driver.AppleHWSensor 1.9.5d0 com.apple.driver.AppleHDA 2.2.5a5 com.apple.driver.AppleUpstreamUserClient 3.5.9 com.apple.driver.AppleBacklight 170.2.2 com.apple.driver.AppleMCCSControl 1.0.33 com.apple.driver.AppleIntelGMAX3100 7.0.4 com.apple.driver.AppleIntelGMAX3100FB 7.0.4 com.apple.driver.SMCMotionSensor 3.0.2d6 com.apple.iokit.IOUserEthernet 1.0.0d1 com.apple.iokit.IOBluetoothSerialManager 4.0.8f17 com.apple.Dont_Steal_Mac_OS_X 7.0.0 com.apple.driver.AudioIPCDriver 1.2.3 com.apple.driver.ApplePolicyControl 3.1.33 com.apple.driver.ACPI_SMC_PlatformPlugin 5.0.0d8 com.apple.driver.AppleLPC 1.6.0 com.apple.driver.AppleSMCPDRC 5.0.0d8 com.apple.driver.CSRUSBBluetoothHCIController 4.0.8f17 com.apple.AppleFSCompression.AppleFSCompressionTypeDataless 1.0.0d1 com.apple.AppleFSCompression.AppleFSCompressionTypeZlib 1.0.0d1 com.apple.BootCache 33 com.apple.driver.AppleUSBTrackpad 227.6 com.apple.driver.AppleUSBTCKeyEventDriver 227.6 com.apple.driver.AppleUSBTCKeyboard 227.6 com.apple.driver.AppleIRController 312 com.apple.iokit.SCSITaskUserClient 3.2.1 com.apple.driver.XsanFilter 404 com.apple.iokit.IOAHCIBlockStorage 2.1.0 com.apple.driver.AirPortBrcm43224 501.36.15 com.apple.driver.AppleFWOHCI 4.9.0 com.apple.driver.AppleHPET 1.7 com.apple.driver.AppleUSBHub 5.1.0 com.apple.driver.AppleAHCIPort 2.3.1 com.apple.driver.AppleIntelPIIXATA 2.5.1 com.apple.driver.AppleUSBEHCI 5.1.0 com.apple.driver.AppleUSBUHCI 5.1.0 com.apple.driver.AppleEFINVRAM 1.6.1 com.apple.driver.AppleRTC 1.5 com.apple.iokit.AppleYukon2 3.2.2b1 com.apple.driver.AppleSmartBatteryManager 161.0.0 com.apple.driver.AppleACPIButtons 1.5 com.apple.driver.AppleSMBIOS 1.9 com.apple.driver.AppleACPIEC 1.5 com.apple.driver.AppleAPIC 1.6 com.apple.driver.AppleIntelCPUPowerManagementClient 195.0.0 com.apple.nke.applicationfirewall 3.2.30 com.apple.security.quarantine 1.4 com.apple.security.TMSafetyNet 8 com.apple.driver.AppleIntelCPUPowerManagement 195.0.0 com.apple.kext.triggers 1.0 com.apple.driver.DspFuncLib 2.2.5a5 com.apple.driver.AppleBacklightExpert 1.0.4 com.apple.driver.AppleSMBusController 1.0.10d0 com.apple.driver.AppleHDAController 2.2.5a5 com.apple.iokit.IOHDAFamily 2.2.5a5 com.apple.iokit.IOSurface 80.0.2 com.apple.iokit.IOFireWireIP 2.2.5 com.apple.iokit.IOAudioFamily 1.8.6fc18 com.apple.kext.OSvKernDSPLib 1.3 com.apple.driver.AppleGraphicsControl 3.1.33 com.apple.iokit.IONDRVSupport 2.3.4 com.apple.iokit.IOGraphicsFamily 2.3.4 com.apple.iokit.IOSerialFamily 10.0.5 com.apple.driver.AppleSMC 3.1.3d10 com.apple.driver.IOPlatformPluginLegacy 5.0.0d8 com.apple.driver.IOPlatformPluginFamily 5.1.1d6 com.apple.driver.AppleUSBBluetoothHCIController 4.0.8f17 com.apple.iokit.IOBluetoothFamily 4.0.8f17 com.apple.driver.AppleUSBMergeNub 5.1.0 com.apple.iokit.IOUSBHIDDriver 5.0.0 com.apple.driver.AppleUSBComposite 5.0.0 com.apple.iokit.IOSCSIMultimediaCommandsDevice 3.2.1 com.apple.iokit.IOBDStorageFamily 1.7 com.apple.iokit.IODVDStorageFamily 1.7.1 com.apple.iokit.IOCDStorageFamily 1.7.1 com.apple.iokit.IOATAPIProtocolTransport 3.0.0 com.apple.iokit.IOSCSIArchitectureModelFamily 3.2.1 com.apple.iokit.IO80211Family 420.3 com.apple.iokit.IOFireWireFamily 4.4.8 com.apple.iokit.IOUSBUserClient 5.0.0 
com.apple.iokit.IOAHCIFamily 2.0.8 com.apple.iokit.IOATAFamily 2.5.1 com.apple.iokit.IOUSBFamily 5.1.0 com.apple.driver.AppleEFIRuntime 1.6.1 com.apple.iokit.IONetworkingFamily 2.1 com.apple.iokit.IOHIDFamily 1.7.1 com.apple.iokit.IOSMBusFamily 1.1 com.apple.security.sandbox 177.11 com.apple.kext.AppleMatch 1.0.0d1 com.apple.driver.DiskImages 331.7 com.apple.iokit.IOStorageFamily 1.7.2 com.apple.driver.AppleKeyStore 28.18 com.apple.driver.AppleACPIPlatform 1.5 com.apple.iokit.IOPCIFamily 2.7 com.apple.iokit.IOACPIFamily 1.4
But, Google Chrome officially supports macOS 10.11. So, why don't Chrome users on El Capitan experience kernel panics?
Do those (macOS 10.11 Chrome users) actually exist? ;)
Even last April, macOS 10.10 had a much larger share than 10.11.
Those still running 10.11 may be using Safari/Firefox/Opera, or blame their kernel panics on running an old OS version.
(I also suspect many are headless or server systems that haven't been updated - these types of systems probably don't see much interactive web browsing.)
Personally, my MacBook Pro still is running Mavericks, but I seem to remember thinking (for some long-lost rationale) that if I were ever to bump it forward, I'd be stopping at 10.10, 10.12, 10.13, or 10.15.
AFAIK, there aren't many (any?) macs where that is the latest installable version.
Systems tend to max out on 10.4, 10.6.8, 10.7.5, 10.11, 10.13.x, etc...
MacMini3,1s and other similar late 2008/early 2009 Penryn/Nehalem models ... the late 2009 models tend to support up to 10.13.x ... I don't remember exactly, but the demarcation point may have been 'Metal' framework support by the integrated GPU.

Tl;dr: it's plausible that it had been broken all along (i.e. not backported to earlier versions when it was finally discovered / fixed in XNU 3789.1.32).
Do those (macOS 10.11 Chrome users) actually exist? ;)
I mean, anecdotally, I've seen several of them in the wild! I even helped a third grade schoolteacher add the Let's Encrypt certificate to Keychain Access last fall.
This may well be a kernel bug†, but I don't think it's happening in official versions of Chrome. A kernel panic is more difficult to overlook than a missing email, and this one appears to happen to all Chromium Legacy users sooner or later, albeit very infrequently. I don't see how Google, with all of their users and analytics, wouldn't know about this problem if it existed.
It seems much more likely to me that https://github.com/blueboxd/chromium-legacy/commit/e0cff58db7f41e2b0b68e058ec49a98f215265c5 is culpable, somehow...
† Technically, my understanding is that any instance of a userspace application triggering a kernel panic is generally considered a kernel bug, since userspace shouldn't be able to take down the kernel.
Consider:
Also:
Neither one of those fixed kernels will get to the majority of affected users here, so we do have to continue trying to mitigate this kernel bug by rolling back or modifying our userspace code.
Tl;dr: Thank you for trying to hunt it down!
Looking again at the kernel kevent code, it seems there are very few places where THREAD_RESTART is actually signaled. The most likely source for the one received in kqueue_scan seems to be https://github.com/aosm/xnu/blob/os-x-1095/osfmk/kern/wait_queue.c#L1335
/* If it is an invalid wait queue, you cant wait on it */
if (!wait_queue_is_valid(wq))
return (thread->wait_result = THREAD_RESTART);
and in the kevent code, kq->kq_wqs is the wait queue as part of the kq struct. This again makes me think that somehow the kqueue is being closed by some other process and we're not handling that.
@Wowfunhappy can you add some log statements to each of the sites in kern_event.c that call into wait_queue_assert, and then an additional log statement in the place where we would have panicked before? To avoid too much data, guard each log statement by !wait_queue_is_valid(wq). (Let me know if you need clarification here.) Collecting more data might be useful to try to recreate the exact conditions under which this is triggered.
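A sketch of the kind of guard being asked for, where wq is whatever wait queue is about to be passed to the wait_queue_assert_* call (hypothetical, and as it turns out below, wait_queue_is_valid isn't directly visible from kern_event.c without extra work):
if (!wait_queue_is_valid(wq)) {
	printf("kern_event: about to wait on an invalid wait queue\n");
}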
As for userspace workarounds, assuming that hypothesis is correct then what I can think of is to try to ensure that no thread is waiting on a kqueue that is to be closed. So we'd have to send some sort of interrupt signal to break the wait on kqueue before closing the fd? Not too sure since I'm not too familiar with these parts...
As for the reverted commit you linked, at first glance I don't see anything too suspicious there. Why did that commit need to be reverted for 10.9 compatibility though? I don't see anything that's 10.11+ specific.
@Wowfunhappy can you add some log statements to each of the sites in kern_event.c that call into wait_queue_assert, and then an additional log statement in the place where we would have panicked before? To avoid too much data, guard each log statement by !wait_queue_is_valid(wq). (Let me know if you need clarification here.) Collecting more data might be useful to try to recreate the exact conditions under which this is triggered.
I'll try next week!
As for the reverted commit you linked, at first glance I don't see anything too suspicious there.
I don't understand this code, so I'm just looking at the timing of when things were added and the differences between codebases. I'm assuming:
1. The offending code has to make use of the kevent system call.
2. There has to be something different between Chromium Legacy's codebase and mainline Chromium's codebase, which is causing Chromium Legacy to kernel panic on 10.11 and below but not mainline Chromium. (This is based on a separate assumption that Chromium Legacy would kernel panic if used on 10.11, since 10.11 contains the offending XNU code.)
3. The commit I linked contains the only difference I could find between the two codebases which uses kevent.
@Wowfunhappy Assumption 1 isn't necessarily true. If we use the working hypothesis that the kpanic is caused by attempting to use an invalid (freed) kqueue, then you'd also have to search for references for code which uses MessagePump, since when that instance goes out of scope its destructor will free the kqueue fd.
There's also exactly one other place where THREAD_RESTART is signalled, and that's in some semaphore-related method. If the kpanic is triggered by that, then we're really in a pickle, since that would mean the root cause might not even be strictly related to kqueue operations.
@krackers I ran into some stupid trouble guarding the logging statements. wait_queue_is_valid doesn't exist in kern_event.c. I tried to copy the two defines from https://github.com/aosm/xnu/blob/0a13c7a4dcb9ab989727c2b0e08f7861457ed069/osfmk/kern/wait_queue.h#L161, adding:
#define _WAIT_QUEUE_inited 0x2
#define wait_queue_is_valid(wq) \
(((wq)->wq_type & ~1) == _WAIT_QUEUE_inited)
But the compiler complained:
no member named 'wq_type' in 'struct wait_queue'
The wait_queue struct is defined in kern/wait_queue.h, and wq_type is ostensibly public, but adding #include <kern/wait_queue.h> to kern_event.c had no effect. I don't want to copy-paste the struct into kern_event.c because I don't know if defining it in a new place will do something weird.
I'm sure I'm doing something dumb. 🙂
For now, I've attached a kernel with unguarded logging statements before each call of wait_queue_assert in kern_event.c. The source is this revision of: https://gist.github.com/Wowfunhappy/a214411bb52d39bcee9497363a97558d/440e40ec82da87f4240f22981a020e0f169ea915. It works in my VM, but it's noisy enough that I don't want to leave it running on my real system. It seems wait_queue_assert_wait_with_leeway is called from kqueue_scan many times per second.
Also for reference, the guarded version I can't compile: https://gist.github.com/Wowfunhappy/a214411bb52d39bcee9497363a97558d
XNU-2422.115.4-THREAD_RESTART add-JA-v2022.02.21.zip
Edit: Issue was closed by accident, please disregard.
@Wowfunhappy
No you're not doing something wrong, the way their includes work is tricky and took me a good 10 minutes to figure out.
Very roughly, when all the headers are put together, the net result is something like:
#include <stdio.h>
typedef struct wait_queue *wait_queue_t;
#ifndef MACH_KERNEL_PRIVATE
struct wait_queue { unsigned int opaque[2]; int *opaquep[2]; } ; // from kern_types
#else
typedef struct wait_queue { // from wait_queue.h
unsigned long int
/* boolean_t */ wq_type:2, /* only public field */
wq_fifo:1, /* fifo wakeup policy? */
wq_prepost:1, /* waitq supports prepost? set only */
wq_eventmask:2;
int wq_interlock; /* interlock */
int wq_queue; /* queue of elements */
} WaitQueue;
#define wait_queue_is_valid(wq) \
(((wq)->wq_type & ~1) == _WAIT_QUEUE_inited)
#endif
In particular, note that we actually have two different declarations of the wait_queue struct, selected based on whether MACH_KERNEL_PRIVATE is defined. I assume one is intended to be used akin to an opaque type (think "private" in an OO language) to hide internal members, while the other is used only within wait_queue.c. Interestingly, for opaque types you normally just leave the declaration without any fields, but here they explicitly declare some dummy fields of the proper size, which I assume was done so sizeof() would work properly.
This is also why just #include-ing the header didn't work: since the proper #define wasn't set, the private declaration effectively never made it past the preprocessor.
Now let's work around this. You're right in that we don't want to just directly copy-paste the struct since we'll run into duplicate definition error. Instead let's do it in a slightly smarter way where we create our own struct with identical layout and then cast to that to access fields.
typedef struct wait_queue *wait_queue_t;
struct wait_queue { unsigned int opaque[2]; int *opaquep[2]; } ; // from kern_types
#define EVENT_MASK_BITS ((sizeof(long) * 8) - 4)
typedef struct wowfunhappy_wait_queue { // happy little wait queue
unsigned long int
/* boolean_t */ wq_type:2, /* only public field */
wq_fifo:1, /* fifo wakeup policy? */
wq_prepost:1, /* waitq supports prepost? set only */
wq_eventmask:EVENT_MASK_BITS;
} WowFunHappyWaitQueue;
#define _WAIT_QUEUE_inited 0x2
#define wait_queue_is_valid(wq) \
(((wq)->wq_type & ~1) == _WAIT_QUEUE_inited)
void buzz(wait_queue_t wq) {
printf("%d", wait_queue_is_valid((WowFunHappyWaitQueue*) wq));
}
Note two important things. 1) This is technically undefined behavior I think, although if you were to use a union instead of a cast to do the type-punning then it might be standards compliant? But who cares about standards so long as it works (and casting should indeed work). 2) Note that I only included the initial member of the struct here, and left out the other two. This should be fine because IIRC type-punning is safe for any prefix of shared fields as well.
Unfortunately, I have been forced to revert back to a vanilla XNU kernel on my personal machine.
I woke up this morning and found that my Mac had been logged out of iMessage. When I tried to log back in, the app complained of an unknown error. After a lot of reboots, password changes, network changes, and other dead ends, I finally tried reverting back to stock XNU, and iMessage instantly began working again.
If anyone else is using one of the kernels I attached, and iMessage stops working for you, you know how to fix it. And for what it's worth, I experienced zero panics during my month with the custom kernel.
That's interesting. Based on what I found online, it seems the cause is:
Publicly-available XNU sources lack the necessary functions needed for generating the IOPower data (just look at IORootParent::initialize() in the kernel binary vs. what's in the source).
I assume that the hackintosh community has worked around this by just doing patching instead. There seems to be a robust patching framework they've built up (Lilu). Another alternative might have been to create a kext that does populate these values in the IO registry, but I don't know why they didn't go that route. I'm not too familiar with the hackintosh scene, so I don't know how they obtain these values in the first place if they don't have an existing Mac to copy from (based on https://dortania.github.io/OpenCore-Post-Install/universal/iservices.html#fixing-en0 it looks like they generate random SNs??)
I personally don't like the idea of using a monolithic framework to do the patching (and Lilu is bloated since it tries to handle both userspace and kernel patching). Onyx the black cat (https://github.com/gdbinit/onyx-the-black-cat) is a good reference since it contains some useful functions we might want (especially the write-enable on kernel pages). Most of those patches are in-place binary patches though, whereas for a slightly more involved patch like this I'd usually want to insert a trampoline.
I tried to find any existing examples of kexts which have a trampoline insert that we can steal (there are many userspace ones like mach_override that we can port if really needed since the core logic is the same, but I'm lazy and don't want to think too much). I found
(interestingly, they all seem to be the same-ish file, but don't have any attribution...)
Er, actually doing a trampoline is going to be a bit annoying because we need to ensure the return value is actually used. We can either implement the trampoline'd function in assembly so we don't need to worry about C messing things up, try to fuss around with the proper syntax to ensure that wait_result gets mapped to the function param and the return value gets mapped to the register that contains error, or just forget about doing this properly and patch the default section to return EBADF directly and hope that this doesn't mask any other bug.
In terms of hacker spirit I still like the trampoline though, and if we set up the framework for that it will be easy to insert the logging I mentioned. After going through the candidates I like the approach taken by
since it is the easiest for me to understand and flexible enough to coax into doing what we want.
@Wowfunhappy Either way (whether or not we opt to use a trampoline), let's start off with creating a simple kext that just reads the contents at the kqueue_scan_continue function address, to make sure we are accessing the correct address. https://github.com/lilang-wu/EasyBP/blob/d2d9591417df52b94d21945efb8cea393dc46a9b/makedebugpoint/solve_kernel_symbols/kernel_info.c has a nifty solve_kernel_symbol function, but I don't think we can use it because kqueue_scan_continue is not an exported function. Instead, we need to find the corresponding block in the disassembled binary to get the relative address (relative to the kernel base) and then add the KASLR offset to it to get the absolute address in memory.
Once you have the relative address, can you create a basic "hello world" kext (https://craftware.xyz/tips/Building-kernel-extension.html) that uses the kaslr_slide function from https://github.com/lilang-wu/EasyBP/blob/d2d9591417df52b94d21945efb8cea393dc46a9b/makedebugpoint/solve_kernel_symbols/kernel_info.c to dump out some bytes from the absolute address, to see if they match the assembly we expect? Once we have that, then we can take the next steps...
The Hackintosh community has some good reasons for avoiding custom kernels. Apple releases source code on a delay, and if you're running a non-ancient version of macOS, every update will potentially overwrite your custom kernel with a stock one, and leave you with a non-bootable system if you aren't careful. The more recent OSS releases also leave out really critical things like sleep and power management.
I'd thought that on an EOL OS, there was a certain purity to just fixing an obvious bug in an open-source component and recompiling—after all, what good is code that's never built? But it was never a great solution anyway, and it appears that even Apple's older code releases aren't as complete as I thought.
I am a Hackintosh user (on one of my two machines) and I'm already using Lilu, so it might make sense for me to go that route; it mostly depends on what I can actually manage to do. Downside: AIUI, Lilu has to be injected via a Hackintosh bootloader, because it needs to be loaded before the kexts it's patching. (Although, wait a minute, we're not trying to patch a kext, we're trying to patch the kernel—will this work at all? I know you said it's all the same address space, but then why does the loading order matter...?)
But, I also need to step back for a moment—I'm not sure what we're trying to accomplish here. Even without Lilu, I wouldn't feel comfortable recommending a patch like this to most Chromium Legacy users. The custom kernel had the advantage of being easy, I was done in a couple hours.
How useful do we think this logging data will be? Do we actually have a chance of fixing this properly in userspace?
Edit: I started writing this before your more recent replies/edits @krackers. Okay, if I attempt to keep going with this I'll follow from your starting point, thank you! I would still like to step back for a minute and consider whether this a smart avenue to keep exploring—what are we hoping to find out?
@Wowfunhappy
Yes, Lilu will work since it has support for patching kernel functions as well. But I don't like it here, because I don't think using Lilu is going to give us any strong benefit that we couldn't get ourselves by copying a few header files for useful functions. Lilu only takes care of boilerplate; we'd still have to bring the core logic, and we'd also have to learn the Lilu API, so the benefit of having the boilerplate taken care of is negated. I think it's much simpler and more understandable to just make a vanilla kext?
But, I also need to step back for a moment—I'm not sure what we're trying to accomplish here. Even without Lilu, I wouldn't feel comfortable recommending a patch like this to most Chromium Legacy users.
The end result will be a kext that users can just place in /Library/Extensions or wherever to hotpatch the kernel on the fly. It will only work for users with the same kernel version you based the offsets on, though; people on other kernel versions will have to recompile the kext with their own offset. It's also fun for learning, but yes, it is a bit more work than just recompiling XNU. It's up to you whether the pain of the occasional kernel panic is worth spending a few weekends trying to create a kext. Happy to provide guidance along the way. I edited my previous response https://github.com/blueboxd/chromium-legacy/issues/44#issuecomment-1048316473 with some concrete starting steps if you've decided to take this route.
How useful do we think this logging data will be? Do we actually have a chance of fixing this properly in user-space?
To be honest, I don't know. If we can determine for sure that it is caused by a freed kqueue fd, then we need to take the next step and figure out why the fd is being freed. This would involve adding some logging code in Chromium (maybe in the destructor for MessagePump?). If we verify that is indeed the cause, then the next step would be to create some workaround where we only free the kqueue after ensuring no thread is blocked waiting on it. We could try to make that change right now, but to me it seems like the effort required to figure out how to implement this (we'd have to study exactly what type of event to send to interrupt an existing kqueue wait), combined with the effort to recompile Chromium, is much more than doing the kext.
The other option is to try to find commits involving MessagePump and see if one of them might be responsible, but this is more of a shot in the dark compared to the kernel patch, which we know will work. If you're optimizing for time spent vs. reward, then the kernel patch approach seems the surer bet.
It's also possible that Chromium is calling into a system framework/API which then makes use of kqueue in this manner. Can we run Chrome under gdb or with dtrace to figure out where the kevent syscall is being made?
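For instance, something along these lines (a sketch; the exact probe and process names may need adjusting) should print a user stack every time a Chromium Helper process enters a kevent-family syscall:
sudo dtrace -n 'syscall::kevent*:entry /execname == "Chromium Helper"/ { ustack(); }'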
(Btw, @wowfunhappy, I noticed many of your plugins are done via the insert_dylib approach; you should consider packaging them as a SIMBL plugin instead, since it's much easier to enable/disable and you don't break signatures.)
@krackers I'm going to continue the tangential discussion a bit here to try to leave this issue as focused as possible.
It's also possible that chromium is calling into a system framework/api which then makes use of kqueue in this manner.
The crash log is pretty explicit though:
BSD process name corresponding to current thread: Chromium Helper
If it was a system framework, wouldn't that be showing up instead?
I also took a quick look in Instruments, and can confirm that Chromium Helper is certainly making BSC_kevent64 syscalls. Whether these are the syscalls that sometimes lead to a panic, I have no idea.
Edit: All coming from one place:
If it was a system framework, wouldn't that be showing up instead
I think even if it calls into a system framework, it's still part of the same Chrome Helper process?
But yes, based on your trace it seems like it is indeed Chromium's message pump. Can you try recompiling Chromium and adding some logging in the destructor for MessagePumpKqueue (assuming that is the class which holds the kqueue reference)? I assume that it's one message pump per Chrome Helper process, which means it might not be too noisy to run it and see if object destruction is related to the kernel panic trigger.
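Something along these lines is what I mean (hypothetical sketch; the class lives in base/message_loop/message_pump_kqueue.cc in recent Chromium trees, and the member holding the descriptor may be named differently):
MessagePumpKqueue::~MessagePumpKqueue() {
  // Log destruction so it can be correlated with the kernel's averted-panic message.
  LOG(ERROR) << "MessagePumpKqueue destroyed, kqueue fd = " << kqueue_.get();
}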
I'll try if I can! It may be a while, grad school is picking up so I'm trying (key word) to spend less time screwing around with computers.
Bluebox, if you happen to read this, (I assume you're busy with other life stuff), you may want to add this logging yourself since you have the hardware to compile Chromium easily.
I'm not sure why (although my best guess is it's something in the latest stable version of Chromium), but I've encountered many, many more kernel panics than usual over the past few days. This is good in some ways, because it means I may be able to capture useful data more easily.
I decided to switch back to the custom kernel for a time, now with the logging messages. @krackers, this is the code I ended up with: https://gist.github.com/Wowfunhappy/a214411bb52d39bcee9497363a97558d#file-kern_event-c-L112. I hope this was enough, including the other defines led to compile errors (which I didn't capture—sorry, I was actually trying to finish this before I have to leave for work).
In the time I've spent writing this comment, I actually may have already captured an averted panic! In the console I have:
3/4/22 2:39:26.000 PM kernel[0]: kqueue_scan_continue was called with a wait_result of THREAD_RESTART. A vanilla 2422 XNU kernel would have panicked!
But, bad news: that's the only log I have! So either the wait_queue_is_valid check is always returning true (either because I coded it wrong or due to something else), or the call order is something completely different!
Anyway, just wanted to share, I'll look into this more later, gotta go teach Scratch programming to a bunch of children... 💨
@Wowfunhappy
Your code seems fine to me, so it seems like whatever is causing THREAD_RESTART to be signaled is coming via a different code path. Just to be sure we didn't miss anything, can you add some more logging in all of the
if (!wait_queue_is_valid(wq))
return (thread->wait_result = THREAD_RESTART);
lines in https://github.com/aosm/xnu/blob/os-x-1095/osfmk/kern/wait_queue.c#L1335?
I.e. like you did before
if (!wait_queue_is_valid(wq)) {
printf("Func name + error message");
return (thread->wait_result = THREAD_RESTART);
}
Thanks for confirming the code looked fine! I've added that additional logging, see the new file in the gist: https://gist.github.com/Wowfunhappy/a214411bb52d39bcee9497363a97558d
Now we just need to wait for a log message to show up again...
The panic-averted log appeared once again:
3/4/22 7:45:05.000 PM kernel[0]: kqueue_scan_continue was called with a wait_result of THREAD_RESTART. A vanilla 2422 XNU kernel would have panicked!
Once again, it was not accompanied by any of the other messages I added.
Hm well there goes that theory. Seems like it's not directly caused by a kqueue event then, it just happens when it's in the kqueue wait loop.
The three other places THREAD_RESTART seems to be signalled are in vm_fault.c, sync_sema.c, and ipc_mqueue.c.
Let's try the last one: can you add a printf before the wait_queue_wakeup64_all_locked calls in osfmk/ipc/ipc_mqueue.c? I don't know how noisy this will be though.
We can also try to determine this from the other direction: are there any plausible candidates for commits that would have caused these panics to happen more frequently? Anything related to Mach IPC or MessagePump would be a decent candidate.
Worst case I guess you could binary search for candidate commits, try 2 builds a day and you'd probably arrive at the change within a week or two.
Let's try the last one: can you add a printf before the wait_queue_wakeup64_all_locked calls in osfmk/ipc/ipc_mqueue.c? I don't know how noisy this will be though.
Sure, done, but they're very noisy, running many times per second. I'll give this a try but I'm not sure how much useful information it will give us. And since logs roll off after 4,000 messages I'm not sure I'll be able to catch the critical THREAD_RESTART one...
Yeah, that's what I feared. OK, I think we have 2 options going forward then:
1) Try to find the commit to Chromium that caused the increased rate of kpanics. I tried poking around for changes related to ipc, but didn't find any smoking gun. Bisecting the way to the commit is probably easiest then.
2) The other option is to patch this on the kernel side via the kext approach mentioned before. I think we can just take a shortcut here and change the default condition to just return EBADF instead of panicking (as opposed to the "right" approach of inserting a new jump for a new case condition).
Try to find the commit to Chromium that caused the increased rate of kpanics. I tried poking around for changes related to ipc, but didn't find any smoking gun. Bisecting the way to the commit is probably easiest then.
Thing is, the frequency is still erratic. Case in point, I haven't received that log message again since Friday. I don't actually know that the change was in Chromium as opposed to something in my usage pattern...
I found a thing. The more I think about it, the less I'm convinced it's related, but maybe @krackers can take a peek and decide if I may be onto something. :)
Google was running into kernel panics of their own back in 2017. Their kernel backtrace was different than ours, but it touches on some of the same functions: https://bugs.chromium.org/p/chromium/issues/detail?id=756102
Google fixed it here: https://chromium.googlesource.com/chromium/src.git/+/5cd8eef498e28e1aacb9109645c001bbbfe1bc26%5E%21/#F0
// On macOS 10.11+, using Mach port sets may cause system instability, per
// https://crbug.com/756102. On macOS 10.12+, a kqueue can be used
// instead to work around that. On macOS 10.9 and 10.10, kqueue only works
// for port sets, so port sets are just used directly. On macOS 10.11,
// libdispatch sources are used. Therefore, there are three different
// primitives that can be used to implement WaitMany. Which one to use is
// selected at run-time by OS version checks.
So, kPrimitive is supposed to be set to:
- KQUEUE on macOS 10.12+
- DISPATCH on macOS 10.11
- PORT_SET on macOS 10.9 and 10.10
This was set by Chromium via:
const WaitManyPrimitive kPrimitive = mac::IsAtLeastOS10_12() ? KQUEUE : (mac::IsOS10_11() ? DISPATCH : PORT_SET);
However, Chromium Legacy uses:
const WaitManyPrimitive kPrimitive = mac::IsAtLeastOS10_12() ? KQUEUE : DISPATCH;
I think this is wrong, isn't it? We're using the 10.11 codepath on 10.7–10.10, whereas Google explicitly built something separate for 10.9–10.10, although it's not clear why. (Since official Chrome doesn't support anything older than 10.11, it seems this code is now only used for iOS?)
Unfortunately, it seems @blueboxd changed this early on to "fix instabilities on loading DevTool and plugins".
Assuming that's still needed... this whole Mac-specific WaitableEvent implementation is a relatively recent addition to Chromium, I wonder if we could just revert the whole thing and fall back to whatever cross-platform implementation Chromium was using before? Could be more maintainable anyway.
Hm it's an interesting find. The bug comments also suggest that watching videos (or doing something with GPU heavy activity) is more likely to trigger the bug.
Libdispatch does make use of kevent in its work loop from what I can see, so it's a decent guess. But I'm not really familiar with this part of osx internals so I cannot say for sure.
The comment is a bit hard to parse, but from the surrounding discussion it seems like
It's not clear why Google didn't use libdispatch on <= 10.9, but I suppose they wanted the minimal fix possible to avoid introducing any additional instability.
I'm not sure about the "fix instabilities on loading DevTool and plugins" thing. What kind of instability was seen, and how was it narrowed down to the WaitMany implementation?
The Mac-specific WaitableEvent implementation seems to have been added in v62, and 10.9 was officially supported up to v68, so we may not need to revert that entirely (note it's possible that the panic was simply never caught, but I don't remember ever running into any kpanics when I was still using those versions).
note it's possible that the panic was simply never caught, but I don't remember ever running into any kpanics when I was still using those versions
Because Chrome v62–68 set kPrimitive = PORT_SET on 10.9/10.10, whereas Chromium Legacy does not. That was the basis of my theory, anyway—I'm still confident that if an official version of Chrome ever led to kernel panics on an officially-supported OS, we would know about it. Someone somewhere would have reported it to Google and they would have treated it as a serious issue.
The easiest and most obvious thing would be to match what Chrome does—except, Bluebox may have had a reason for changing it in the first place. Maybe related to Lion support? (Notably, we know the KP exists on Lion because @pjpreilly has seen it and runs Lion.)
I guess the only way to find out at this point is to try... I'll see if I can find some spare cpu time to recompile Chromium this week, since BlueBox seems to still be away (or has wisely chosen to stay out of this quagmire).
The good news is I rebuilt Chromium with kPrimitive = PORT_SET on 10.7–10.10. The bad news is I triggered the would-be kernel panic in this version of Chromium. Oh well.
But now that I have Chromium building again, maybe I can experiment with some other stuff... I neglected to add the MessagePumpKqueue logging krackers suggested, for example.
The next thing I want to try is wholesale getting rid of https://bugs.chromium.org/p/chromium/issues/detail?id=932175 and going back to the POSIX implementation. This change was made in 2019 after Mavericks support was dropped, and it clearly touches the kernel and kevent.
This is proving a bit challenging, as a lot of files were touched and things have been renamed, etc.
Okay, actual progress! I just noticed something I should have noticed earlier!
The log message I added to XNU which indicates an averted kernel panic is always immediately followed, within a second, by a crash from a Chromium Helper process! These crashes do not appear to have any user-visible effects, but they do provide a backtrace. See attached!
So, I can now say pretty confidently that the bug is somewhere in MessagePumpKqueue.
Chromium Helper_2022-03-14-112231_Jonathans-Mac-Pro.crash.zip
Sorry for the late response, that's a very interesting find. It means that the EBADF code path is actually not supposed to happen at all, since their code only has checks for EINTR, hence the process crash. I don't know if this means that on a 10.11 machine the Chromium Helper process would have crashed as well, or whether it's a more fundamental bug in 10.9 Mach IPC and adding the EBADF case in the kernel source just works around the issue / shifts the symptom from a stop-the-world kernel panic to a process crash.
Either way, as you mentioned, it confirms that the kpanic is triggered while one thread is waiting on kevent inside MessagePumpKqueue. It's still not clear, though, what the other thread/process is doing to cause the EBADF to be signalled. I didn't see anything immediately obvious in the backtrace for the other threads in your dump, so it could also be one of the other Chromium Helper processes? Then again, none of the log statements you added to the kernel got triggered, so that means the EBADF is being signalled from either vm_fault.c, sync_sema.c, or ipc_mqueue.c. But even if we narrow it down to one of those, I wouldn't know how to find the corresponding userspace Chromium code that results in that call, since all of those are pretty broad (I still bet it's in ipc_mqueue though).
I haven't looked too much into what it would take to revert the use of MessagePumpKqueue, but since all the platform-specific code is abstracted as a subclass of MessagePumpForIO, can we change that to the POSIX-generic MessagePumpLibevent? Although I see that some of the other Mac-specific parts have been switched to communicate using Mach ports, which means we'd need to go back and revert those too...
It's still not clear though what's the other thread/process doing the action that causes the EBADF to be signalled though. I didn't see anything immediately obvious in the backtrace for the other threads in your dump, so it could also be one of the other chromium helper processes?
The other thing I noticed was WaitableEvent (and specifically WaitableEvent::TimedWait) being near the top in several other threads. I still feel like that's implicated somehow, even though switching kPrimitive didn't resolve the issue.
I haven't looked too much into what it would take to revert the use of MessagePumpKqueue but it seems like since all the platform-specific code is abstracted as a subclass of MessagePumpForIO, can we change that to the posix-generic messagepumplibevent ?
Yes, that's what I was trying to do last week before I noticed the process crash! The original set of changes was documented here. I had a bit of trouble setting it back to the POSIX version, as the changes touched a number of different files which have themselves evolved since 2019, but I'd like to try again at some point.
Thank you for the deep inspection of this panic 🙇
I traced this panic with the kernel debugger before, but I couldn't find the root cause (there was nothing special about the parameters of kevent64 preceding the panic). Maybe it is not due to a single kevent64, but rather a combination of other events in the queue or running threads...?
@blueboxd Hey there! As a quick catch-up: I recompiled XNU so that whenever kqueue_scan_continue is called with a wait_result value of THREAD_RESTART, XNU will log a warning and return EBADF instead of panicking.
This resolves the kernel panic, and causes Chromium to merely crash instead. I attached a backtrace here. It points to Chromium's MessagePumpKqueue as the culprit.
I suspect that if we can fix that crash, the kernel panics will go away!
@Wowfunhappy Thank you for the catch-up!
The stack trace from the kernel debugger also indicates that kevent64 in MessagePumpKqueue::DoInternalWork caused the panic.
But, if the parameters for kevent64 are not the root cause, how can we resolve the panics/crashes...? (Simply rewriting DoInternalWork may not resolve the issue 🤔)
I'll try to modify the kernel for debugging too.
@aeiouaeiouaeiouaeiouaeiouaeiou Chromium crashes are not the topic of this issue. (this issue induces kernel panic in normal environment) Please open a new issue with the crash log.
It might be worth trying to figure out where the EBADF is signaled from in the kernel (since we ruled out the origins in the kevent code, that leaves only vm_fault.c, sync_sema.c, or ipc_mqueue.c). As you mentioned, it's likely some race condition somewhere between different threads that's causing EBADF to be sent to the thread while it's inside the kernel wait loop.
But, if the parameters for kevent64 are not the root cause, how can we resolve the panics/crashes...?
One thing we could try is using MessagePumpLibevent instead of MessagePumpKqueue. I think you'd just need to revert the set of changes described here: https://bugs.chromium.org/p/chromium/issues/detail?id=932175. The changes were a bit too complex for me, but you might have more success.
I've been avoiding reporting this, because I feel like it's going to be impossible to fix. :(
Occasionally while using Chromium Legacy, OS X kernel panics with a distinctive backtrace. The backtrace implicates Chromium Helper, and I've never seen the panic occur when I'm not using Chromium. I've observed this on my desktop and laptop running 10.9, as have others on MacRumors. I think it may happen more often on Macs with less memory?
A log is attached. Notably, I have so far never managed to capture a panic while I had keepsyms enabled; I'd like to capture one at some point.
Kernel_2022-01-11-182654_Jonathans-MacBook-Air.panic.zip
Everything beyond this point is speculation, and probably useless, due to my limited understanding of C. But, I'm going to include it anyway. :)
I'm technically running XNU version 2422.115.15, which doesn't appear to be open source, but the closest version with source available is 2422.115.4. The file being referenced in the log is here: https://opensource.apple.com/source/xnu/xnu-2422.115.4/bsd/kern/kern_event.c.auto.html
The line numbers don't match up, possibly because of the tiny version mismatch, but I assume this is the referenced function:
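Paraphrasing the relevant part of that function from the linked source (an excerpt, reproduced from memory, so details may be off):
static void
kqueue_scan_continue(void *data, wait_result_t wait_result)
{
	struct kqueue *kq = (struct kqueue *)data;
	int error;
	/* ... */
	switch (wait_result) {
	case THREAD_AWAKENED:
		/* re-scan the kqueue for pending events, possibly blocking again */
		break;
	case THREAD_TIMED_OUT:
		error = EWOULDBLOCK;
		break;
	case THREAD_INTERRUPTED:
		error = EINTR;
		break;
	default:
		panic("kqueue_scan_continue: - invalid wait_result (%d)", wait_result);
	}
	/* ... */
}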
So, the switch statement expects the wait_result parameter to be either THREAD_AWAKENED, THREAD_TIMED_OUT, or THREAD_INTERRUPTED. If it's none of these—as appears to be the case for us—it panics with our error, "invalid wait result".

So, what is being passed in as wait_result? Well, it's printing as the number "3", but I don't know enough C to understand how these integers are mapped to their human-readable names. However, note that a fourth option has been added to this switch statement in newer versions of XNU: https://opensource.apple.com/source/xnu/xnu-7195.81.3/bsd/kern/kern_event.c.auto.html

I'm going to hazard a guess that when the kernel panics, kqueue_scan_continue is getting called with a value of THREAD_RESTART in the wait_result parameter, which is understood on new versions of XNU but not the older ones we're using. Is there any way to catch and handle this?