blueboxd / chromium-legacy

Latest Chromium (≒Chrome Canary/Stable) for Mac OS X 10.7+
BSD 3-Clause "New" or "Revised" License

Chromium Legacy occasionally induces a kernel panic: kqueue_scan_continue: - invalid wait_result #44

Open Wowfunhappy opened 2 years ago

Wowfunhappy commented 2 years ago

I've been avoiding reporting this, because I feel like it's going to be impossible to fix. :(

Occasionally while using Chromium Legacy, OS X kernel panics with a distinctive backtrace. The backtrace implicates Chromium Helper, and I've never seen the panic occur when I'm not using Chromium. I've observed this on my desktop and laptop running 10.9, as have others on MacRumors. I think it may happen more often on Macs with less memory?

A log is attached. Notably, I have so far never managed to capture a panic while keepsyms was enabled; I'd like to capture one at some point.

Kernel_2022-01-11-182654_Jonathans-MacBook-Air.panic.zip


Everything beyond this point is speculation, and probably useless, due to my limited understanding of C. But, I'm going to include it anyway. :)

I'm technically running XNU version 2422.115.15, which doesn't appear to be open source, but the closest version with source available is 2422.115.4. The file being referenced in the log is here: https://opensource.apple.com/source/xnu/xnu-2422.115.4/bsd/kern/kern_event.c.auto.html

The line numbers don't match up, possibly because of the tiny version mismatch, but I assume this is the referenced function:

static void
kqueue_scan_continue(void *data, wait_result_t wait_result)
{
    thread_t self = current_thread();
    uthread_t ut = (uthread_t)get_bsdthread_info(self);
    struct _kqueue_scan * cont_args = &ut->uu_kevent.ss_kqueue_scan;
    struct kqueue *kq = (struct kqueue *)data;
    int error;
    int count;

    /* convert the (previous) wait_result to a proper error */
    switch (wait_result) {
    case THREAD_AWAKENED:
        kqlock(kq);
        error = kqueue_process(kq, cont_args->call, cont_args, &count,
            current_proc());
        if (error == 0 && count == 0) {
            wait_queue_assert_wait((wait_queue_t)kq->kq_wqs,
                KQ_EVENT, THREAD_ABORTSAFE, cont_args->deadline);
            kq->kq_state |= KQ_SLEEP;
            kqunlock(kq);
            thread_block_parameter(kqueue_scan_continue, kq);
            /* NOTREACHED */
        }
        kqunlock(kq);
        break;
    case THREAD_TIMED_OUT:
        error = EWOULDBLOCK;
        break;
    case THREAD_INTERRUPTED:
        error = EINTR;
        break;
    default:
        panic("%s: - invalid wait_result (%d)", __func__,
            wait_result);
        error = 0;
    }

    /* call the continuation with the results */
    assert(cont_args->cont != NULL);
    (cont_args->cont)(kq, cont_args->data, error);
}

So, the switch statement expects the wait_result parameter to be either THREAD_AWAKENED, THREAD_TIMED_OUT, or THREAD_INTERRUPTED. If it's none of these—as appears to be the case for us—it panics with our error, "invalid wait result".

So, what is being passed in as wait_result? Well, it's printing as the number "3", but I don't know enough C to understand how these integers are mapped to their human-readable names. However, note that a fourth option has been added to this switch statement in newer versions of XNU: https://opensource.apple.com/source/xnu/xnu-7195.81.3/bsd/kern/kern_event.c.auto.html

    case THREAD_RESTART:
        error = EBADF;
        break;

I'm going to hazard a guess that when the kernel panics, kqueue_scan_continue is getting called with a value of THREAD_RESTART in the wait_result parameter, which is understood on newer versions of XNU but not on the older ones we're using. Is there any way to catch and handle this?

pjpreilly commented 2 years ago

When I've triggered this, by chance, I always seem to be switching windows, either through hot corners invoking Mission Control or by clicking the Chromium Dock icon to switch windows, FWIW. Or maybe switching tabs: I've noticed that sometimes when switching tabs, the new tab doesn't take focus until a subsequent click in it. This seems to occur after the browser has been running for a good amount of time.

krackers commented 2 years ago

@Wowfunhappy Mapping is given in https://opensource.apple.com/source/xnu/xnu-4570.61.1/osfmk/kern/kern_types.h

You're right in that 3 corresponds to THREAD_RESTART. I'm not too familiar with kqueue internals, but just based on statically analyzing the call hierarchy, it seems to be:

kevent_internal -> kqueue_scan -> kqueue_scan_continue

It's really interesting that kevent syscall was only recently updated to return EBADF though. Seems like this is returned whenever the fd becomes invalid (e.g. if it's closed), which would definitely explain why it's sporadic and hard to reproduce. I wonder if this bug was just never noticed on older versions of osx because nothing else other than chrome seems to make such heavy use of IPC, and older versions of Chromium probably used some other method for synchronization/ipc [1]?

There are only a few places in chromium source that make the kevent syscall (search for "kevent" and "#include <sys/event.h>"). Might be interesting to add some debugging code around those to see if there's any pattern.

[1] https://source.chromium.org/chromium/chromium/src/+/2096391cdcc3bdf963e68977e9015aa8cdbe85c6

https://source.chromium.org/chromium/chromium/src/+/c96c632d5501d5ebba14e1130fdea8b15ac29756

Wowfunhappy commented 2 years ago

Thanks! I wonder if this was the underlying project which introduced the problem. The timing lines up: 9–12 months after 10.9 support was dropped from Chrome.

Aside, I assume it's not possible to just replace kqueue_scan_continue with a version that handles THREAD_RESTART, right? Because it's an internal kernel function, or something like that?

krackers commented 2 years ago

@Wowfunhappy Excellent question. It is possible if you are willing to install a kext. Basically, since all kexts live in the same address space as the kernel code, you can hot-patch the kernel by modifying its instructions in memory.

Now to actually do this patch you'll need to be a bit clever:

Conceptually not too difficult, and not that different from code injection in userspace. Only difference is that stakes are higher. If you have a VM and are willing to play around with it, I can give you some code pointers. Of course at this point it might just be easier to recompile the kernel itself since we have the source anyway...

Wowfunhappy commented 2 years ago

Thanks, makes sense it would need a kext. Probably not ideal if it can be helped.

(Just for my own use, I am a bit tempted to recompile XNU. But I run into this panic so rarely anyway... Some people on MacRumors were saying it happens daily for them.)

Wowfunhappy commented 2 years ago

MacBiter over on MacRumors captured a backtrace with keepsyms enabled.

panic(cpu 6 caller 0xffffff80055c7ab4): "kqueue_scan_continue: - invalid wait_result (3)"@/SourceCache/xnu/xnu-2422.115.15/bsd/kern/kern_event.c:2167
Backtrace (CPU 6), Frame : Return Address
0xffffff81f05a3ef0 : 0xffffff8005223139 mach_kernel : _panic + 0xc9
0xffffff81f05a3f70 : 0xffffff80055c7ab4 mach_kernel : _kqueue_scan + 0x8f4
0xffffff81f05a3fb0 : 0xffffff80052d7c67 mach_kernel : _call_continuation + 0x17

I assume this is _call_continuation? Still an internal kernel function; I naively thought it might give us Chromium's initial syscall.

Edit: And I think this is the definition... I wasn't expecting to find ASM...


Edit 2: I don't actually understand most of what it's saying, but pages 418 – 420 of Jonathan Levin's Mac OS X and iOS Internals may be relevant here. I think it's saying that wait_result is set by the thread_block_reason() function.


I'm just trying to figure out where that wait_result value could be coming from. I find it so weird that XNU is returning a value that didn't exist yet. It would make sense if Chromium was the one setting wait_result, since Google is only targeting modern kernels. But this is kernel code calling kernel code. I guess the obvious answer is that it's a bug, but it seems like something that wouldn't have gone unnoticed for years.

Wowfunhappy commented 2 years ago

...well, for what it's worth, I decided to build a custom kernel. As it turns out, compiling XNU is shockingly easy and the build itself takes only a few minutes to complete. The only tricky part was that I had to install Xcode 5.0.2; later versions did not appear to work.

The only source change is that I added the THREAD_RESTART case to kqueue_scan_continue, which sets error = EBADF as in newer versions of XNU.

I am (a bit nervously) going to try using this on my main machine. A half hour in, nothing catastrophic has happened so far.

The kernel is attached in case anyone else on Mavericks wants to try it, but I am not necessarily recommending that anyone do so. If you do use it, make sure at minimum to have a secondary partition available to restore your original kernel should the need arise. Remember that you will need to clear the kernel cache on your first boot with the new kernel.

@krackers Do you have any idea why this kernel would be only 8,257,152 bytes, when the original kernel and every custom Hackintosh kernel I've checked is 8,394,688 bytes? It's weirding me out.

XNU-2422.115.4-THREAD_RESTART add-JA-v2022..01.22.zip

krackers commented 2 years ago

@Wowfunhappy sorry for the late response.

I'm just trying to figure out where that wait_result value could be coming from. I find it so weird that XNU is returning a value that didn't exist yet

Note that even in the older xnu version, THREAD_RESTART was still a valid defined value for wait_result_t. In fact, you can see that it is returned in wait_queue_assert_wait64 of osfmk/kern/wait_queue.c when the queue becomes invalid. The difference is that in this older kernel version, kqueue scan did not properly handle this, and just panicked instead of returning EBADF.

I agree that this seems way too obvious an omission to be a bug. But I can't think of any other explanation at the moment (that said, my knowledge here is very slim). I think the first step might be to see if we can create a scenario that can reliably reproduce this. Based on my current understanding, it seems like the conditions for triggering would look like:

But this seems way too obvious and easy to trigger, so I'm still skeptical that this is actually what's happening.

As it turns out, compiling XNU is shockingly easy and the build itself takes only a few minutes to complete

I see you're following the "Kernel Shaman's" build instructions? I'm not quite sure why there would be a difference in size assuming that both are release builds, but it's possible that apple internal builds have some other ifdefs set during compile, or the kernel version is different? This wouldn't explain why the hackintosh kernels match in size though...

That's also why I'm personally partial to the "hotpatch via kext" approach. That way you introduce only the minimal difference necessary.

Wowfunhappy commented 2 years ago

@krackers I'm not ignoring your suggestions of things to try, they're just mostly outside of my skill level. 🙂 I only barely understand explanations of what a file descriptor is...

Anyway, the plot thickens. I neglected to check something earlier:

https://github.com/apple-oss-distributions/xnu/blame/cc9a63552421e7db025ad771974f3e9f479948ad/bsd/kern/kern_event.c

That THREAD_RESTART case was only added to XNU in 3789.1.32, aka macOS 10.12 Sierra.

But, Google Chrome officially supports macOS 10.11. So, why don't Chrome users on El Capitan experience kernel panics?

Could this indicate the problem was introduced by Chromium Legacy? If I'm not missing anything, I think the only Chromium Legacy changes that involve kevent come from this commit. https://github.com/blueboxd/chromium-legacy/commit/e0cff58db7f41e2b0b68e058ec49a98f215265c5

Google's code review comments for this change are interesting...

pjpreilly commented 2 years ago

On latest canary, Version 100.0.4889.0 (Official Build) (x86_64): two-finger scrolling in a Facebook Messenger chat....

Tue Feb 15 06:40:24 2022
panic(cpu 0 caller 0x55bb8e): "kevent_scan_cont() - invalid wait_result (3)"@/SourceCache/xnu/xnu-1699.32.7/bsd/kern/kern_event.c:1982
Backtrace (CPU 0), Frame : Return Address (4 potential args on stack)
0xe2bf68 : 0x2203de (0x6b08cc 0xe2bf88 0x229fb0 0x0)
0xe2bf98 : 0x55bb8e (0x7006a4 0x3 0x6 0xa4c39d4)
0xe2bfc8 : 0x2c7d0c (0x9f03320 0x3 0x10 0xa4e9190)

BSD process name corresponding to current thread: Chromium Helper

Mac OS version: 11G63

Kernel version: Darwin Kernel Version 11.4.2: Thu Aug 23 16:26:45 PDT 2012; root:xnu-1699.32.7~1/RELEASE_I386 Kernel UUID: 859B45FB-14BB-35ED-B823-08393C63E13B System model name: MacBook4,1 (Mac-F22788A9)

System uptime in nanoseconds: 71096824466362 last loaded kext at 27108605406: com.apple.driver.AudioAUUC 1.59 (addr 0x146b000, size 24576) last unloaded kext at 115835844270: com.apple.driver.USBCameraFirmwareLoader 1.1.0 (addr 0xd35000, size 24576) loaded kexts: com.apple.driver.AudioAUUC 1.59 com.apple.filesystems.autofs 3.0 com.apple.driver.AppleHWSensor 1.9.5d0 com.apple.driver.AppleHDA 2.2.5a5 com.apple.driver.AppleUpstreamUserClient 3.5.9 com.apple.driver.AppleBacklight 170.2.2 com.apple.driver.AppleMCCSControl 1.0.33 com.apple.driver.AppleIntelGMAX3100 7.0.4 com.apple.driver.AppleIntelGMAX3100FB 7.0.4 com.apple.driver.SMCMotionSensor 3.0.2d6 com.apple.iokit.IOUserEthernet 1.0.0d1 com.apple.iokit.IOBluetoothSerialManager 4.0.8f17 com.apple.Dont_Steal_Mac_OS_X 7.0.0 com.apple.driver.AudioIPCDriver 1.2.3 com.apple.driver.ApplePolicyControl 3.1.33 com.apple.driver.ACPI_SMC_PlatformPlugin 5.0.0d8 com.apple.driver.AppleLPC 1.6.0 com.apple.driver.AppleSMCPDRC 5.0.0d8 com.apple.driver.CSRUSBBluetoothHCIController 4.0.8f17 com.apple.AppleFSCompression.AppleFSCompressionTypeDataless 1.0.0d1 com.apple.AppleFSCompression.AppleFSCompressionTypeZlib 1.0.0d1 com.apple.BootCache 33 com.apple.driver.AppleUSBTrackpad 227.6 com.apple.driver.AppleUSBTCKeyEventDriver 227.6 com.apple.driver.AppleUSBTCKeyboard 227.6 com.apple.driver.AppleIRController 312 com.apple.iokit.SCSITaskUserClient 3.2.1 com.apple.driver.XsanFilter 404 com.apple.iokit.IOAHCIBlockStorage 2.1.0 com.apple.driver.AirPortBrcm43224 501.36.15 com.apple.driver.AppleFWOHCI 4.9.0 com.apple.driver.AppleHPET 1.7 com.apple.driver.AppleUSBHub 5.1.0 com.apple.driver.AppleAHCIPort 2.3.1 com.apple.driver.AppleIntelPIIXATA 2.5.1 com.apple.driver.AppleUSBEHCI 5.1.0 com.apple.driver.AppleUSBUHCI 5.1.0 com.apple.driver.AppleEFINVRAM 1.6.1 com.apple.driver.AppleRTC 1.5 com.apple.iokit.AppleYukon2 3.2.2b1 com.apple.driver.AppleSmartBatteryManager 161.0.0 com.apple.driver.AppleACPIButtons 1.5 com.apple.driver.AppleSMBIOS 1.9 
com.apple.driver.AppleACPIEC 1.5 com.apple.driver.AppleAPIC 1.6 com.apple.driver.AppleIntelCPUPowerManagementClient 195.0.0 com.apple.nke.applicationfirewall 3.2.30 com.apple.security.quarantine 1.4 com.apple.security.TMSafetyNet 8 com.apple.driver.AppleIntelCPUPowerManagement 195.0.0 com.apple.kext.triggers 1.0 com.apple.driver.DspFuncLib 2.2.5a5 com.apple.driver.AppleBacklightExpert 1.0.4 com.apple.driver.AppleSMBusController 1.0.10d0 com.apple.driver.AppleHDAController 2.2.5a5 com.apple.iokit.IOHDAFamily 2.2.5a5 com.apple.iokit.IOSurface 80.0.2 com.apple.iokit.IOFireWireIP 2.2.5 com.apple.iokit.IOAudioFamily 1.8.6fc18 com.apple.kext.OSvKernDSPLib 1.3 com.apple.driver.AppleGraphicsControl 3.1.33 com.apple.iokit.IONDRVSupport 2.3.4 com.apple.iokit.IOGraphicsFamily 2.3.4 com.apple.iokit.IOSerialFamily 10.0.5 com.apple.driver.AppleSMC 3.1.3d10 com.apple.driver.IOPlatformPluginLegacy 5.0.0d8 com.apple.driver.IOPlatformPluginFamily 5.1.1d6 com.apple.driver.AppleUSBBluetoothHCIController 4.0.8f17 com.apple.iokit.IOBluetoothFamily 4.0.8f17 com.apple.driver.AppleUSBMergeNub 5.1.0 com.apple.iokit.IOUSBHIDDriver 5.0.0 com.apple.driver.AppleUSBComposite 5.0.0 com.apple.iokit.IOSCSIMultimediaCommandsDevice 3.2.1 com.apple.iokit.IOBDStorageFamily 1.7 com.apple.iokit.IODVDStorageFamily 1.7.1 com.apple.iokit.IOCDStorageFamily 1.7.1 com.apple.iokit.IOATAPIProtocolTransport 3.0.0 com.apple.iokit.IOSCSIArchitectureModelFamily 3.2.1 com.apple.iokit.IO80211Family 420.3 com.apple.iokit.IOFireWireFamily 4.4.8 com.apple.iokit.IOUSBUserClient 5.0.0 com.apple.iokit.IOAHCIFamily 2.0.8 com.apple.iokit.IOATAFamily 2.5.1 com.apple.iokit.IOUSBFamily 5.1.0 com.apple.driver.AppleEFIRuntime 1.6.1 com.apple.iokit.IONetworkingFamily 2.1 com.apple.iokit.IOHIDFamily 1.7.1 com.apple.iokit.IOSMBusFamily 1.1 com.apple.security.sandbox 177.11 com.apple.kext.AppleMatch 1.0.0d1 com.apple.driver.DiskImages 331.7 com.apple.iokit.IOStorageFamily 1.7.2 com.apple.driver.AppleKeyStore 28.18 
com.apple.driver.AppleACPIPlatform 1.5 com.apple.iokit.IOPCIFamily 2.7 com.apple.iokit.IOACPIFamily 1.4

jgrisham commented 2 years ago

But, Google Chrome officially supports macOS 10.11. So, why don't Chrome users on El Capitan experience kernel panics?

Do those (macOS 10.11 Chrome users) actually exist? ;)

AFAIK, there aren't many (any?) macs where that is the latest installable version.

Systems tend to max out on 10.4, 10.6.8, 10.7.5, 10.11, 10.13.x, etc...


We recently identified an MS Outlook / Windows MAPI bug that silently eats emails... but only for IMAP mailboxes, and only for maybe one out of a thousand. It has existed for _at least_ a decade, likely two. _(It was likely dormant for most of that time, surfacing only as mail processors began to support [ARC](https://en.wikipedia.org/wiki/Authenticated_Received_Chain) in late 2019/2020.)_

- From the public reporting, it wouldn't be _at all surprising_ if Apple had overlooked intermittent kernel bugs for just as long, _especially_ one that didn't generally cause panics on iOS / tvOS.
- The larger the company _(and the older/more stable the codebase)_, the harder it is for even well-meaning and talented engineering teams to find out that edge cases exist - there are unintentional gatekeepers at all levels.

Tl;dr: it's plausible that it had been broken all along (i.e. not backported to earlier versions when it was finally discovered / fixed in XNU 3789.1.32).

Wowfunhappy commented 2 years ago

Do those (macOS 10.11 Chrome users) actually exist? ;)

I mean, anecdotally, I've seen several of them in the wild! I even helped a third grade schoolteacher add the Let's Encrypt certificate to Keychain Access last fall.

This may well be a kernel bug, but I don't think it's happening in official versions of Chrome. A kernel panic is more difficult to overlook than a missing email, and this one appears to happen to all Chromium Legacy users sooner or later, albeit very infrequently. I don't see how Google, with all of their users and analytics, wouldn't know about this problem if it existed.

It seems much more likely to me that https://github.com/blueboxd/chromium-legacy/commit/e0cff58db7f41e2b0b68e058ec49a98f215265c5 is culpable, somehow...


† Technically, my understanding is that any instance of a userspace application triggering a kernel panic is generally considered a kernel bug, since userspace shouldn't be able to take down the kernel.

jgrisham commented 2 years ago
> Technically, my understanding is that _any_ instance of a userspace application triggering a kernel panic is generally considered a kernel bug, since userspace shouldn't be able to take down the kernel.

That's what I always thought too - anything else seems like users and userspace devs are just being gaslit by kernel devs _(I guess it's technically only gaslighting if it's intentional)_.

> A kernel panic is more difficult to overlook ...

Indeed.

> this one appears to happen to _all_ Chromium Legacy users sooner or later, albeit very infrequently

Do we actually know that? I mean, [how could we](https://hanadigital.github.io/grev/?user=blueboxd&repo=chromium-legacy) without a major analytics operation? _(I'm honestly curious - not trying to be argumentative!)_ ... yet, someone who recompiles XNU for fun in their spare time also said:

> I've been avoiding reporting this ...

`Chromium` is one of the most resource-intensive users of a heterogeneous variety of OS resources that many users are likely to run these days. _It's also liable to be used in very different ways on each launch; even users visiting the same website day after day will ultimately be running different JS from one month to the next._

> with all of their users and analytics

I _really_ want to believe, but for something that doesn't significantly impact PR or ad revenue, causing occasional and non-reproducible crashes for users of older versions of only one operating system? Simple things break complex systems, and yet only seem simple in retrospect. It would _not be at all surprising_ if they had much more impactful bugs languishing in their backlog, both _then and now_. Knowledge without action and all...

Consider:

Also:

Neither one of those fixed kernels will get to the majority of affected users here, so we do have to continue trying to mitigate this kernel bug by rolling back or modifying our userspace code.


Tl;dr: Thank you for trying to hunt it down!

krackers commented 2 years ago

Looking again at the kernel kevent code, it seems there are very few places where THREAD_RESTART is actually signaled. The most likely source for the one received in kqueue_scan seems to be

https://github.com/aosm/xnu/blob/os-x-1095/osfmk/kern/wait_queue.c#L1335

    /* If it is an invalid wait queue, you cant wait on it */
    if (!wait_queue_is_valid(wq))
        return (thread->wait_result = THREAD_RESTART);

and in the kevent code kq->kq_wqs is the wait queue as part of the kq struct. This again makes me think that somehow the kqueue is being closed by some other process and we're not handling that.

@Wowfunhappy can you add some log statements to each of the sites in kern_event.c that call into wait_queue_assert, and then an additional log statement in the place where we would have panicked before? To avoid too much data, guard each log statement with !wait_queue_is_valid(wq). (Let me know if you need clarification here.) Collecting more data might be useful to try to recreate the exact conditions under which this is triggered.

As for userspace workarounds, assuming that hypothesis is correct then what I can think of is to try to ensure that no thread is waiting on a kqueue that is to be closed. So we'd have to send some sort of interrupt signal to break the wait on kqueue before closing the fd? Not too sure since I'm not too familiar with these parts...

As for the reverted commit you linked, at first glance I don't see anything too suspicious there. Why did that commit need to be reverted for 10.9 compatibility though? I don't see anything that's 10.11+ specific.

Wowfunhappy commented 2 years ago

@Wowfunhappy can you add some log statements to each of the sites in kern_event.c that call into wait_queue_assert, and then an additional log statement in the place where we would have panicked before? To avoid too much data, guard each log statement with !wait_queue_is_valid(wq). (Let me know if you need clarification here.) Collecting more data might be useful to try to recreate the exact conditions under which this is triggered.

I'll try next week!

As for the reverted commit you linked, at first glance I don't see anything too suspicious there.

I don't understand this code, so I'm just looking at the timing of when things were added and the differences between codebases. I'm assuming:

  1. The offending code has to make use of the kevent system call.

  2. There has to be something different in Chromium Legacy's codebase and mainline Chromium's codebase, which is causing Chromium Legacy to kernel panic on 10.11 and below but not mainline Chromium. (This is based on a separate assumption that Chromium Legacy would kernel panic if used on 10.11, since 10.11 contains the offending XNU code.)

The commit I linked contains the only difference I could find between the two codebases which uses kevent.

krackers commented 2 years ago

@Wowfunhappy Assumption 1 isn't necessarily true. If we use the working hypothesis that the kpanic is caused by attempting to use an invalid (freed) kqueue, then you'd also have to search for references for code which uses MessagePump, since when that instance goes out of scope its destructor will free the kqueue fd.

There's also exactly one other place where THREAD_RESTART is signalled, and that's in some semaphore-related method. If the kpanic is triggered by that, then we're really in a pickle, since that would mean the root cause might not even be strictly related to kqueue operations.

Wowfunhappy commented 2 years ago

@krackers I ran into some stupid trouble guarding the logging statements. wait_queue_is_valid doesn't exist in kern_event.c.

I tried to copy the two defines from: https://github.com/aosm/xnu/blob/0a13c7a4dcb9ab989727c2b0e08f7861457ed069/osfmk/kern/wait_queue.h#L161, adding:

#define _WAIT_QUEUE_inited      0x2
#define wait_queue_is_valid(wq) \
    (((wq)->wq_type & ~1) == _WAIT_QUEUE_inited)

But the compiler complained: no member named 'wq_type' in 'struct wait_queue'

The wait_queue struct is defined in kern/wait_queue.h, and wq_type is ostensibly public, but adding #include <kern/wait_queue.h> to kern_event.c had no effect. I don't want to copy-paste the struct into kern_event.c because I don't know if defining it in a new place will do something weird.

I'm sure I'm doing something dumb. 🙂

For now, I've attached a kernel with unguarded logging statements before each call of wait_queue_assert in kern_event.c. The source is this revision of: https://gist.github.com/Wowfunhappy/a214411bb52d39bcee9497363a97558d/440e40ec82da87f4240f22981a020e0f169ea915. It works in my VM, but it's noisy enough that I don't want to leave it running on my real system. It seems wait_queue_assert_wait_with_leeway is called from kqueue_scan many times per second.

Also for reference, the guarded version I can't compile: https://gist.github.com/Wowfunhappy/a214411bb52d39bcee9497363a97558d

XNU-2422.115.4-THREAD_RESTART add-JA-v2022.02.21.zip

Edit: Issue was closed by accident, please disregard.

krackers commented 2 years ago

@Wowfunhappy

No, you're not doing something wrong; the way their includes work is tricky and took me a good 10 minutes to figure out.

Very roughly, when all the headers are put together, the net result is something like:


#include <stdio.h>

typedef struct wait_queue       *wait_queue_t;
#ifndef MACH_KERNEL_PRIVATE

struct wait_queue { unsigned int opaque[2]; int *opaquep[2]; } ; // from kern_types

#else

typedef struct wait_queue { // from wait_queue.h
    unsigned long int                 
    /* boolean_t */ wq_type:2,      /* only public field */
                    wq_fifo:1,      /* fifo wakeup policy? */
                    wq_prepost:1,   /* waitq supports prepost? set only */
                    wq_eventmask:2; 
    int wq_interlock;   /* interlock */
    int wq_queue;       /* queue of elements */
} WaitQueue;

#define wait_queue_is_valid(wq) \
    (((wq)->wq_type & ~1) == _WAIT_QUEUE_inited)

#endif

In particular, note that we actually have two different declarations of the wait_queue struct, selected based on whether MACH_KERNEL_PRIVATE is defined. I assume one is intended to be used as an opaque type (think "private" in an OO language) to hide internal members, while the other is used only within wait_queue.c. Interestingly, for opaque types you normally just leave the declaration without any fields, but here they explicitly declare some dummy fields of the proper size, which I assume was done so sizeof() would work properly.

This is also why just #including the header didn't work: the proper #define wasn't set, so the private declaration never made it past the preprocessor.

Now let's work around this. You're right in that we don't want to just directly copy-paste the struct since we'll run into duplicate definition error. Instead let's do it in a slightly smarter way where we create our own struct with identical layout and then cast to that to access fields.


#include <stdio.h>

typedef struct wait_queue       *wait_queue_t;

struct wait_queue { unsigned int opaque[2]; int *opaquep[2]; }; // from kern_types

typedef struct wowfunhappy_wait_queue { // happy little wait queue
    unsigned long int
    /* boolean_t */ wq_type:2,      /* only public field */
                    wq_fifo:1,      /* fifo wakeup policy? */
                    wq_prepost:1,   /* waitq supports prepost? set only */
                    wq_eventmask:((sizeof(long) * 8) - 4);  /* remaining bits */
} WowFunHappyWaitQueue;

#define _WAIT_QUEUE_inited      0x2
#define wait_queue_is_valid(wq) \
    (((wq)->wq_type & ~1) == _WAIT_QUEUE_inited)

void buzz(wait_queue_t wq) {
    printf("%d", wait_queue_is_valid((WowFunHappyWaitQueue *)wq));
}

Note two important things:

1. This is technically undefined behavior, I think, although if you were to use a union instead of a cast to do the type-punning it might be standards compliant. But who cares about standards so long as it works (and casting should indeed work).

2. I only included the initial members of the struct here and left out the rest. This should be fine because, IIRC, type-punning is safe for any shared prefix of fields.

Wowfunhappy commented 2 years ago

Unfortunately, I have been forced to revert back to a vanilla XNU kernel on my personal machine.

I woke up this morning and found that my Mac had been logged out of iMessage. When I tried to log back in, the app complained of an unknown error. After a lot of reboots, password changes, network changes, and other dead ends, I finally tried reverting back to stock XNU, and iMessage instantly began working again.

If anyone else is using one of the kernels I attached, and iMessage stops working for you, you know how to fix it. And for what it's worth, I experienced zero panics during my month with the custom kernel.

Stuff related to iMessage and custom kernels:

I'm not entirely sure what happened, since I was using my custom kernel for nearly a month without problems. It may be that because I was already logged into iMessage when I switched kernels, the app kept working up until Apple sent some kind of heartbeat message. The [imessage_debug](https://www.insanelymac.com/forum/files/file/278-imessage_debug/) utility reported failures when retrieving the values of `Gq3489ugfi`, `Fyp98tpgj`, `kbjfrfpoJU`, `oycqAZloTNDm`, and `abKPld1EcMni` on the custom kernel (and success with the stock kernel), but the software's authors neglected to include any documentation as to what these values mean, where they come from, or why they're important. 🤷‍♂️ Regardless, I personally value iMessage compatibility, and the system is enough of a crapshoot that I'd rather not mess with it and potentially get my account flagged. Perhaps I'll see if I can figure out a kext patch instead.

krackers commented 2 years ago

That's interesting. Based on what I found online, it seems the cause is:

Publicly-available XNU sources lack the necessary functions needed for generating the IOPower data (just look at IORootParent::initialize() in the kernel binary vs. what's in the source).

I assume that the hackintosh community has worked around this by just doing patching instead; there's a robust patching framework they've built up (Lilu). Another alternative might have been to create a kext that populates these values in the IO registry, but I don't know why they didn't go that route. I'm not too familiar with the hackintosh scene, so I don't know how they obtain these values in the first place if they don't have an existing Mac to copy from (based on https://dortania.github.io/OpenCore-Post-Install/universal/iservices.html#fixing-en0, it looks like they generate random serial numbers?).

I personally don't like the idea of using a monolithic framework to do the patching (and Lilu is bloated, since it tries to handle both userspace and kernel patching). [Onyx the black cat](https://github.com/gdbinit/onyx-the-black-cat) is a good reference, since it contains some useful functions we might want (especially the write-enable on kernel pages). Most of those patches are in-place binary patches though, whereas for a slightly more involved patch like this I'd usually want to insert a trampoline.

I tried to find any existing examples of kexts which have a trampoline insert that we can steal (there are many userspace ones like mach_override that we can port if really needed since the core logic is the same, but I'm lazy and don't want to think too much). I found

https://github.com/lilang-wu/EasyBP/blob/d2d9591417df52b94d21945efb8cea393dc46a9b/makedebugpoint/inline-hook/inline_hook.c

https://github.com/seclab-ucr/SyzGen_setup/blob/5a86c59e3e26ba7c259fbc19de89197dc4edd674/kcov/kcov/hook.c

https://github.com/SilverMoonSecurity/PassiveFuzzFrameworkOSX/blob/be29a4df6e7b2394774a8cc3019bde03d816158c/src/kernel_fuzzer/the_flying_circus/InlineHook/inline_hook.c

(interestingly, they all seem to be the same-ish file, but don't have any attribution...)

krackers commented 2 years ago

Er, actually, doing a trampoline is going to be a bit annoying, because we need to ensure the return value is actually used. We can either implement the trampolined function in assembly so we don't need to worry about C messing things up, fuss around with the proper syntax to ensure that wait_result gets mapped to the function parameter and the return value gets mapped to the register that contains error, or just forget about doing this properly, patch the default section to return EBADF directly, and hope that this doesn't mask any other bug.

In terms of hacker spirit I still like the trampoline though, and if we set up the framework for that it will be easy to insert the logging I mentioned. After going through the candidates I like the approach taken by

https://github.com/seclab-ucr/SyzGen_setup/blob/5a86c59e3e26ba7c259fbc19de89197dc4edd674/kcov/kcov/hook.c

since it is the easiest for me to understand and flexible enough to coax into doing what we want.

@Wowfunhappy Either way (whether or not we opt to use a trampoline), let's start off by creating a simple kext that just reads the contents of the kqueue_scan_continue function address, to make sure we are accessing the correct address. https://github.com/lilang-wu/EasyBP/blob/d2d9591417df52b94d21945efb8cea393dc46a9b/makedebugpoint/solve_kernel_symbols/kernel_info.c has a nifty solve_kernel_symbol function, but I don't think we can use it because kqueue_scan_continue is not an exported function. Instead, we need to find the corresponding block in the disassembled binary to get the relative address (relative to the kernel base) and then add the KASLR offset to it to get the absolute address in memory.

Once you have the relative address, can you create a basic "hello world" kext (https://craftware.xyz/tips/Building-kernel-extension.html) that uses the kaslr_slide function from https://github.com/lilang-wu/EasyBP/blob/d2d9591417df52b94d21945efb8cea393dc46a9b/makedebugpoint/solve_kernel_symbols/kernel_info.c to dump out some bytes from the absolute address to see if they match the assembly we expect? Once we have that then we can take the next steps...
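Concretely, the address arithmetic is simple. Here's a user-space C++ sketch of it; note that all numeric values below are made-up placeholders (not the real 10.9 base or offset), and `SlidAddress` is a name I invented for illustration:

```cpp
#include <cstdint>

// Placeholder values for illustration only. The real static base comes from
// the kernel Mach-O header, and the real relative offset comes from finding
// kqueue_scan_continue in the disassembled kernel binary.
constexpr uint64_t kStaticKernelBase = 0xffffff8000200000ULL;
constexpr uint64_t kFuncRelOffset    = 0x3f5c10ULL;

// In the kext, the KASLR slide is recovered at runtime (e.g. by locating the
// running kernel's Mach-O header in memory and subtracting the static base).
uint64_t SlidAddress(uint64_t static_base, uint64_t rel_offset,
                     uint64_t kaslr_slide) {
    // absolute address = link-time address of the function + slide
    return static_base + rel_offset + kaslr_slide;
}
```

The kext would then read a handful of bytes at the resulting address and compare them against the prologue bytes seen in the disassembly before touching anything.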

Wowfunhappy commented 2 years ago

The Hackintosh community has some good reasons for avoiding custom kernels. Apple releases source code on a delay, and if you're running a non-ancient version of macOS, every update will potentially overwrite your custom kernel with a stock one, and leave you with a non-bootable system if you aren't careful. The more recent OSS releases also leave out really critical things like sleep and power management.

I'd thought that on an EOL OS, there was a certain purity to just fixing an obvious bug in an open-source component and recompiling—after all, what good is code that's never built? But it was never a great solution anyway, and it appears that even Apple's older code releases aren't as complete as I thought.

I am a Hackintosh user (on one of my two machines) and I'm already using Lilu, so it might make sense for me to go that route; it mostly depends on what I can actually manage to do. Downside: AIUI, Lilu has to be injected via a Hackintosh bootloader, because it needs to be loaded before the kexts it's patching. (Although, wait a minute, we're not trying to patch a kext, we're trying to patch the kernel—will this work at all? I know you said it's all the same address space, but then why does the loading order matter...?)

But, I also need to step back for a moment—I'm not sure what we're trying to accomplish here. Even without Lilu, I wouldn't feel comfortable recommending a patch like this to most Chromium Legacy users. The custom kernel had the advantage of being easy, I was done in a couple hours.

How useful do we think this logging data will be? Do we actually have a chance of fixing this properly in userspace?

Edit: I started writing this before your more recent replies/edits @krackers. Okay, if I attempt to keep going with this I'll follow from your starting point, thank you! I would still like to step back for a minute and consider whether this a smart avenue to keep exploring—what are we hoping to find out?

krackers commented 2 years ago

@Wowfunhappy

Yes, Lilu will work, since it supports patching kernel functions as well. But I don't like it here, because I don't think using Lilu gives us any strong benefit we couldn't get ourselves by copying a few header files of useful functions. Lilu only takes care of boilerplate; we'd still have to supply the core logic, and we'd also have to learn the Lilu API, which negates the benefit of having the boilerplate taken care of. I think it's much simpler and more understandable to just make a vanilla kext?

But, I also need to step back for a moment—I'm not sure what we're trying to accomplish here. Even without Lilu, I wouldn't feel comfortable recommending a patch like this to most Chromium Legacy users.

The end result will be a kext that users can just place in /Library/Extensions or wherever to hotpatch the kernel on the fly. It will only work for users on the same kernel version you based the offsets on, though; people on other kernel versions will have to recompile the kext with their own offsets. It's also fun for learning, but yes, it is a bit more work than just recompiling XNU. Up to you whether the pain of the occasional kernel panic is worth spending a few weekends trying to create a kext. Happy to provide guidance along the way. I edited my previous response https://github.com/blueboxd/chromium-legacy/issues/44#issuecomment-1048316473 with some concrete starting steps if you've decided to take this route.

How useful do we think this logging data will be? Do we actually have a chance of fixing this properly in user-space?

To be honest, I don't know. If we can determine for sure that it is caused by a freed kqueue fd, then we need to take the next step and figure out why the fd is being freed. This would involve adding some logging code in Chromium (maybe in the destructor for the message pump?). If we verify that is indeed the cause, then the next step would be to create some workaround where we only free the kqueue after ensuring no thread is blocked waiting on it. We could try to make that change right now, but it seems to me that the effort required to figure out how to implement it (we'd have to study exactly what type of event to send to interrupt an existing kqueue wait), combined with the effort of recompiling Chromium, is much more than doing the kext.

Other option is if you can try to find commits involving messagepump we can try to see if one of them might be responsible, but this is more of a shot in the dark compared to the kernel patch which we know will work. If you're optimizing for time spent vs. reward then the kernel patch approach seems to be a surer bet.

krackers commented 2 years ago

It's also possible that chromium is calling into a system framework/api which then makes use of kqueue in this manner. Can we run chrome under gdb or with dtrace to figure out where the kevent syscall is being made?

**Tangential thing**

Completely unrelated, but if you like reading tales about old versions of Chrome and patching things, did you know that newer versions of Chrome actually fixed (well, more like worked around) a CoreText bug in 10.9? Under very rare and infrequent conditions (usually when rendering CJK text + emoji, but not deterministically), the CoreText shaper will get stuck when using an AAT font and just go into an infinite loop. You can see in https://github.com/harfbuzz/harfbuzz/blob/main/src/hb-coretext.cc#L187 that Google actually discovered this issue and added a workaround in v58 or so. But the workaround is not perfect, and there are still conditions which trigger the bug (a combination of emoji and kanji characters, from what I've seen).

I actually went on a wild chase some time back trying to see if I could fix this myself, by hooking and hotpatching that function to determine which parameters caused CoreText to crash. I could have recompiled Chromium, but figuring out how to build Chromium seemed harder than just hotpatching. In fact, even figuring out how to parse the crash dump format from non-debug builds (minidump format) was quite an exercise. As you know, for security Chrome divides the main browser process from the helper processes, which are spawned from a different binary. SIMBL works fine to inject into the main Chrome process (surprisingly, Chrome is alright with the injection; I would have thought they'd do crazy tricks to avoid process injection like on Windows, but seems like not), but since Chrome Helper does not load any Cocoa libraries, we cannot SIMBL-inject into it.

We can't add a dylib to the helper binary, because it is codesigned with `CS_RESTRICT` (and removing the codesigning seemed like a bad idea, since there are some entitlements it needs, I think). But the helper binary also loads the Chrome framework dylib (manually via `dlopen`, actually, not via the usual dynamic linking), so we _can_ modify that Chrome framework without tripping any security issues. I tried this and it worked, but changing the signature does prevent some keychain-related things from working, which I didn't like. I finally realized that what I really want is to run `mach_inject` automatically for every helper process spawned. Newer versions of MacForge do this, but on 10.9 mySIMBL only does traditional osax injection.

(Btw @Wowfunhappy, I noticed many of your plugins are done via the insert_dylib approach; you should consider packaging them as SIMBL plugins instead, since that's much easier to enable/disable and doesn't break signatures.)

**Continuing the tangential thing**

So I thought some more and realized that to do `mach_inject` we only really need a way to find the pid of the spawned process. It turns out that Chrome launches the helper via `posix_spawnp`, and we can hook this call via standard patching techniques. So in my mind I constructed this wacky chain where I'd SIMBL-inject into the main Chrome process, then hotpatch to interpose `posix_spawnp`, then take the pid result from that call and call `mach_inject` to inject a blob into the helper, and the injected blob would finally hotpatch `harfbuzz_shape`. _Of course, after more thinking, I realized this was absolutely insane, and I just removed the problematic emoji that was triggering the crash from my webpage._

EDIT: Fwiw, I just realized a way to do this is via DYLD_INSERT_LIBRARIES; it automatically gets propagated to all spawned child processes, and does not break code signing. You can then use mach_override or any other PLT-hooking library.

Oh, so back to how newer versions of Chromium fixed it. It turns out that newer versions of harfbuzz don't call into CoreText at all to shape AAT fonts, so the problem is sidestepped entirely. Very fortunate, and it probably also avoids a whole class of security vulnerabilities. Anyway, thank you for reading.

Wowfunhappy commented 2 years ago

@krackers I'm going to continue the tangential discussion a bit here to try to leave this issue as focused as possible.

Wowfunhappy commented 2 years ago

It's also possible that chromium is calling into a system framework/api which then makes use of kqueue in this manner.

The crash log is pretty explicit though:

BSD process name corresponding to current thread: Chromium Helper

If it was a system framework, wouldn't that be showing up instead?

I also took a quick look in Instruments, and can confirm that Chromium Helper is certainly making BSC_kevent64 syscalls. Whether these are the syscalls that sometimes lead to a panic, I have no idea.

Edit: They're all coming from one place. See the attached screenshot (Screen Shot 2022-02-27 at 10.46.30 AM).

krackers commented 2 years ago

If it was a system framework, wouldn't that be showing up instead

I think even if it calls into a system framework, it's still part of the same Chrome Helper process?

But yes, based on your trace, it seems like it is indeed Chromium's message pump. Can you try recompiling Chromium and adding some logging in the destructor for MessagePumpKqueue (assuming that is the class which holds the kqueue reference)? I assume there's one message pump per Chrome Helper process, which means it might not be too noisy to run it and see if object destruction correlates with the kernel panic trigger.
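To sketch what that logging might look like — this is a standalone mock, not Chromium's real base/message_loop code; the class name mirrors Chromium's MessagePumpKqueue, but everything else here is invented for illustration:

```cpp
#include <cstdio>

// Standalone mock of Chromium's MessagePumpKqueue lifecycle, with logging
// added so teardown can be correlated with the kernel-side panic message.
class MessagePumpKqueue {
 public:
  explicit MessagePumpKqueue(int kqueue_fd) : kqueue_fd_(kqueue_fd) {}

  ~MessagePumpKqueue() {
    // Hypothesis being tested: if this destructor runs (closing the kqueue
    // fd) while another thread is still blocked in kevent64() on it, the
    // 10.9 kernel later hits kqueue_scan_continue with THREAD_RESTART.
    fprintf(stderr, "MessagePumpKqueue destroyed, kqueue fd=%d\n", kqueue_fd_);
  }

  int fd() const { return kqueue_fd_; }

 private:
  int kqueue_fd_;  // the real class owns an actual kqueue() descriptor
};
```

Matching the timestamp of this message against the kernel's "averted panic" log line would confirm or rule out the freed-kqueue theory.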

Wowfunhappy commented 2 years ago

I'll try if I can! It may be a while; grad school is picking up, so I'm trying (key word) to spend less time screwing around with computers.

Bluebox, if you happen to read this (I assume you're busy with other life stuff), you may want to add this logging yourself, since you have the hardware to compile Chromium easily.

Wowfunhappy commented 2 years ago

I'm not sure why (although my best guess is it's something in the latest stable version of Chromium), but I've encountered many, many more kernel panics than usual over the past few days. This is good in some ways, because it means I may be able to capture useful data more easily.

I decided to switch back to the custom kernel for a time, now with the logging messages. @krackers, this is the code I ended up with: https://gist.github.com/Wowfunhappy/a214411bb52d39bcee9497363a97558d#file-kern_event-c-L112. I hope this is enough; including the other defines led to compile errors (which I didn't capture—sorry, I was trying to finish this before I had to leave for work).

In the time I've spent writing this comment, I actually may have already captured an averted panic! In the console I have:

3/4/22 2:39:26.000 PM kernel[0]: kqueue_scan_continue was called with a wait_result of THREAD_RESTART. A vanilla 2422 XNU kernel would have panicked!

But, bad news: that's the only log I have! So either the wait_queue_is_valid check is always returning true (either because I coded it wrong or due to something else), or the call order is something completely different!

Anyway, just wanted to share, I'll look into this more later, gotta go teach Scratch programming to a bunch of children... 💨

krackers commented 2 years ago

@Wowfunhappy

Your code seems fine to me, so whatever is causing THREAD_RESTART to be signaled must come from a different code path. Just to be sure we didn't miss anything, can you add some more logging at all of the

    if (!wait_queue_is_valid(wq))
        return (thread->wait_result = THREAD_RESTART);

lines in https://github.com/aosm/xnu/blob/os-x-1095/osfmk/kern/wait_queue.c#L1335?

I.e. like you did before

    if (!wait_queue_is_valid(wq)) {
        printf("Func name + error message");
        return (thread->wait_result = THREAD_RESTART);
    }

Wowfunhappy commented 2 years ago

Thanks for confirming the code looked fine! I've added that additional logging, see the new file in the gist: https://gist.github.com/Wowfunhappy/a214411bb52d39bcee9497363a97558d

Now we just need to wait for a log message to show up again...

Wowfunhappy commented 2 years ago

The panic-averted log appeared once again:

3/4/22 7:45:05.000 PM kernel[0]: kqueue_scan_continue was called with a wait_result of THREAD_RESTART. A vanilla 2422 XNU kernel would have panicked!

Once again, it was not accompanied by any of the other messages I added.

krackers commented 2 years ago

Hm well there goes that theory. Seems like it's not directly caused by a kqueue event then, it just happens when it's in the kqueue wait loop.

The 3 other places where THREAD_RESTART seems to be signalled are in vm_fault.c, sync_sema.c, and ipc_mqueue.c.

Let's try the last one: can you add a printf before the wait_queue_wakeup64_all_locked calls in osfmk/ipc/ipc_mqueue.c? I don't know how noisy this will be though.

We can also try to determine this from the other direction: are there any plausible candidate commits that could have caused these panics to happen more frequently? Anything related to Mach IPC or the message pump would be a decent candidate.

krackers commented 2 years ago

Worst case I guess you could binary search for candidate commits, try 2 builds a day and you'd probably arrive at the change within a week or two.

Wowfunhappy commented 2 years ago

Let's try the last one: can you add a printf before the wait_queue_wakeup64_all_locked calls in osfmk/ipc/ipc_mqueue.c? I don't know how noisy this will be though.

Sure, done, but they're very noisy, running many times per second. I'll give this a try but I'm not sure how much useful information it will give us. And since logs roll off after 4,000 messages I'm not sure I'll be able to catch the critical THREAD_RESTART one...

krackers commented 2 years ago

Yeah that's what I feared. Ok I think we have 2 options going forward then

1) Try to find the commit to Chromium that caused the increased rate of kernel panics. I tried poking around for changes related to IPC, but didn't find any smoking gun. Bisecting our way to the commit is probably easiest.

2) Other option is to patch this on kernel side via the kext approach mentioned before. I think we can just take a shortcut here and change the default condition to just return EBADF instead of panicking (as opposed to the "right" approach of inserting a new jump for a new case condition).
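To sketch the behavioral change that shortcut would make (in user-space C++, with XNU's constants stubbed out; the real kqueue_scan_continue also runs kqueue_process() and uthread bookkeeping, all omitted here, and the function name is mine):

```cpp
#include <cstdio>

// Stand-ins for XNU definitions, matching xnu-2422's numeric values.
enum wait_result_t {
    THREAD_AWAKENED    = 0,
    THREAD_TIMED_OUT   = 1,
    THREAD_INTERRUPTED = 2,
    THREAD_RESTART     = 3,
};
constexpr int kOk = 0, kEWouldBlock = 35, kEIntr = 4, kEBadF = 9;

// Mirrors the shape of kqueue_scan_continue's switch. A vanilla 2422 kernel
// panics in the default case; the patched version returns EBADF instead.
int WaitResultToErrnoPatched(wait_result_t wr) {
    switch (wr) {
    case THREAD_AWAKENED:    return kOk;           // real code processes events here
    case THREAD_TIMED_OUT:   return kEWouldBlock;
    case THREAD_INTERRUPTED: return kEIntr;
    default:
        // Vanilla: panic("kqueue_scan_continue: - invalid wait_result (%d)", wr);
        fprintf(stderr, "invalid wait_result (%d), returning EBADF\n", wr);
        return kEBadF;
    }
}
```

The binary patch would just redirect the default case's panic call to the error-return epilogue, which is why it's so much less work than inserting a proper new case via a trampoline.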

Wowfunhappy commented 2 years ago

Try to find the commit to Chromium that caused increased rate of kpanic. I tried poking around for changes related to ipc, but didn't find any smoking gun. Bisecting the way to the commit is probably easiest then.

Thing is, the frequency is still erratic. Case in point, I haven't received that log message again since Friday. I don't actually know that the change was in Chromium as opposed to something in my usage pattern...

Wowfunhappy commented 2 years ago

I found a thing. The more I think about it, the less I'm convinced it's related, but maybe @krackers can take a peek and decide if I may be onto something. :)

Google was running into kernel panics of their own back in 2017. Their kernel backtrace was different from ours, but it touches some of the same functions: https://bugs.chromium.org/p/chromium/issues/detail?id=756102

Google fixed it here: https://chromium.googlesource.com/chromium/src.git/+/5cd8eef498e28e1aacb9109645c001bbbfe1bc26%5E%21/#F0

   // On macOS 10.11+, using Mach port sets may cause system instability, per
   // https://crbug.com/756102. On macOS 10.12+, a kqueue can be used
   // instead to work around that. On macOS 10.9 and 10.10, kqueue only works
   // for port sets, so port sets are just used directly. On macOS 10.11,
   // libdispatch sources are used. Therefore, there are three different
   // primitives that can be used to implement WaitMany. Which one to use is
   // selected at run-time by OS version checks.

So, kPrimitive was set by Chromium via:

    const WaitManyPrimitive kPrimitive = mac::IsAtLeastOS10_12() ? KQUEUE : (mac::IsOS10_11() ? DISPATCH : PORT_SET);

However, Chromium Legacy uses:

    const WaitManyPrimitive kPrimitive = mac::IsAtLeastOS10_12() ? KQUEUE : DISPATCH;

I think this is wrong, isn't it? We're using the 10.11 codepath on 10.7–10.10, whereas Google explicitly built something separate for 10.9–10.10, although it's not clear why. (Since official Chrome doesn't support anything older than 10.11, it seems this code is now only used for iOS?)
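To make the difference concrete, here is the selection logic with the OS checks mocked as plain version numbers (in Chromium, `IsAtLeastOS10_12()`/`IsOS10_11()` are base::mac helpers; `ChoosePrimitive` is my stand-in for the kPrimitive initializer):

```cpp
enum WaitManyPrimitive { KQUEUE, DISPATCH, PORT_SET };

// Mocked stand-in for Chromium's base::mac version checks.
WaitManyPrimitive ChoosePrimitive(int major, int minor) {
    const bool at_least_10_12 = major > 10 || (major == 10 && minor >= 12);
    const bool is_10_11 = (major == 10 && minor == 11);
    // Upstream Chrome (crbug.com/756102): kqueue on 10.12+, libdispatch only
    // on 10.11, and raw Mach port sets on 10.10 and earlier.
    return at_least_10_12 ? KQUEUE : (is_10_11 ? DISPATCH : PORT_SET);
    // Chromium Legacy instead returns DISPATCH for everything below 10.12.
}
```

Under upstream's logic, 10.7 through 10.10 all land on PORT_SET, never DISPATCH.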

Unfortunately, it seems @blueboxd changed this early on to "fix instabilities on loading DevTool and plugins".

Assuming that's still needed... this whole Mac-specific WaitableEvent implementation is a relatively recent addition to Chromium, I wonder if we could just revert the whole thing and fall back to whatever cross-platform implementation Chromium was using before? Could be more maintainable anyway.

krackers commented 2 years ago

Hm it's an interesting find. The bug comments also suggest that watching videos (or doing something with GPU heavy activity) is more likely to trigger the bug.

Libdispatch does make use of kevent in its work loop from what I can see, so it's a decent guess. But I'm not really familiar with this part of osx internals so I cannot say for sure.

The comment is a bit hard to parse, but from the surrounding discussion, it's not clear why Google didn't use libdispatch on <= 10.9; I suppose they wanted the minimal fix possible, to avoid introducing any additional instability.

I'm not sure about the "fix instabilities on loading DevTool and plugins" thing. What kind of instability was seen, and how was it narrowed down to the WaitMany implementation?

The Mac-specific WaitableEvent implementation seems to have been added in v62, and Chrome up to v68 officially supported 10.9, so we may not need to revert it entirely. (Note it's possible the panic was simply never caught, but I don't remember ever running into any kernel panics when I was still using those versions.)

Wowfunhappy commented 2 years ago

note it's possible that it was never caught, but I never remember running into any kpanics when I was still using that

Because Chrome v62–68 set kPrimitive = PORT_SET on 10.9/10.10, whereas Chromium Legacy does not. That was the basis of my theory, anyway—I'm still confident that if an official version of Chrome ever led to kernel panics on an officially-supported OS, we would know about it. Someone somewhere would have reported it to Google and they would have treated it as a serious issue.

The easiest and most obvious thing would be to match what Chrome does—except, Bluebox may have had a reason for changing it in the first place. Maybe related to Lion support? (Notably, we know the KP exists on Lion because @pjpreilly has seen it and runs Lion.)

I guess the only way to find out at this point is to try... I'll see if I can find some spare cpu time to recompile Chromium this week, since BlueBox seems to still be away (or has wisely chosen to stay out of this quagmire).

Wowfunhappy commented 2 years ago

The good news: I rebuilt Chromium with kPrimitive = PORT_SET on 10.7–10.10. The bad news: I still triggered the would-be kernel panic in this version of Chromium. Oh well.

But now that I have Chromium building again, maybe I can experiment with some other stuff... I neglected to add the MessagePumpKqueue logging @krackers suggested, for example.

Wowfunhappy commented 2 years ago

The next thing I want to try is wholesale getting rid of https://bugs.chromium.org/p/chromium/issues/detail?id=932175 and going back to the POSIX implementation. This change was made in 2019 after Mavericks support was dropped, and it clearly touches the kernel and kevent.

This is proving a bit challenging, as a lot of files were touched and things have been renamed, etc.

Wowfunhappy commented 2 years ago

Okay, actual progress! I just noticed something I should have noticed earlier!

The log message I added to XNU which indicates an averted kernel panic is always immediately followed, within a second, by a crash from a Chromium Helper process! These crashes do not appear to have any user-visible effects, but they do provide a backtrace. See attached!

So, I can now say pretty confidently that the bug is somewhere in MessagePumpKqueue.

Chromium Helper_2022-03-14-112231_Jonathans-Mac-Pro.crash.zip

krackers commented 2 years ago

Sorry for the late response; that's a very interesting find. It means the EBADF code path is not supposed to happen at all, since their code only checks for EINTR, hence the process crash. I don't know whether this means a Chromium helper process on a 10.11 machine would have crashed as well, or whether it's a more fundamental bug in 10.9 Mach IPC, and adding the EBADF case to the kernel source just works around the issue, shifting the symptom from a stop-the-world kernel panic to a process crash.

Either way, as you mentioned, it confirms that the kernel panic is triggered while one thread is waiting on kevent inside MessagePumpKqueue. It's still not clear what the other thread/process is doing that causes the EBADF to be signalled, though. I didn't see anything immediately obvious in the backtraces for the other threads in your dump, so it could also be one of the other Chromium helper processes? Then again, none of the log statements you added to the kernel got triggered, which means the EBADF is being signalled from vm_fault.c, sync_sema.c, or ipc_mqueue.c. But even if we narrow it down to one of those, I wouldn't know how to find the corresponding userspace Chromium code that results in that call, since all of those are pretty broad (I still bet it's ipc_mqueue, though).

I haven't looked too much into what it would take to revert the use of MessagePumpKqueue, but since all the platform-specific code is abstracted as a subclass of MessagePumpForIO, can we change that to the POSIX-generic MessagePumpLibevent? Although I see that some of the other Mac-specific parts have been switched to communicate using Mach ports, which means we'd need to go back and revert those too...

Wowfunhappy commented 2 years ago

It's still not clear though what's the other thread/process doing the action that causes the EBADF to be signalled though. I didn't see anything immediately obvious in the backtrace for the other threads in your dump, so it could also be one of the other chromium helper processes?

The other thing I noticed was WaitableEvent (and specifically WaitableEvent::TimedWait) being near the top in several other threads. I still feel like that's implicated somehow, even though switching kPrimitive didn't resolve the issue.

I haven't looked too much into what it would take to revert the use of MessagePumpKqueue but it seems like since all the platform-specific code is abstracted as a subclass of MessagePumpForIO, can we change that to the posix-generic messagepumplibevent ?

Yes, that's what I was trying to do last week before I noticed the process crash! The original set of changes was documented here. I had a bit of trouble setting it back to the POSIX version, as the changes touched a number of different files which have themselves evolved since 2019, but I'd like to try again at some point.

blueboxd commented 2 years ago

Thank you for the deep inspection of the panic 🙇

I traced this panic with a kernel debugger before, but I couldn't find the root cause (there was nothing special about the parameters of the kevent64 preceding the panic). Maybe it is not due to a single kevent64, but rather a combination of other events in the queue or running threads...?

Wowfunhappy commented 2 years ago

@blueboxd Hey there! As a quick catch up, I recompiled XNU so that whenever kqueue_scan_continue is called with a wait_result value of THREAD_RESTART, XNU will log a warning and return EBADF instead of panicking.

This resolves the kernel panic, and causes Chromium to merely crash instead. I attached a backtrace here. It points to Chromium's MessagePumpKqueue as the culprit.

I suspect that if we can fix that crash, the kernel panics will go away!

blueboxd commented 2 years ago

@Wowfunhappy Thank you for the catch-up!

The stack trace from the kernel debugger also indicates that the kevent64 in MessagePumpKqueue::DoInternalWork caused the panic. But if the parameters for kevent64 are not the root cause, how can we resolve the panics/crashes...? (Simply rewriting DoInternalWork may not resolve the issue :thinking:) I'll try to modify the kernel for debugging too.

@aeiouaeiouaeiouaeiouaeiouaeiou Chromium crashes are not the topic of this issue (the bug here induces a kernel panic in a normal environment). Please open a new issue with the crash log.

krackers commented 2 years ago

It might be worth trying to figure out where the EBADF is signaled from in the kernel (since we've ruled out the origins in the kevent code, that leaves only vm_fault.c, sync_sema.c, or ipc_mqueue.c). As you mentioned, it's likely some race condition between different threads that's causing EBADF to be sent to the thread while it's inside the kernel wait loop.

Wowfunhappy commented 2 years ago

But, if parameters for kevent64 are not the root cause, how we can resolve panics/crashes...?

One thing we could try is using MessagePumpLibevent instead of MessagePumpKqueue. I think you'd just need to revert the set of changes described here: https://bugs.chromium.org/p/chromium/issues/detail?id=932175. The changes were a bit too complex for me, but you might have more success.