darlinghq / darling

Darwin/macOS emulation layer for Linux
http://www.darlinghq.org
GNU General Public License v3.0

Placing a Mach port into multiple port sets doesn't work #368

Closed LubosD closed 4 years ago

LubosD commented 6 years ago

It can be placed there, but then only one of these portsets will get woken up upon port activity.

LubosD commented 6 years ago

Test app that triggers the bug: multipset-example.tar.gz

bugaevc commented 6 years ago

A backstory for people who follow along: we traced this all the way through the stack from buttons not handling mouse clicks properly.

A control like NSButton has basically two ways of handling event sequences like "mouse down, mouse move, mouse up". The first is to remember that mouseDown happened and return to the event loop, like this:

- (void) mouseDown: (NSEvent *) event {
    _mousePressed = YES;
    [self setNeedsDisplay: YES];
}

- (void) mouseUp: (NSEvent *) event {
    if (_mousePressed) {
        // handle the click
        [self sendAction: [self action] to: [self target]];
    }
    _mousePressed = NO;
    [self setNeedsDisplay: YES];
}

(This is a lot like what most async web frameworks do, except they come with powerful "futures" abstractions and async/await syntactic sugar that make keeping state across callbacks simpler.) Alternatively, it can run a nested run loop inside mouseDown:, like this:

- (void) mouseDown: (NSEvent *) event {
    _mousePressed = YES;
    [self setNeedsDisplay: YES];
    event = [[self window] nextEventMatchingMask: NSLeftMouseUpMask];
    // handle the click
    [self sendAction: [self action] to: [self target]];
    _mousePressed = NO;
    [self setNeedsDisplay: YES];
}

See this doc for more details.

By default, NSControl uses the second approach, running a nested run loop. It uses both -[NSCell trackMouse:inRect:ofView:untilMouseUp:] (which wraps -[NSWindow nextEventMatchingMask:]) and -[NSWindow nextEventMatchingMask:] itself.

Next, -[NSWindow nextEventMatchingMask:] ends up running [NSRunLoop currentRunLoop] in NSEventTrackingRunLoopMode. NSRunLoop is a thin wrapper over CFRunLoop. Each run loop mode corresponds to a Mach port set that the loop listens on when it runs in that mode.

The bug we've been seeing is that clicking a button would hang the process forever; nextEventMatchingMask: never returned. What happened was that the run loop was never woken up when the X11 socket became readable. The way that is supposed to work is that there is a separate thread, __CFSocketManagerThread, which gets spawned the first time you add a CFSocket to a run loop. This thread select()s on the Unix fds that run loops should listen on, and as soon as one becomes ready, it sends a Mach message to that run loop's wakeup port (then the loop wakes up and services that CFSocket).
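To make the wake-up path concrete, here is a minimal C sketch of one iteration of such a manager loop. It is not CoreFoundation's code: the function name `forward_readiness` is hypothetical, and a pipe write stands in for the Mach message sent to the wakeup port.

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/* One iteration of a hypothetical socket-manager loop (illustration only).
 * Waits up to timeout_ms for watched_fd to become readable. If it does,
 * writes one byte to wake_fd, which stands in for the run loop's wakeup port.
 * Returns 1 if a wakeup was sent, 0 on timeout, -1 on error. */
static int forward_readiness(int watched_fd, int wake_fd, int timeout_ms)
{
    assert(watched_fd >= 0 && watched_fd < FD_SETSIZE);

    fd_set readable;
    FD_ZERO(&readable);
    FD_SET(watched_fd, &readable);

    struct timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    int n = select(watched_fd + 1, &readable, NULL, NULL, &tv);
    if (n < 0)
        return -1;
    if (n == 0 || !FD_ISSET(watched_fd, &readable))
        return 0;

    /* The real thread would send a Mach message here; the pipe write is an
     * assumption made so the sketch runs on any POSIX system. */
    const char token = 'w';
    return write(wake_fd, &token, 1) == 1 ? 1 : -1;
}
```

If the wake-up never reaches the thread blocked on the other end, that thread sleeps forever, which matches the hang described above.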

So whenever a new run loop mode is initialized, the wakeup port of that run loop (a loop only has one wakeup port) gets inserted into the new mode's port set; if there are multiple modes, the same wakeup port gets inserted into each mode's port set.

There aren't a lot of docs that mention inserting a port into multiple sets, and ones that exist contradict each other (& at times themselves). Some say that:

a port can only belong to one portset at once

others:

If the receive right is already a member of another port set, that relationship is unaffected by this operation. A receive right can be in multiple port sets simultaneously.

Currently on Darling, inserting a port into multiple port sets succeeds (with KERN_SUCCESS), but it doesn't work: a thread listening for messages on a port set that the port was inserted into, other than the first such set, doesn't get woken up when a message is sent to the port. Presumably, this is because the XNU code does support having a port in multiple port sets, but our Linux duct tape code doesn't account for a port being a member of several port sets, and so for there being several threads that it needs to wake up.
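The difference between the observed and the expected behaviour can be shown with a toy model in plain C. This is not XNU or Darling code; every type and function name below is hypothetical, and a counter stands in for waking the thread that waits on a port set.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SETS 4

/* Toy port set: counts how often its (imaginary) waiter was woken. */
struct port_set { int wakeups; };

/* Toy port that remembers every set it was inserted into. */
struct port {
    struct port_set *sets[MAX_SETS];
    size_t nsets;
};

/* Adds the port to one more set; returns 0 on success, -1 if full. */
static int port_insert_member(struct port *p, struct port_set *s)
{
    assert(p != NULL && s != NULL);
    if (p->nsets == MAX_SETS)
        return -1;
    p->sets[p->nsets++] = s;
    return 0;
}

/* Models the symptom in this issue: only the first set is notified. */
static void post_first_only(struct port *p)
{
    if (p->nsets > 0)
        p->sets[0]->wakeups++;
}

/* Models the expected behaviour: every containing set is notified. */
static void post_all(struct port *p)
{
    for (size_t i = 0; i < p->nsets; i++)
        p->sets[i]->wakeups++;
}
```

With a wakeup port inserted into a "default" set and then a "tracking" set, `post_first_only` leaves the tracking set's counter at zero, while `post_all` increments both.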

So for the CFRunLoop & X11 socket case, this means that waking up the run loop when the socket becomes readable only works if the run loop is running in NSDefaultRunLoopMode (aka kCFRunLoopDefaultMode). That is why the events were never delivered to the app, and in turn why the button wouldn't respond.

As a workaround until the kernel bug is fixed, I build AppKit with NSEventTrackingRunLoopMode and the other run loop mode names changed to equal NSDefaultRunLoopMode. It's not enough for them to be CFSTR("kCFRunLoopDefaultMode"); they really need to point to that same CFString object. This also causes a headache in Swift, where strings are implicitly bridged to Swift's native String type, so it's harder to keep object identity where needed, but that's another story.
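The identity requirement is easy to miss, so here is a small C illustration of the distinction. It uses plain C strings rather than CFString, and both helper names are hypothetical; it only shows why equal contents are not enough when a lookup is keyed on the object itself.

```c
#include <assert.h>
#include <string.h>

/* Identity comparison: true only if both names are the same object. */
static int same_mode_by_identity(const char *a, const char *b)
{
    return a == b;
}

/* Content comparison: true if the names spell the same string. */
static int same_mode_by_content(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

Two separate arrays that both hold "kCFRunLoopDefaultMode" compare equal by content but not by identity; only an alias of the original object passes both checks, which is the property the workaround depends on.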

To summarize, that's an interesting and rare case where a bug in the LKM manifested as a problem with the UI, having to do with Mach ports, Unix sockets, X11, threads and event loops. Wow.

LubosD commented 4 years ago

The reason is how "hacked" ipc_mqueue_post is in Darling.

I think it may be easier to finish the xnu-upgrade branch work where the intention is NOT to have the whole waiting system modified. (The XNU waiting system is coincidentally also completely overhauled in that branch.)

LubosD commented 4 years ago

The test program now works in the branch for issue #275.

LubosD commented 4 years ago

I consider the bug resolved - in the vchroot branch. Will be merged into master as soon as we make sure there are no major regressions.