Closed LubosD closed 4 years ago
Test app that triggers the bug: multipset-example.tar.gz
A backstory for people who follow along: we traced this all the way through the stack from buttons not handling mouse clicks properly.
A control like NSButton has basically two ways of handling event sequences like "mouse down, mouse move, mouse up": either by remembering the fact of mouseDown and returning to the event loop, like this:
- (void) mouseDown: (NSEvent *) event {
    _mousePressed = YES;
    [self setNeedsDisplay: YES];
}
- (void) mouseUp: (NSEvent *) event {
    if (_mousePressed) {
        // handle the click
        [self sendAction: [self action] to: [self target]];
    }
    _mousePressed = NO;
    [self setNeedsDisplay: YES];
}
(This is a lot like what most async web frameworks do, except they come with powerful "futures" abstractions and async/await syntax sugar to make keeping the state across callbacks simpler.)
Alternatively, it can run a nested run loop inside mouseDown:, like this:
- (void) mouseDown: (NSEvent *) event {
    _mousePressed = YES;
    [self setNeedsDisplay: YES];
    // block in a nested run loop until the mouse button is released
    event = [[self window] nextEventMatchingMask: NSLeftMouseUpMask];
    // handle the click
    [self sendAction: [self action] to: [self target]];
    _mousePressed = NO;
    [self setNeedsDisplay: YES];
}
See this doc for more details.
By default, NSControl uses the second approach, running a nested run loop. It uses both -[NSCell trackMouse:inRect:ofView:untilMouseUp:] (which wraps -[NSWindow nextEventMatchingMask:]) and -[NSWindow nextEventMatchingMask:] itself.
Next, -[NSWindow nextEventMatchingMask:] ends up running [NSRunLoop currentRunLoop] in NSEventTrackingRunLoopMode. NSRunLoop is a thin wrapper over CFRunLoop. Each run loop mode corresponds to a Mach port set that the loop listens on when run in this mode.
The bug we've been seeing is that clicking a button would hang the process forever; nextEventMatchingMask: never returned. What happened was that the run loop was never woken up by the X11 socket becoming readable. The way that is supposed to work is that there is a separate thread, __CFSocketManagerThread, that gets spawned the first time you add a CFSocket to a run loop; this thread select()s on the Unix fds that run loops should listen on, and as soon as one becomes ready, the socket manager thread sends a Mach message to that run loop's wakeup port (then the loop wakes up and services that CFSocket).
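For readers following along, here is a rough sketch of one iteration of such a manager loop. It is not CoreFoundation's actual code: the function name manager_wait_and_wake is made up for illustration, and a plain pipe stands in for the run loop's Mach wakeup port so that the sketch builds on any POSIX system.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * One iteration of a hypothetical socket-manager loop (illustration only).
 * Blocks in select() until watched_fd becomes readable, then writes one
 * byte to wakeup_fd. The pipe behind wakeup_fd is an assumed stand-in for
 * the run loop's Mach wakeup port.
 * Returns 0 on success, -1 on error.
 */
static int manager_wait_and_wake(int watched_fd, int wakeup_fd) {
    /* select() can only watch descriptors below FD_SETSIZE. */
    assert(watched_fd >= 0 && watched_fd < FD_SETSIZE);

    fd_set readable;
    FD_ZERO(&readable);
    FD_SET(watched_fd, &readable);

    /* No timeout: sleep until the descriptor has data (or an error occurs). */
    if (select(watched_fd + 1, &readable, NULL, NULL, NULL) < 0)
        return -1;
    if (!FD_ISSET(watched_fd, &readable))
        return -1;

    /* "Send the wakeup message": here just one byte down the pipe. */
    const char token = 1;
    return write(wakeup_fd, &token, 1) == 1 ? 0 : -1;
}
```

In this model the run loop would be the reader on the other end of the wakeup pipe; if the wakeup never reaches the waiter, the loop sleeps forever, which is the hang described above.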
So whenever a new run loop mode is initialized, the wakeup port of that run loop (a loop only has one wakeup port) would get inserted into the new mode's port set -- and if there are multiple modes, the same wakeup port would get inserted into each mode's port set.
There aren't a lot of docs that mention inserting a port into multiple sets, and the ones that exist contradict each other (and at times themselves). Some say that:
a port can only belong to one portset at once
others:
If the receive right is already a member of another port set, that relationship is unaffected by this operation. A receive right can be in multiple port sets simultaneously.
Currently on Darling, inserting a port into multiple port sets succeeds (with KERN_SUCCESS), but it doesn't work: a thread listening for messages on a port set (one that the port was inserted into, but not the first of those) doesn't get woken up when a message is sent to the port. Presumably, this is because the XNU code does support having a port in multiple port sets, but our Linux duct tape code doesn't account for the port being a member of several port sets, and so for there being multiple threads that it may need to wake up.
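The difference between the two behaviours can be shown with a toy model. Everything below (toy_port, toy_pset, and the toy_* functions) is invented for illustration and does not mirror the real XNU or Darling data structures; it only contrasts "notify the first set" with "notify every set the port belongs to".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_SETS 4

/* Toy stand-in for a port set: records whether its waiter was woken. */
typedef struct {
    bool woken;
} toy_pset;

/* Toy stand-in for a port that remembers every set it was inserted into. */
typedef struct {
    toy_pset *sets[TOY_MAX_SETS];
    size_t nsets;
} toy_port;

/* Insert the port into one more set; existing memberships are kept. */
static bool toy_insert_member(toy_port *port, toy_pset *set) {
    assert(port != NULL && set != NULL);
    if (port->nsets == TOY_MAX_SETS)
        return false;
    port->sets[port->nsets++] = set;
    return true;
}

/* Behaviour described in this thread: only the first set is notified. */
static void toy_post_first_only(toy_port *port) {
    if (port->nsets > 0)
        port->sets[0]->woken = true;
}

/* Expected behaviour: every set containing the port is notified. */
static void toy_post_all(toy_port *port) {
    for (size_t i = 0; i < port->nsets; i++)
        port->sets[i]->woken = true;
}
```

With one wakeup port inserted into a "default" set and then a "tracking" set, toy_post_first_only leaves the tracking set asleep, while toy_post_all wakes both.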
So for the CFRunLoop & X11 socket case it means that waking up the run loop on the socket becoming readable only works if the run loop runs in the NSDefaultRunLoopMode (aka kCFRunLoopDefaultMode), which is why the events were never delivered to the app, which is why the button wouldn't respond.
As a workaround until the kernel bug is fixed, I build AppKit with NSEventTrackingRunLoopMode and the other run loop mode names changed to equal NSDefaultRunLoopMode (it's not enough for them to be CFSTR("kCFRunLoopDefaultMode"), they really need to point to that same CFString object; this also causes headaches in Swift, where strings are implicitly bridged to Swift's native String type, so it's harder to keep object identity where needed, but that's another story).
To summarize, that's an interesting and rare case where a bug in the LKM manifested as a problem with the UI, having to do with Mach ports, Unix sockets, X11, threads and event loops. Wow.
The reason is how "hacked" ipc_mqueue_post is in Darling.
I think it may be easier to finish the xnu-upgrade branch work where the intention is NOT to have the whole waiting system modified. (The XNU waiting system is coincidentally also completely overhauled in that branch.)
The test program now works in the branch for issue #275.
I consider the bug resolved - in the vchroot branch. Will be merged into master as soon as we make sure there are no major regressions.
It can be placed there, but then only one of these portsets will get woken up upon port activity.