NeuronRobotics / nrjavaserial

A Java Serial Port system. This is a fork of the RXTX project that uses in jar loading of the native code.
Other
345 stars 143 forks source link

Fix macOS segfault on close after hardware error #211

Open MrDOS opened 3 years ago

MrDOS commented 3 years ago

This fixes the JVM crash experienced on macOS after a hardware error occurs. The crash was introduced along with the core hardware-error-detection functionality in #172.

In order to generate the OUTPUT_BUFFER_EMPTY event, the monitor thread event loop periodically checks the status of the serial port output buffer, just as it checks for other state changes. On platforms which have it, and on Windows where it's emulated, it uses the TIOCSERGETLSR ioctl to get the information. On platforms which don't have that ioctl (macOS and FreeBSD <10), the monitor thread starts up another “drain thread” under the monitor thread which loops on tcdrain(3), and generates the event whenever that call unblocks. That separate drain thread is usually terminated by RXTXPort.interruptEventListener(). However, because the monitor thread event loop now internally terminates upon encountering a hardware error, the drain thread cleanup in that interruption method isn't being performed. The fix is to have the event loop call the interruption method when it detects a hardware failure. That ensures all of the normal monitor thread cleanup happens.

However, doing that revealed another, rare race condition where the application thread could encounter the port failure and invoke RXTXPort.close() before the monitor thread began its shutdown (e.g., if the application is sitting on a tight loop around available(), as in the case of my DisconnectTest test case). This manifested in another segfault. The obvious solution was some locking around the internal state of the monitor thread. This already kind of existed in the form of the RXTXPort.MonitorThreadLock boolean and the RXTXPort.waitForTheNativeCodeSilly() method which waited for the flag to become unset in a 5-second sleep loop. I took this as an opportunity to replace that nonsense with a real lock. I chose a read/write lock rather than just synchronized blocks around an object because it more closely matches the semantics of the old behaviour: all of the I/O functions checked that the flag was clear, which is akin to sharing a read lock, and all of the notification reconfiguration methods set/cleared the flag, which is akin to an exclusive write lock.

This should fix the issue @d5smith raised in #197. It probably doesn't fix the original hang-on-close()/disconnect() issue @mvalla is facing, unless the new locking has accidentally cleaned up another concurrency issue.

d5smith commented 3 years ago

I can confirm this fixes the behaviour i was seeing on macOS. Thanks!

MrDOS commented 3 years ago

Great! I have not yet tried this build on any platform other than macOS, nor have I tested behaviour other than failures (have not even run any data through it), so I want to do some more QC before merging. But thank you for trying it so quickly.

claui commented 7 months ago

@MrDOS According to preliminary tests, this appears to be fixing the crashes in interruptEventLoop on our x86_64 Linux boxes.

If combined with PR #249, all the crashes we’ve been getting at disconnect time no longer occur.