NeuronRobotics / nrjavaserial

A Java Serial Port system. This is a fork of the RXTX project that uses in jar loading of the native code.
Other
345 stars 143 forks source link

In case of HARDWARE_ERROR event serial.disconnect() never returns #197

Open mvalla opened 3 years ago

mvalla commented 3 years ago

Hallo, I am using 5.2.1 on Win 10 x64 and java 8 (1.8.0_172) When I run: https://github.com/NeuronRobotics/nrjavaserial/blob/master/test/src/test/ReadTest.java in case of HARDWARE_ERROR event, serial.disconnect() never returns. In fact if add:

System.out.println("Clean exit of hardware");
serial.disconnect();
System.out.println("after disconnect");

"after disconnect" is never printed. Thread [RXTXPortMonitor(//./COM3)] is still running, and if I suspend it I get this stack:

RXTXPort.interruptEventLoop() line: not available [native method]   
RXTXPort.removeEventListener() line: 867    
RXTXPort.close() line: 1091 
NRSerialPort.disconnect() line: 153 
MySerialDisconnectTest.lambda$0(DataInputStream, NRSerialPort, SerialPortEvent) line: 42    
1740189450.serialEvent(SerialPortEvent) line: not available 
RXTXPort.sendEvent(int, boolean) line: 787  
RXTXPort$MonitorThread.run() line: 1628 
MrDOS commented 3 years ago

Interesting. A quick glance at RXTXPort.interruptEventLoop() doesn't reveal anything that should get “stuck”. The linked-list search is the obvious suspect at a glance, but it's not possible to open a port without populating that list, and nothing every removes old entries from that list, so that shouldn't be it. But the only code after that is Windows-specific, none of which looks like it should block. So it must be getting hung up in that search loop.

How are you reproducing the HARDWARE_ERROR event? What serial port device are you using, a USB-to-serial dongle? Can you tell us what chipset it's using, or if you don't know, can you provide the exact part number of the device, or a link to a webpage for it?

mvalla commented 3 years ago

I create the HARDWARE_ERROR by removing my dongle from the USB port. The dongle is a ZigBee dongle by Legrand/BTicino based on EM3582 chipset from Silicon Labs. Another hint, maybe related: in another test class I use with JUnit, after HARDWARE_ERROR and when the test ends these logs are printed on std out:

Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(2611): Access is denied.

which test can I do to help indentify the problem?

d5smith commented 3 years ago

I'm reproducing this constantly on macos (Catalina and BigSur) with Java 15 except it causes a SIGSEGV and crashes the JVM :(

It appears to be a timing issue on the destruction of RXTXPort and RXTXPort.MonitorThread. When the native layer detects the device is gone it throws java.io.IOException: Device not configured in nativeavailable If I break there it forestalls the crash

I modified the ReadTest to add a loop that just check available() on the input stream and I trigger the HARDWARE_ERROR by disconnecting the instrument ffrom USB.

Is there anything I can add to help troubleshoot?

MrDOS commented 3 years ago

Thanks for the data points. From your notes, I've written this reproduction test case. Unfortunately, I cannot actually reproduce the issue on macOS 10.14 (Mojave) with Java 11 or 15 and the only serial device I have near to hand, an FTDI FT2232D-based adapter. I'll try again on a Windows box, but given the spread of OSes and JVM versions already at play, I think the cause of the issue is probably more related to driver-specific timings. I can certainly try to identify a solution without having a local reproduction, but it'll be harder to nail down.

What hardware are you using for this? The hardware @mvalla mentioned seemed prohibitively expensive to get for a test case. I'm crossing my fingers that yours is something more common.

d5smith commented 3 years ago

Thanks for the reply.

I'm using RBR CTDs, ;) Trying to upgrade Ruskin to use the latest nrjavaserial but the JVM crashing on Macs on instrument disconnect is blocking that. I'm also able to consistently reproduce it with just one of these .

I'm able to reproduce the issue with your test case (which is simpler than mine) when I debug it in eclispe (with no breakpoints set). If I run it however it does not error.

My initial test case simply added a loop to ReadTest.java before the catch block so as to keep the thread alive while running the test. This fails consistantly and the offending frame is always: C [libNRJavaSerial.jnilib+0x4574] drain_loop+0x34

I'll see if I can attach a debugger to the native code and find out anything more.

mvalla commented 3 years ago

Just tested your DisconnectTest.java using Aeontec Z-Stick Gen5 ZW090-C, nrjavaserial 5.2.1 on Win 10 x64 and java 11.0.8. From command line I do not experience any JVM crash, but I get this logger lines and some system outs:

22:33:51.585 [INFO ] [main        ] [org.serial.test.DisconnectTest:19   ] - Opened port COM6 and beginning to read. Trigger a hardware error now.
22:33:57.300 [INFO ] [main        ] [org.serial.test.DisconnectTest:26   ] - Caught an exception
java.io.IOException: Result too large in nativeavailable
        at gnu.io.RXTXPort.nativeavailable(Native Method)
        at gnu.io.RXTXPort$SerialInputStream.available(RXTXPort.java:1626)
        at org.serial.test.DisconnectTest.main(DisconnectTest.java:23)
22:33:57.300 [INFO ] [main        ] [org.serial.test.DisconnectTest:29   ] - Going to disconnect the port...
22:33:57.500 [INFO ] [main        ] [org.serial.test.DisconnectTest:31   ] - Disconnected. If you see this, crash reproduction failed.
Error 0x3e3 at src/windows/termios.c(512): The I/O operation has been aborted bError 0x3e3 at src/windows/termios.c(2725): The I/O operation has been aborted Error 0x16 at src/windows/termios.c(512): The device does not recognize the comError 0x16 at src/windows/termios.c(2725): The device does not recognize the coError 0x16 at src/windows/termios.c(512): The device does not recognize the comError 0x16 at src/windows/termios.c(2725): The device does not recognize the coError 0x16 at src/windows/termios.c(512): The device does not recognize the comError 0x16 at src/windows/termios.c(512): The device does not recognize the comError 0x16 at src/windows/termios.c(2725): The device does not recognize the coError 0x16 at src/windows/termios.c(2725): The device does not recognize the coError 0x16 at src/windows/termios.c(512): The device does not recognize the comError 0x16 at src/windows/termios.c(512): The device does not recognize the comError 0x16 at src/windows/termios.c(2725): The device does not recognize the coError 0x16 at src/windows/termios.c(2725): The device does not recognize the coError 0x16 at src/windows/termios.c(512): The device does not recognize the comError 0x16 at src/windows/termios.c(512): The device does not recognize the comError 0x16 at src/windows/termios.c(2725): The device does not recognize the coError 0x16 at src/windows/termios.c(2725): The device does not recognize the coError 0x16 at src/windows/termios.c(512): The device does not recognize the comError 0x16 at src/windows/termios.c(2725): The device does not recognize the coError 0x16 at src/windows/termios.c(2611): The device does not recognize the co
d5smith commented 3 years ago

Here's the ReadTest variant I am using. Crashes every time on OSX

MrDOS commented 3 years ago

I'm using RBR CTDs, ;)

Ah, right – that David Smith! 🙂

Thanks for sharing your test code. I can reliably reproduce the SIGSEGV crash under both Java 8 and 15 on Mojave with that. Turns out my reproduction didn't catch the segfault because it bailed too soon: as soon as the hardware error was thrown, it exited. Your test loops on the thread being interrupted on line 64, but nothing ever interrupts it, so it continues to live on even after the serial port has gone away – and lives on long enough for the monitor thread to crash badly.

I've updated my test case to include some wait time after disconnect for the monitor thread to go down in flames. I've also retooled it to use the more conventional JCA-based API. I can reproduce the segfault with this on NRJavaSerial v5.1.0, but not the previous version, v5.0.2. The big change between those two versions was #172, which introduced the HARDWARE_ERROR event. This crash is likely due to an oversight in some of the post-disconnect cleanup.

Aeontec Z-Stick Gen5 ZW090-C

Well, that's cheaper (and easier to get) than the previous adapter at least, but I'd still like to see if I can get this fixed without buying hardware to reproduce. I'll hunt down the segfault first because I now have a good reproduction case for that, and hopefully getting that fixed will resolve the hang too. I'm getting less sure that the two symptoms are really caused by the same bug and not by two separate bugs that both happen to be triggered by hardware disconnects. Regardless, thanks for chiming in with reproduction notes so quickly.

mvalla commented 3 years ago

I've updated my test case

If I run the test case on windows no crash, but the following logs are returned.

Opened port COM3 and beginning to read. Trigger a hardware error now.
Caught an exception: Going to disconnect the port...
No error in nativeavailable
Disconnected. Now going to sleep for 2.0s to give the monitor thread a chance to SIGSEGV.
If you're seeing this, that means the monitor thread didn't segfault, and bug reproduction has failed.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(2611): Access is denied.

I can get this fixed without buying hardware to reproduce

You can count on me for testing on Windows and MacOS, with the 2 USB serial hw I have.

MrDOS commented 3 years ago

Thanks. I forgot to mention that I did try all three (well, now four) tests on Windows and couldn't reproduce a hang on disconnect() or JVM crash, but I did also get a lot of those “Access is denied” errors. They're annoying but, I think, unrelated.

To date, have you encountered the hang on disconnect() with any hardware other than the original Legrand Zigbee dongle?

mvalla commented 3 years ago

To date, have you encountered the hang on disconnect() with any hardware other than the original Legrand Zigbee dongle?

The hang on disconnect() happens in Eclipse with both the USB devices I have (Legrand and Aeotec). This is the code I use If I debug this code in eclipse, then remove the stick, it prints "Clean exit of hardware" but then line 43 is never reached.

MrDOS commented 3 years ago

Oooops – gotta be real careful with those command words! Thanks for reopening.

MrDOS commented 3 years ago

I think #211 might fix the JVM crash, but it probably won't have any bearing on the hang-on-close()/disconnect(). At this point, I'm pretty sure they're unrelated issues. Can you guys try the library from this build, nrjavaserial-fd5fadf?

mvalla commented 3 years ago

Just tried with both USB sticks. Actually with this fixed version upon stick removal I now do experience a JVM crash on Windows , with this log:

ERROR: JDWP Unable to get JNI 1.2 environment, jvm->GetEnv() return code = -2
JDWP exit error AGENT_ERROR_NO_JNI_ENV(183):  [util.c:840]
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
MrDOS commented 3 years ago

Fascinating. I'll try on my Windows box as soon as I get a chance, but can you confirm you get that every time? What if you just run the test without attaching a debugger? Not that I want to deflect, but it looks like it could be a very old JVM bug. util.c is certainly not part of our source at the very least; neither is JDWP.

mvalla commented 3 years ago

That was Java 8. But actually with your modified lib the JVM crash happens all times, with both USB sticks, with both Java versions (8 and 11) and with both Eclipse debugger / command line. This is the output using your modified lib, Java 11, from command line (the JVM crashed after removing the USB stick):

java -cp ".\target\classes;C:\Users\myuser\.m2\repository\org\slf4j\slf4j-log4j12\1.7.28\slf4j-log4j12-1.7.28.jar;C:\Users\myuser\.m2\repository\org\slf4j\slf4j-api\1.7.28\slf4j-api-1.7.28.jar;C:\Users\myuser\.m2\repository\log4j\log4j\1.2.17\log4j-1.2.17.jar;.\lib\nrjavaserial-5.2.1-alfa.jar" org.serial.test.MySerialDisconnectTest
Java version: 11.0.8
Availible port: COM3
Connected
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
claui commented 7 months ago

Interesting. A quick glance at RXTXPort.interruptEventLoop() doesn't reveal anything that should get “stuck”. The linked-list search is the obvious suspect at a glance, but it's not possible to open a port without populating that list, and nothing every removes old entries from that list, so that shouldn't be it.

If a monitor thread happens to catch some signal and dies without its event_loop getting to call finalize_event_struct, then the stale struct will remain in the linked list even though the stack frame or whole stack is freed. If the stack gets reused by a new monitor thread later, its stack frames might align exactly with those of the old thread. In that case, the pointer to the struct would end up in the list twice, causing a cycle while navigating the list.

MrDOS commented 7 months ago

Holy smokes, you're really peeling back the covers on a couple of really old bugs here. Thank you so much for your efforts!