NeuronRobotics / nrjavaserial

A Java Serial Port system. This is a fork of the RXTX project that uses in jar loading of the native code.
Other
345 stars 143 forks source link

RXTXPort init hangs #113

Closed jaghatei closed 4 years ago

jaghatei commented 6 years ago

I have a situation in which RXTXPort.close() hangs and does not return. The situation occurs with an openHAB binding (https://github.com/openhab/openhab2-addons/pull/2167). The stack trace that I get when force unloading the binding after getting hung points to an invalidated file descriptor. I have no specific hint what exactly invalidates the fd. But with fd beeing zero the close() will permanently stuck due to first trying to set a lock and then check for fd <=0 for return. I would suggest to correct this order from

                       while( !IOLockedMutex.writeLock().tryLock(500, TimeUnit.MILLISECONDS) )
                        {
                                if( debug )
                                        z.reportln("IO is locked " + IOLockedMutex.getReadLockCount());
                        }

                        if ( fd <= 0 )
                        {
                                z.reportln(  "RXTXPort:close detected bad File Descriptor" );
                                return;
                        }

to

                        if ( fd <= 0 )
                        {
                                z.reportln(  "RXTXPort:close detected bad File Descriptor" );
                                return;
                        }
                       while( !IOLockedMutex.writeLock().tryLock(500, TimeUnit.MILLISECONDS) )
                        {
                                if( debug )
                                        z.reportln("IO is locked " + IOLockedMutex.getReadLockCount());
                        }
jaghatei commented 6 years ago

Issue is different than first assumed. Not port close is the problem, but port open.

waitForTheNativeCodeSilly does not return as MonitorThread seems not to run properly.

"ESH-thingHandler-3" #209 daemon prio=5 os_prio=0 tid=0x0835a000 nid=0x2ded sleeping[0x9fc9d000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at gnu.io.RXTXPort.waitForTheNativeCodeSilly(RXTXPort.java:903)
    at gnu.io.RXTXPort.<init>(RXTXPort.java:132)
    at gnu.io.RXTXCommDriver.getCommPort(RXTXCommDriver.java:925)
    at gnu.io.CommPortIdentifier.open(CommPortIdentifier.java:465)
    at gnu.io.factory.RxTxPortCreator.createPort(RxTxPortCreator.java:50)
    at gnu.io.factory.RxTxPortCreator.createPort(RxTxPortCreator.java:1)
    at gnu.io.factory.DefaultSerialPortFactory.createSerialPort(DefaultSerialPortFactory.java:53)
    at org.openhab.binding.meterreader.internal.sml.SmlSerialConnector.openConnection(SmlSerialConnector.java:102)
    at org.openhab.binding.meterreader.connectors.ConnectorBase.getMeterValues(ConnectorBase.java:54)
    at org.openhab.binding.meterreader.internal.MeterDevice.lambda$1(MeterDevice.java:137)
    at org.openhab.binding.meterreader.internal.MeterDevice$$Lambda$160/3227531.run(Unknown Source)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Does anyone know how to trace / analyze this issue? The time it happens is not deterministic. Sometimes it happens a few minutes after program start, sometimes it needs days to happen.

jaghatei commented 6 years ago

seems the issue is based on two facts: 1) the lock file handling is not completely cleaning no longer used file handles. Per port open / close one of these entries is added as per lsof (with different file name each): java 3044 openhab2 492u REG 0,16 0 1560565 /run/lock/tmpnx8W16 (deleted) 2) once the limit of 1024 is reached no more files are opened by java due to user limit set to 1024. There is no Exception thrown for this fact. Port init / open is not returning at all.

jaghatei commented 6 years ago

Seems nrjavaserial has a bug in SerialImp.c line 691 and 702:

int cmd=FD_CLOEXEC|F_SETFL;
ret= fcntl(fd,cmd,O_RDWR|O_NONBLOCK);

does not do what is intended (not setting CLOEXEC flag nor RDWR nor NONBLOCK) FD_CLOEXEC|F_SETFL = 4 | 1 = 5 = F_GETOWN as command

you either have to:

ret= fcntl(fd,F_SETFD,FD_CLOEXEC);
ret= fcntl(fd,F_SETFL,O_RDWR|O_NONBLOCK);

or as atomic: ret= fcntl(fd,F_SETFL,O_CLOEXEC|O_RDWR|O_NONBLOCK); I applied the last one and it solves my issue.

bib2 commented 5 years ago

I think i have the same problem:

I use sml meterreader binding in openhab 2.4 snapshot #1332. This uses nrjavaserial 3.14. After few minutes, meterreader binding stops getting updates from my power meter. After openhab-service restart it works again for a few minutes.

It worked without problems until oh2.4 snapshot update from a month ago. The time before i was on snapshot from may 2018 and there it worked, after updating end of july and auguts 2018 it stoped working.

Here you can see further information and an output of jstack from openhab2: https://community.openhab.org/t/sml-reader-how-to-integrate/20206/134

eolivelli commented 5 years ago

I have the same problem. Do you have a fix ? I am seeing this kind of log initialise_event_info_struct: initialise failed!

then my thread is stuck in

 java.lang.Thread.State: TIMED_WAITING (sleeping)
 at java.lang.Thread.sleep(java.base@11.0.3/Native Method)
 at gnu.io.RXTXPort.waitForTheNativeCodeSilly(RXTXPort.java:903)
 at gnu.io.RXTXPort.<init>(RXTXPort.java:132)
 at gnu.io.RXTXCommDriver.getCommPort(RXTXCommDriver.java:968)
 at gnu.io.CommPortIdentifier.open(CommPortIdentifier.java:465)

I am using latest and greatest nrjavaserial-3.15.0.jar on Linux

uname -a
Linux xx.xx.xx 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
madhephaestus commented 4 years ago

we have an updated jar, 3.19.0, please check and let me know if this issue is fixed?

jaghatei commented 4 years ago

Need to setup a test environment first to confirm. Will need some time. But as the code lines mentioned above are still buggy in current code I assume the problem is still existing.

madhephaestus commented 4 years ago

@jaghatei we have changed a lot in the way the lock files are handled, so this might be fixed upstream of the issue are having.

jaghatei commented 4 years ago

Sadly not. I simply took the test prog from you repostory and the current 3.20.1 version of the jar on Linux ds218plus 4.4.59+ #24922 SMP PREEMPT Mon Aug 19 12:13:37 CST 2019 x86_64 GNU/Linux synology_apollolake_218+

java -cp ".:/var/services/homes/admin/nrjavaserial-3.12.0.1.jar" test/NRJavaSerialTest

producing more and more open file handles per try to open /dev/ttyUSB0: java 674 root 41u REG 0,17 0 1145120 /run/lock/tmppfBaDZ (deleted) java 674 root 42u REG 0,17 0 1145122 /run/lock/tmpxWeAZr (deleted) java 674 root 43u REG 0,17 0 1145124 /run/lock/tmp9hP0lU (deleted) java 674 root 44u REG 0,17 0 1145392 /run/lock/tmpDPOGEx (deleted) java 674 root 45u REG 0,17 0 1145394 /run/lock/tmpVFqsXa (deleted) java 674 root 46u REG 0,17 0 1145395 /run/lock/tmpjhDEgO (deleted) java 674 root 47u REG 0,17 0 1145397 /run/lock/tmprVRkBr (deleted) java 674 root 48u REG 0,17 0 1145399 /run/lock/tmp9Tm2V4 (deleted) java 674 root 49u REG 0,17 0 1145401 /run/lock/tmpvcPKgI (deleted) java 674 root 50u REG 0,17 0 1145781 /run/lock/tmpf5v7ww (deleted) java 674 root 51u REG 0,17 0 1145783 /run/lock/tmpnjgxNk (deleted) java 674 root 52u REG 0,17 0 1145784 /run/lock/tmpBT0338 (deleted) java 674 root 53u REG 0,17 0 1145786 /run/lock/tmppmg2lX (deleted) java 674 root 54u REG 0,17 0 1145788 /run/lock/tmpV8A1DL (deleted) java 674 root 55u REG 0,17 0 1145790 /run/lock/tmphMP1Vz (deleted)

so a prog using the lib will stall again once ulimit is reached.

jaghatei commented 4 years ago

forget the remark, wrong test, need to repeat

jaghatei commented 4 years ago

same behaviour with correct libs (3.20.1 and 4.2), so error is still in: java 7396 root 5r REG 0,34 755396 713664 /volume1/homes/admin/nrjavaserial-3.20.1.jar java 7396 root 6u REG 0,17 0 1273482 /run/lock/tmpoJ2QX7 (deleted) java 7396 root 7u REG 0,17 0 1273484 /run/lock/tmp85Pqi8 (deleted) ... java 7396 root 96u REG 0,17 0 1279301 /run/lock/tmpyRaoEC (deleted) java 7396 root 97u REG 0,17 0 1279303 /run/lock/tmpcyEMLK (deleted)

java 8056 root 5r REG 0,34 755383 713662 /volume1/homes/admin/nrjavaserial-4.2.0.jar java 8056 root 6u REG 0,17 0 1282628 /run/lock/tmpH1DPy3 (deleted) java 8056 root 7u REG 0,17 0 1282630 /run/lock/tmpXHKt4h (deleted) ... java 8056 root 66u REG 0,17 0 1289583 /run/lock/tmpxZhiPz (deleted) java 8056 root 67u REG 0,17 0 1289585 /run/lock/tmprIdRh3 (deleted)

madhephaestus commented 4 years ago

take a look at the new 3.21.0 i just pushed @jaghatei

madhephaestus commented 4 years ago

closing issue, will reopen if user reports same error