NeuronRobotics / nrjavaserial

A Java Serial Port system. This is a fork of the RXTX project that uses in jar loading of the native code.
Other
345 stars 143 forks source link

initialise_event_info_struct: initialise failed! #111

Closed roxkamx closed 3 years ago

roxkamx commented 6 years ago

Hi,

I have a project where I attach 32 devices grouped in 2 physical hubs ( 16 each ) to a Centos7 server. With 16 devices my application works just fine.

When I start the application with both hubs attached ( i.e. 32 devices attached ) at the 24th device my application stops with the following error:

initialise_event_info_struct: initialise failed!

This message looks to be from RXTX ( as found on http://rxtx.qbang.org/pub/rxtx/rxtx-2.1-7r2/src/SerialImp.c )

Any suggestions how to overcome this issue? is there any limitation? The same thing happened on 2 other different PCs, so I would exclude for now the USB driver.

Thanks

madhephaestus commented 4 years ago

try 5.0.2

madhephaestus commented 4 years ago

unfortunately as i try to understand how this is happening, i have run into the 'goto' problem. Fail can be reached by 2 ways, one is id this statement returns true:

if (eis->fd < FD_SETSIZE && eis->fd > 0) 

and the other

if(eis->send_event == NULL)
madhephaestus commented 4 years ago

Now the errors will be :

initialise_event_info_struct: eis->send_event == NULL!
initialise_event_info_struct: eis->fd >= FD_SETSIZE!
initialise_event_info_struct: eis->fd <= 0!
wborn commented 4 years ago

For me when the error occurs it results in:

initialise_event_info_struct: eis->fd >= FD_SETSIZE!

madhephaestus commented 4 years ago

This issue seems to be related to this:

https://access.redhat.com/solutions/488623

madhephaestus commented 4 years ago

It seems we would need to transition away from SELECT() and switch to poll()

http://man7.org/linux/man-pages/man2/select.2.html

to

http://man7.org/linux/man-pages/man2/poll.2.html

This will result in lots of opening and closing code needing to be checked when its changed, and a lot of changes to the data structures used to keep track of the serial port events.

MrDOS commented 4 years ago

I agree that switching to poll(2) would be preferable for many reasons, but at the same time, FD_SETSIZE is 1024 just about everywhere. That's a lot of fds, even with 32 physical ports. @roxkamx and @wborn, do your applications churn through opening/closing the serial port many times, or do you typically acquire a port instance once and then hold it open for as long as you can? I'm trying to understand if fd churn is happening due to library or application behaviour. Regardless, this is clearly bad behaviour on the part of the library, but I wonder if we can't find an application-level workaround in the meantime.

wborn commented 4 years ago

It seems the library is keeping thousands of file descriptors to temporary files, e.g. lsof -p PID lists these lines:

java    15241 wouter  955u   REG   0,26         0      947 /run/lock/tmpj1rrF2 (deleted)
java    15241 wouter  956u   REG   0,26         0      948 /run/lock/tmpPwuMWV (deleted)
java    15241 wouter  957u   REG   0,26         0      949 /run/lock/tmpgPk8dP (deleted)
java    15241 wouter  958u   REG   0,26         0      950 /run/lock/tmp356uvI (deleted)
java    15241 wouter  959u   REG   0,26         0      951 /run/lock/tmpKAGSMB (deleted)
java    15241 wouter  960u   REG   0,26         0      952 /run/lock/tmpE64g4u (deleted)
java    15241 wouter  961u   REG   0,26         0      953 /run/lock/tmpxsjGlo (deleted)
java    15241 wouter  962u   REG   0,26         0      954 /run/lock/tmpKDm6Ch (deleted)
java    15241 wouter  963u   REG   0,26         0      955 /run/lock/tmpmLexUa (deleted)
java    15241 wouter  964u   REG   0,26         0      956 /run/lock/tmpuMVYb4 (deleted)
java    15241 wouter  965u   REG   0,26         0      957 /run/lock/tmpSNrrtX (deleted)
java    15241 wouter  966u   REG   0,26         0      958 /run/lock/tmp8FMUKQ (deleted)
java    15241 wouter  967u   REG   0,26         0      959 /run/lock/tmpLbYo2J (deleted)
java    15241 wouter  968u   REG   0,26         0      960 /run/lock/tmpKxYTjD (deleted)
java    15241 wouter  969u   REG   0,26         0      961 /run/lock/tmpGWOpBw (deleted)
java    15241 wouter  970u   REG   0,26         0      962 /run/lock/tmpLTtWSp (deleted)
java    15241 wouter  971u   REG   0,26         0      963 /run/lock/tmpcAWtaj (deleted)
java    15241 wouter  972u   REG   0,26         0      964 /run/lock/tmptSf2rc (deleted)
java    15241 wouter  973u   REG   0,26         0      965 /run/lock/tmps2oBJ5 (deleted)
java    15241 wouter  974u   REG   0,26         0      966 /run/lock/tmp5pfb1Y (deleted)
java    15241 wouter  975u   REG   0,26         0      967 /run/lock/tmp6fWLiS (deleted)
java    15241 wouter  976u   REG   0,26         0      968 /run/lock/tmpGatnAL (deleted)

It's easily reproduced by running the example program in https://github.com/NeuronRobotics/nrjavaserial/issues/180 and not inserting the USB stick. If I insert it after there are thousands of these file descriptors I immediately see the error. I think I've even seen it cause issues with other programs because of this.

MrDOS commented 4 years ago

This is really curious. Retrieving a specific port identifier by name (CommPortIdentifier.getPortIdentifier(String)) creates (and deletes) six lock files. And on my test system with six serial ports, 18 are created while scanning for port availability (CommPortIdentifier.getPortIdentifiers()). Then another one is created when opening the port, and another 12 are created while closing it.

Detecting ports opens each one, I think, to confirm that the user has access to the file. (I don't think there's a good reason for doing that instead of calling access(2), but that's the way it's done right now.) So that's one lock file per port detected, assuming the detection delegates out to the same routine for opening the port as when opening them regularly. On my test box, that should be six total, not 18. And if retrieving a port identifier also checks readability, then that should be another one, not six. And one would think that closing a port would not involve any open operations, just a close(2) for the port, and an unlink(2) for the lock file. So really, opening the port is the only operation which opens a sensible number of files. Wild. I'm going to dig in further.

MrDOS commented 4 years ago

I can confirm that the problem only exists when using locking. @roxkamx, if you want a quick-and-dirty workaround, you can add #define DISABLE_LOCKFILES to somewhere near the top of src/main/c/SerialImp.h (anywhere above line 374) and recompile the natives for your platform.

zyliu1985 commented 3 years ago

I can confirm that the problem only exists when using locking. @roxkamx, if you want a quick-and-dirty workaround, you can add #define DISABLE_LOCKFILES to somewhere near the top of src/main/c/SerialImp.h (anywhere above line 374) and recompile the natives for your platform.

Hi, is there any plan to fix this issue? Thank you.

MrDOS commented 3 years ago

Yes, I'd like to replace all of the locking with something more modern. But I don't have a timeline for getting to do it. Are you able to work around the issue for now by disabling lockfiles?

MrDOS commented 3 years ago

@roxkamx and @zyliu1985, can you please test again with this build of the library (from this workflow run) to see if #216 fixed this issue for you?

wborn commented 3 years ago

Many thanks for having a look at this @MrDOS! I gave the build with the changes a test and it seems to fix the issue for me. :+1:

All the temporary lock files are gone from the output of lsof when testing it with my example of https://github.com/NeuronRobotics/nrjavaserial/issues/180 (and I don't insert the USB device).

It also worked fine with openHAB. :slightly_smiling_face:

farfade commented 2 years ago

Hello, wouldn't worth it releasing a stable version of the library with this fix because it seems to be blocking, especially when dealing with debian bullseye (see openhab problems) ?