arton / rjb

Ruby Java Bridge
https://www.artonx.org/collabo/backyard/?RubyJavaBridge
GNU Lesser General Public License v2.1
117 stars 34 forks source link

rjb & ruby 2.6.5 - system calls fail to exit #72

Closed kwerle closed 3 years ago

kwerle commented 4 years ago

I'm afraid I do not have a trivial test case for this issue.

We are using a 3rd party Java library (chemaxon). We go across the bridge and set up a database connection on the java side. Once we do this, it causes system commands to not exit. In brief:

  CacheRegistrationUtil =  Rjb::import("chemaxon.jchem.db.CacheRegistrationUtil")
  registration_util = CacheRegistrationUtil.new(connection_handler)
  registration_util.registerCache()

...

  system('ls')

The system command executes the ls. The output is displayed. But the system call never exits.

root@5138daa5ec6e:/tmp/src# ps ax
  PID TTY      STAT   TIME COMMAND
...
 1210 pts/0    Sl+    0:12 ./test.rb
 1249 pts/0    Z+     0:00 [ls] <defunct>

rbspy recording:

Summary of profiling data so far:
% self  % total  name
100.00   100.00  <c function> - unknown
  0.00   100.00  runner - /usr/local/bundle/ruby/2.6.0/gems/railties-5.0.7.2/lib/rails/commands/commands_tasks.rb
  0.00   100.00  run_command! - /usr/local/bundle/ruby/2.6.0/gems/railties-5.0.7.2/lib/rails/commands/commands_tasks.rb
  0.00   100.00  require_command! - /usr/local/bundle/ruby/2.6.0/gems/railties-5.0.7.2/lib/rails/commands/commands_tasks.rb
  0.00   100.00  <top (required)> - /usr/local/bundle/ruby/2.6.0/gems/railties-5.0.7.2/lib/rails/commands/runner.rb
  0.00   100.00  <top (required)> - /usr/local/bundle/ruby/2.6.0/gems/railties-5.0.7.2/lib/rails/commands.rb
  0.00   100.00  <top (required)> - /tmp/src/test.rb
  0.00   100.00  <main> - ./bin/rails

This did NOT happen in ruby 2.5.5. It still happens in 2.7.0rc1

Any help appreciated. If there is anything I can do to help, please let me know.

All testing done in docker containers (ruby:2.5.5, ruby:2.6.5)

kwerle commented 4 years ago

gdb backtrace:

#0  0x00007fdeba798916 in __GI_ppoll (fds=fds@entry=0x7ffdd0b4d350, 
    nfds=nfds@entry=2, timeout=<optimized out>, sigmask=sigmask@entry=0x0)
    at ../sysdeps/unix/sysv/linux/ppoll.c:39
#1  0x00007fdebaf09332 in native_ppoll_sleep (th=th@entry=0x55bc09ddbea0, 
    rel=rel@entry=0x0) at hrtime.h:148
#2  0x00007fdebaf09b9a in native_sleep (th=th@entry=0x55bc09ddbea0, rel=0x0)
    at thread_pthread.c:2170
#3  0x00007fdebaf0dc8e in rb_thread_sleep_interruptible () at thread.c:1307
#4  0x00007fdebae77e55 in waitpid_sleep (x=x@entry=140728104965520)
    at process.c:1102
#5  0x00007fdebadc5966 in rb_ensure (
    b_proc=b_proc@entry=0x7fdebae77e40 <waitpid_sleep>, 
    data1=data1@entry=140728104965520, 
    e_proc=e_proc@entry=0x7fdebae77df0 <waitpid_cleanup>, 
    data2=data2@entry=140728104965520) at eval.c:1076
#6  0x00007fdebae7f9a9 in rb_f_system (argc=<optimized out>, 
    argv=<optimized out>) at process.c:4476
#7  0x00007fdebaf3a79b in vm_call_cfunc_with_frame (ci=0x55bc12690490, 
    cc=<optimized out>, calling=<optimized out>, reg_cfp=0x7fdeba6a6da8, 
    ec=0x55bc09ddc328) at vm_insnhelper.c:1908
#8  vm_call_cfunc (ec=0x55bc09ddc328, reg_cfp=0x7fdeba6a6da8, 
    calling=<optimized out>, ci=0x55bc12690490, cc=<optimized out>)
    at vm_insnhelper.c:1924
--Type <RET> for more, q to quit, c to continue without paging--q
Quit
(gdb) p * fds
$1 = {fd = 3, events = 1, revents = 0}

On the advice of a friend:

(gdb) p (int)close(3)    
[ASYNC BUG] consume_communication_pipe: read(3)
EBADF

ruby 2.6.5p114 (2019-10-01 revision 67812) [x86_64-linux]

[NOTE]
You may have encountered a bug in the Ruby interpreter or extension libraries.
Bug reports are welcome.
For details: https://www.ruby-lang.org/bugreport.html

$2 = 0
(gdb) quit
A debugging session is active.

    Inferior 1 [process 89] will be detached.

Quit anyway? (y or n) y
Detaching from program: /usr/local/bin/ruby, process 89
[Inferior 1 (process 89) detached]
root@4f1f0c16d672:/tmp/src# fg
bash: fg: job has terminated
[1]+  Aborted                 rails runner "system 'ls'"

Sure looks like that pipe isn't closing cleanly - but if you force it closed then things exit.

root@4f1f0c16d672:/tmp/src# ls -al /proc/136/fd/3
lrwx------ 1 root root 64 Apr  6 16:29 /proc/136/fd/3 -> 'anon_inode:[eventfd]'
kwerle commented 4 years ago

I should also mention that the app hangs before at_exit fires. But if you hit ctrl-c, the app will then continue and fire at_exit blocks.

kwerle commented 3 years ago

It's interesting that this file descriptor seems to exist for any ruby process. In ruby 2.6.5 it looks like:

lrwx------ 1 root root 64 Feb  2 09:48 3 -> 'anon_inode:[eventfd]'

But in ruby 2.5.8 it is clearly a pipe:

lr-x------ 1 root root 64 Feb  2 09:29 3 -> 'pipe:[146388]'
arton commented 3 years ago

Hi kwerle I have no idea for your case, however it seems that Java consumes SIGCHLD. I wonder if you could test wirh -Xrs option to call Rjb::load. for ex) Rjb::load('.', ['-Xrs'])

kwerle commented 3 years ago

arton, Thank you for your time and efforts! We're already passing that to the loader.

It turns out that updating our java library (chemaxon) has invalidated my trivial example - which now exits. We are, however, still seeing problems exiting some system calls. I have not yet been able to isolate them locally. I will follow up on this ticket once I know more.

akiotajima commented 3 years ago

kwerle Thank you for your collaboration. I'm sorry, I wasn't able to help you for this issue.

kwerle commented 3 years ago

@akiotajima Thank you for this project! It is very useful to us and we will continue to use it gladly.

We can only reproduce this issue on Java 8. When we upgrade to java 11 it no longer happens. We will upgrade!

I'm always a little sad to close an issue without getting to the root cause - but in cases like this I will let it go.