crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0
19.5k stars 1.62k forks source link

Interpreter: Process#run broken with defunct child process #12241

Open cyangle opened 2 years ago

cyangle commented 2 years ago

Start the repl, then run another shell command via Process, then it's frozen forever.

The child process is defunct.

$ cr i
Using compiled compiler at /home/chao/git/personal/crystal/.build/crystal
icr:1:0> status = Process.run("ls")
$ ps aux | grep ls
chao      185966  0.0  0.0      0     0 pts/4    Z+   01:29   0:00 [ls] <defunct>
cyangle commented 2 years ago

With the latest bug fixes, the interpreter could run all kemal tests successfully except the run_spec.cr file which creates new processes with Process.

@asterite @straight-shoota @sdogruyol

$ cr i spec/all_spec.cr 
Using compiled compiler at /home/chao/git/personal/crystal/.build/crystal
....................................................................................................

Finished in 1.58 seconds
100 examples, 0 failures, 0 errors, 0 pending

real    0m3.263s
user    0m3.642s
sys     0m0.303s
asterite commented 2 years ago

Fork in interpreted mode is broken and I don't understand why yet. I debugged this a lot in the past. This will likely not be fixed any time soon.

potomak commented 2 years ago

Hello,

I'm not a unix expert, but I was looking a bit into this and on Darwin 19.6.0 x86_64, status = Process.run("ls") stops at LibC.execvp(command, argv) in Process.try_replace that is executed in the child process started by LibC.fork in Process.fork.

Just as an experiment I tried to comment out the line that executes the command (LibC.execvp(command, argv)) and I get an EINVAL error from the parent and the Process.run() call terminates trivially.

The input of LibC.execvp(command, argv) is:

Any idea why LibC.execvp(command, argv) doesn't return?

How would you continue debugging this issue?

asterite commented 2 years ago

It's because there's something wrong in the way fork is executed in the interpreter. And it's really hard to debug.

To begin with, we should start finding out why this happens:

pid = Crystal::System::Process.fork(will_exec: true)
if pid
  puts "parent with child: #{pid}"
else
  debugger
  puts "child"
end

The output of running that through crystal i file.cr is:

[warn] kevent: Bad file descriptor
[warn] kevent: Bad file descriptor
[warn] kevent: Bad file descriptor
...

My guess is that we'll need to make the interpreter aware of fork, but I'm not sure.

cyangle commented 2 years ago

@straight-shoota This is a very complex bug and I think it is worth tracking it in #11555.

cyangle commented 2 years ago

@asterite I have a partial fix for this defunct process issue.

The signal handlers were skipped for interpreted mode.

But it only fixes part of the issue.

Try interpreting spec/std/process_spec.cr with and without this patch.

Somehow, the example code works with release build but not debug build.

diff --git a/src/kernel.cr b/src/kernel.cr
index 9f63d2ffc..0af75e9a7 100644
--- a/src/kernel.cr
+++ b/src/kernel.cr
@@ -523,7 +523,7 @@ end
   end
 {% end %}

-{% unless flag?(:interpreted) || flag?(:wasm32) %}
+{% unless flag?(:wasm32) %}
   {% unless flag?(:win32) %}
     # Background loop to cleanup unused fiber stacks.
     spawn(name: "Fiber Clean Loop") do
@@ -536,6 +536,7 @@ end
     Signal.setup_default_handlers
   {% end %}

+  {% unless flag?(:interpreted) %}
   # load debug info on start up of the program is executed with CRYSTAL_LOAD_DEBUG_INFO=1
   # this will make debug info available on print_frame that is used by Crystal's segfault handler
   #
@@ -544,6 +545,7 @@ end
   # - Other values will load debug info on demand: when the backtrace of the first exception is generated
   Exception::CallStack.load_debug_info if ENV["CRYSTAL_LOAD_DEBUG_INFO"]? == "1"
   Exception::CallStack.setup_crash_handler
+  {% end %}

   {% if flag?(:preview_mt) %}
     Crystal::Scheduler.init_workers
cyangle commented 2 years ago

With a release mode compiler built from my git branch, I could run all tests of kemal and lucky successfully with the interpreter.

@asterite @jwoertink @sdogruyol

$ cr -v
Using compiled compiler at .build/crystal
Crystal 1.5.0 [612e52b91] (2022-07-12)

LLVM: 14.0.6
Default target: x86_64-pc-linux-gnu

OS: Ubuntu 22.04
asterite commented 2 years ago

Cool!

What happens if you run this code with the interpreter?

loop do
  puts `ls`
end
cyangle commented 2 years ago

I got different things after a few iterations, I got the enum error most of the times:

Invalid memory access (signal 11) at address 0x8
shift?|535|5|/home/chao/git/personal/crystal/src/deque.crunlock|122|18|/home/chao/git/personal/crystal/src/mutex.crsync|26|7|/home/chao/git/personal/crystal/src/crystal/spin_lock.crunlock|117|5|/home/chao/git/personal/crystal/src/mutex.crwait|334|7|/home/chao/git/personal/crystal/src/signal.crinitialize|11|16|/home/chao/git/personal/crystal/src/crystal/system/unix/process.crnew|10|3|/home/chao/git/personal/crystal/src/crystal/system/unix/process.crinitialize:shell:input:output:error|230|21|/home/chao/git/personal/crystal/src/process.crnew:shell:input:output:error|221|3|/home/chao/git/personal/crystal/src/process.cr`|436|13|/home/chao/git/personal/crystal/src/process.cr.crystal/bugs/loop.cr|2|8|.crystal/bugs/loop.crloop|103|5|/home/chao/git/personal/crystal/src/kernel.cr.crystal/bugs/loop.cr|1|1|.crystal/bugs/loop.cr
In: Hash(Int32, Int32)#fit_in_indices
Node: (1_u32 << @indices_size_pow2) - 1
Arithmetic overflow (OverflowError)
  from src/hash.cr:522:32 in 'find_entry'
  from src/hash.cr:1186:5 in 'solve'
  from src/compiler/crystal/types.cr:1990:13 in 'replace_type_parameters'
  from src/compiler/crystal/types.cr:1976:9 in 'parents'
  from src/compiler/crystal/types.cr:385:7 in 'collect_ancestors'
  from src/indexable.cr:558:11 in 'find_base_generic_instantiation'
  from src/compiler/crystal/semantic/abstract_def_checker.cr:156:22 in 'implements?'
  from src/compiler/crystal/semantic/abstract_def_checker.cr:96:20 in 'implements_with_ancestors?'
  from src/compiler/crystal/semantic/abstract_def_checker.cr:77:15 in 'check_implemented_in_subtypes'
  from src/compiler/crystal/semantic/abstract_def_checker.cr:53:13 in 'check_single'
  from src/int.cr:540:7 in 'run'
  from src/compiler/crystal/progress_tracker.cr:23:20 in 'semantic'
  from src/gc/boehm.cr:129:5 in 'crystal_raise_overflow_compiled_def'
  from src/compiler/crystal/interpreter/interpreter.cr:337:7 in 'interpret'
  from src/compiler/crystal/interpreter/interpreter.cr:235:13 in 'interpret'
  from src/compiler/crystal/interpreter/repl.cr:146:5 in 'interpret_and_exit_on_error'
  from src/gc/boehm.cr:129:5 in 'repl'
  from src/compiler/crystal/command.cr:100:7 in '__crystal_main'
  from src/crystal/main.cr:115:5 in 'main'
  from /lib/x86_64-linux-gnu/libc.so.6 in '??'
  from /lib/x86_64-linux-gnu/libc.so.6 in '__libc_start_main'
  from /home/chao/git/personal/crystal/.build/crystal in '_start'
  from ???
Unhandled case: enum value outside of defined enum members (Exception)
  from src/compiler/crystal/interpreter/interpreter.cr:337:7 in 'interpret'
  from src/compiler/crystal/interpreter/interpreter.cr:235:13 in 'interpret'
  from src/compiler/crystal/interpreter/repl.cr:146:5 in 'interpret_and_exit_on_error'
  from src/gc/boehm.cr:129:5 in 'repl'
  from src/compiler/crystal/command.cr:100:7 in '__crystal_main'
  from src/crystal/main.cr:115:5 in 'main'
  from /lib/x86_64-linux-gnu/libc.so.6 in '??'
  from /lib/x86_64-linux-gnu/libc.so.6 in '__libc_start_main'
  from /home/chao/git/personal/crystal/.build/crystal in '_start'
  from ???
Unable to load dwarf information: Unhandled case: enum value outside of defined enum members (Exception)
  from /home/chao/git/personal/crystal/.build/crystal in '??'
  from /home/chao/git/personal/crystal/.build/crystal in '??'
  from /home/chao/git/personal/crystal/.build/crystal in '??'
  from /home/chao/git/personal/crystal/.build/crystal in '??'
  from /home/chao/git/personal/crystal/.build/crystal in '??'
  from /lib/x86_64-linux-gnu/libffi.so.8 in '??'
  from /lib/x86_64-linux-gnu/libffi.so.8 in '??'
  from /lib/x86_64-linux-gnu/libc.so.6 in '??'
  from /lib/x86_64-linux-gnu/libc.so.6 in 'llseek'
  from /home/chao/git/personal/crystal/.build/crystal in '??'
  from /home/chao/git/personal/crystal/.build/crystal in '??'
  from /home/chao/git/personal/crystal/.build/crystal in '??'
  from /lib/x86_64-linux-gnu/libc.so.6 in 'dl_iterate_phdr'
  from /home/chao/git/personal/crystal/.build/crystal in '??'
  from /home/chao/git/personal/crystal/.build/crystal in '??'
  from /home/chao/git/personal/crystal/.build/crystal in '??'
  from /home/chao/git/personal/crystal/.build/crystal in '??'
  from /home/chao/git/personal/crystal/.build/crystal in '__crystal_main'
  from /home/chao/git/personal/crystal/.build/crystal in 'main'
  from /lib/x86_64-linux-gnu/libc.so.6 in '??'
  from /lib/x86_64-linux-gnu/libc.so.6 in '__libc_start_main'
  from /home/chao/git/personal/crystal/.build/crystal in '_start'
  from ???

Sometimes it's frozen.

cyangle commented 2 years ago

I spent some time digging into this issue, here's what I found:

So we'll need:

I'm stopping here because I have no idea how to fix or work around this issue.

My findings might be wrong but I hope it could help other people to fix this issue if my findings are proved to be correct.

@asterite @straight-shoota

cyangle commented 2 years ago

It seems like the GC is crashing the interpreter, below code works fine with compiler built from current master branch:

@asterite @straight-shoota , please try it out.

require "gc"
GC.disable
puts "loop.cr: #{Process.pid} Crystal::SignalChildHandler: #{Crystal::SignalChildHandler.hash}"
::Signal.setup_default_handlers
count = 0
loop do
  puts count
  puts `ls`
  count = count + 1
end

System info:

$ cr -v
Using compiled compiler at .build/crystal
Crystal 1.6.0-dev [aab3574bf] (2022-07-13)

LLVM: 14.0.6
Default target: x86_64-pc-linux-gnu

OS: Ubuntu 22.04
cyangle commented 2 years ago

The interpreter could run all tests from spec/std/process_spec.cr successfully after adding ::Signal.setup_default_handlers at the beginning of the file and disabling GC with GC_DONT_GC=1.

GC_DONT_GC=1 bin/crystal i spec/std/process_spec.cr
cyangle commented 2 years ago

@asterite I think the debugger works fine with forking, you can try below code out with the interpreter:

Example code:

require "gc"
GC.disable
puts "loop.cr: #{Process.pid} Crystal::SignalChildHandler: #{Crystal::SignalChildHandler.hash}"
::Signal.setup_default_handlers

pid = Crystal::System::Process.fork(will_exec: true)
if pid
  puts "#{Process.pid}: from parent start"
  puts "#{Process.pid}: from parent stdin #{STDIN}"
  puts "#{Process.pid}: from parent stdout #{STDOUT}"
  puts "#{Process.pid}: child #{Crystal::SignalChildHandler.wait(pid).receive}"
  debugger
  puts "#{Process.pid}: from parent end"
else
  puts "#{Process.pid}: from child start"
  puts "#{Process.pid}: from child stdin #{STDIN}"
  puts "#{Process.pid}: from child stdout #{STDOUT}"
  debugger
  puts "#{Process.pid}: from child end"
end

You need to block to wait in the parent, otherwise parent exits and the stdin and stdout would be closed. Interactive session log:

$ cr i .crystal/bugs/fork.cr 
Using compiled compiler at .build/crystal
loop.cr: 12316 Crystal::SignalChildHandler: 11479924585718654121
12316: from parent start
12376: from child start
12316: from parent stdin #<IO::FileDescriptor:0x7f71cbe90800>
12376: from child stdin #<IO::FileDescriptor:0x7f71cbe90800>
12316: from parent stdout #<IO::FileDescriptor:0x7f71cbe90700>
12376: from child stdout #<IO::FileDescriptor:0x7f71cbe90700>
From: .crystal/bugs/fork.cr:19:8 <Program>#.crystal/bugs/fork.cr:

    14: else
    15:   puts "#{Process.pid}: from child start"
    16:   puts "#{Process.pid}: from child stdin #{STDIN}"
    17:   puts "#{Process.pid}: from child stdout #{STDOUT}"
    18:   debugger
 => 19:   puts "#{Process.pid}: from child end"
    20: end

pry> Process.pid
12376
pry> continue
12376: from child end
12316: child 0
From: .crystal/bugs/fork.cr:13:8 <Program>#.crystal/bugs/fork.cr:

     8:   puts "#{Process.pid}: from parent start"
     9:   puts "#{Process.pid}: from parent stdin #{STDIN}"
    10:   puts "#{Process.pid}: from parent stdout #{STDOUT}"
    11:   puts "#{Process.pid}: child #{Crystal::SignalChildHandler.wait(pid).receive}"
    12:   debugger
 => 13:   puts "#{Process.pid}: from parent end"
    14: else
    15:   puts "#{Process.pid}: from child start"
    16:   puts "#{Process.pid}: from child stdin #{STDIN}"
    17:   puts "#{Process.pid}: from child stdout #{STDOUT}"
    18:   debugger

pry> Process.pid
12316
pry> continue
12316: from parent end
asterite commented 2 years ago

Thanks for the investigation!

I don't think I'll have time to look at this. But the gist is: if it works with the GC disabled, then it definitely doesn't work. It should work with the GC enabled. If it crashes with the GC on it means we aren't tracking memory properly. The problem with that is that it's really hard to know where we are doing something wrong! So it will require a lot of investigation and narrowing down things.

cyangle commented 2 years ago

For now, I'll keep using my workaround for this issue.

And here is my current workaround in case someone needs it.

{% if flag?(:interpreted) %}
  GC.disable
  Signal::CHLD.reset
  Crystal::Signal.start_loop
{% end %}
ysbaddaden commented 12 months ago

The problem isn't with fork or maybe it was but this has long been fixed.

What happens is that calling an external process (e.g. Process.run) eventually waits for the process to terminate but we can't use waitpid or another syscall that would block the current thread, so the crystal runtime waits for a SIGCHLD signal to keep other fibers going, and resume the pending fiber when the signal is received.

The problem is that the interpreted code will never receive the signal, because the crystal interpreter will be the one receiving it, but since it doesn't know anything about that child process, it likely discards it... and the interpreted code hangs forever.

I'm not sure how this can be fixed. I'm not sure the interpreted code taking over the signals is a good idea. Maybe the interpreter should forward signals to the interpreted code?

ysbaddaden commented 5 months ago

We probably need an interpreter intrinsic for the interpreted code to tell the interpreter that it wants to register a signal handler, have the interpreter register the signal and remember it, then have the signal handler notify the interpreted code when it receives the signal.

The interpreter could write directly to the signal loop file descriptor of the interpreted code, which may need another intrinsic to setup (or simply be an arg of the register intrinsic above). Then the signal loop would work just as normal in the interpreted code.

The interpreter can handle SIG_DFL and SIG_IGN directly, so I don't think we need an unregister signal intrinsic.

straight-shoota commented 5 months ago

We could also consider putting signal handling on the event loop. That's an orthogonal issue, but might offer some alternative for interpreter integration.

ysbaddaden commented 5 months ago

Oh, it seems that once the interpreter starts interpreting, the signal loop fiber won't be scheduled anymore, and signals won't be handled anymore :thinking:

ysbaddaden commented 5 months ago

Oh my, the patch is incredibly small. Just add the following to src/kernel.cr and it works:

+{% if flag?(:interpreted) && flag?(:unix) %}
+  Crystal::System::Signal.setup_default_handlers
+{% end %}

The interpreter won't receive signals anymore but it won't process them anyway, so it's better to let the interpreted code receive the signal, and handle it properly.