Open cyangle opened 2 years ago
With the latest bug fixes, the interpreter could run all kemal tests successfully except the run_spec.cr
file which creates new processes with Process
.
@asterite @straight-shoota @sdogruyol
$ cr i spec/all_spec.cr
Using compiled compiler at /home/chao/git/personal/crystal/.build/crystal
....................................................................................................
Finished in 1.58 seconds
100 examples, 0 failures, 0 errors, 0 pending
real 0m3.263s
user 0m3.642s
sys 0m0.303s
Fork in interpreted mode is broken and I don't understand why yet. I debugged this a lot in the past. This will likely not be fixed any time soon.
Hello,
I'm not a unix expert, but I was looking a bit into this and on Darwin 19.6.0 x86_64, status = Process.run("ls")
stops at LibC.execvp(command, argv)
in Process.try_replace
that is executed in the child process started by LibC.fork
in Process.fork
.
Just as an experiment I tried to comment out the line that executes the command (LibC.execvp(command, argv)
) and I get an EINVAL
error from the parent and the Process.run()
call terminates trivially.
The input of LibC.execvp(command, argv)
is:
command: "ls"
argv: [Pointer(UInt8)@0x116920f6c, Pointer(UInt8).null]
Any idea why LibC.execvp(command, argv)
doesn't return?
How would you continue debugging this issue?
It's because there's something wrong in the way fork
is executed in the interpreter. And it's really hard to debug.
To begin with, we should start finding out why this happens:
pid = Crystal::System::Process.fork(will_exec: true)
if pid
puts "parent with child: #{pid}"
else
debugger
puts "child"
end
The output of running that through crystal i file.cr
is:
[warn] kevent: Bad file descriptor
[warn] kevent: Bad file descriptor
[warn] kevent: Bad file descriptor
...
My guess is that we'll need to make the interpreter aware of fork, but I'm not sure.
@straight-shoota This is a very complex bug and I think it is worth tracking it in #11555.
@asterite I have a partial fix for this defunct process issue.
The signal handlers were skipped for interpreted mode.
But it only fixes part of the issue.
Try interpreting spec/std/process_spec.cr
with and without this patch.
Somehow, the example code works with release build but not debug build.
diff --git a/src/kernel.cr b/src/kernel.cr
index 9f63d2ffc..0af75e9a7 100644
--- a/src/kernel.cr
+++ b/src/kernel.cr
@@ -523,7 +523,7 @@ end
end
{% end %}
-{% unless flag?(:interpreted) || flag?(:wasm32) %}
+{% unless flag?(:wasm32) %}
{% unless flag?(:win32) %}
# Background loop to cleanup unused fiber stacks.
spawn(name: "Fiber Clean Loop") do
@@ -536,6 +536,7 @@ end
Signal.setup_default_handlers
{% end %}
+ {% unless flag?(:interpreted) %}
# load debug info on start up of the program is executed with CRYSTAL_LOAD_DEBUG_INFO=1
# this will make debug info available on print_frame that is used by Crystal's segfault handler
#
@@ -544,6 +545,7 @@ end
# - Other values will load debug info on demand: when the backtrace of the first exception is generated
Exception::CallStack.load_debug_info if ENV["CRYSTAL_LOAD_DEBUG_INFO"]? == "1"
Exception::CallStack.setup_crash_handler
+ {% end %}
{% if flag?(:preview_mt) %}
Crystal::Scheduler.init_workers
With a release mode compiler built from my git branch, I could run all tests of kemal and lucky successfully with the interpreter.
@asterite @jwoertink @sdogruyol
$ cr -v
Using compiled compiler at .build/crystal
Crystal 1.5.0 [612e52b91] (2022-07-12)
LLVM: 14.0.6
Default target: x86_64-pc-linux-gnu
OS: Ubuntu 22.04
Cool!
What happens if you run this code with the interpreter?
loop do
puts `ls`
end
I got different things after a few iterations, I got the enum
error most of the times:
Invalid memory access (signal 11) at address 0x8
shift?|535|5|/home/chao/git/personal/crystal/src/deque.crunlock|122|18|/home/chao/git/personal/crystal/src/mutex.crsync|26|7|/home/chao/git/personal/crystal/src/crystal/spin_lock.crunlock|117|5|/home/chao/git/personal/crystal/src/mutex.crwait|334|7|/home/chao/git/personal/crystal/src/signal.crinitialize|11|16|/home/chao/git/personal/crystal/src/crystal/system/unix/process.crnew|10|3|/home/chao/git/personal/crystal/src/crystal/system/unix/process.crinitialize:shell:input:output:error|230|21|/home/chao/git/personal/crystal/src/process.crnew:shell:input:output:error|221|3|/home/chao/git/personal/crystal/src/process.cr`|436|13|/home/chao/git/personal/crystal/src/process.cr.crystal/bugs/loop.cr|2|8|.crystal/bugs/loop.crloop|103|5|/home/chao/git/personal/crystal/src/kernel.cr.crystal/bugs/loop.cr|1|1|.crystal/bugs/loop.cr
In: Hash(Int32, Int32)#fit_in_indices
Node: (1_u32 << @indices_size_pow2) - 1
Arithmetic overflow (OverflowError)
from src/hash.cr:522:32 in 'find_entry'
from src/hash.cr:1186:5 in 'solve'
from src/compiler/crystal/types.cr:1990:13 in 'replace_type_parameters'
from src/compiler/crystal/types.cr:1976:9 in 'parents'
from src/compiler/crystal/types.cr:385:7 in 'collect_ancestors'
from src/indexable.cr:558:11 in 'find_base_generic_instantiation'
from src/compiler/crystal/semantic/abstract_def_checker.cr:156:22 in 'implements?'
from src/compiler/crystal/semantic/abstract_def_checker.cr:96:20 in 'implements_with_ancestors?'
from src/compiler/crystal/semantic/abstract_def_checker.cr:77:15 in 'check_implemented_in_subtypes'
from src/compiler/crystal/semantic/abstract_def_checker.cr:53:13 in 'check_single'
from src/int.cr:540:7 in 'run'
from src/compiler/crystal/progress_tracker.cr:23:20 in 'semantic'
from src/gc/boehm.cr:129:5 in 'crystal_raise_overflow_compiled_def'
from src/compiler/crystal/interpreter/interpreter.cr:337:7 in 'interpret'
from src/compiler/crystal/interpreter/interpreter.cr:235:13 in 'interpret'
from src/compiler/crystal/interpreter/repl.cr:146:5 in 'interpret_and_exit_on_error'
from src/gc/boehm.cr:129:5 in 'repl'
from src/compiler/crystal/command.cr:100:7 in '__crystal_main'
from src/crystal/main.cr:115:5 in 'main'
from /lib/x86_64-linux-gnu/libc.so.6 in '??'
from /lib/x86_64-linux-gnu/libc.so.6 in '__libc_start_main'
from /home/chao/git/personal/crystal/.build/crystal in '_start'
from ???
Unhandled case: enum value outside of defined enum members (Exception)
from src/compiler/crystal/interpreter/interpreter.cr:337:7 in 'interpret'
from src/compiler/crystal/interpreter/interpreter.cr:235:13 in 'interpret'
from src/compiler/crystal/interpreter/repl.cr:146:5 in 'interpret_and_exit_on_error'
from src/gc/boehm.cr:129:5 in 'repl'
from src/compiler/crystal/command.cr:100:7 in '__crystal_main'
from src/crystal/main.cr:115:5 in 'main'
from /lib/x86_64-linux-gnu/libc.so.6 in '??'
from /lib/x86_64-linux-gnu/libc.so.6 in '__libc_start_main'
from /home/chao/git/personal/crystal/.build/crystal in '_start'
from ???
Unable to load dwarf information: Unhandled case: enum value outside of defined enum members (Exception)
from /home/chao/git/personal/crystal/.build/crystal in '??'
from /home/chao/git/personal/crystal/.build/crystal in '??'
from /home/chao/git/personal/crystal/.build/crystal in '??'
from /home/chao/git/personal/crystal/.build/crystal in '??'
from /home/chao/git/personal/crystal/.build/crystal in '??'
from /lib/x86_64-linux-gnu/libffi.so.8 in '??'
from /lib/x86_64-linux-gnu/libffi.so.8 in '??'
from /lib/x86_64-linux-gnu/libc.so.6 in '??'
from /lib/x86_64-linux-gnu/libc.so.6 in 'llseek'
from /home/chao/git/personal/crystal/.build/crystal in '??'
from /home/chao/git/personal/crystal/.build/crystal in '??'
from /home/chao/git/personal/crystal/.build/crystal in '??'
from /lib/x86_64-linux-gnu/libc.so.6 in 'dl_iterate_phdr'
from /home/chao/git/personal/crystal/.build/crystal in '??'
from /home/chao/git/personal/crystal/.build/crystal in '??'
from /home/chao/git/personal/crystal/.build/crystal in '??'
from /home/chao/git/personal/crystal/.build/crystal in '??'
from /home/chao/git/personal/crystal/.build/crystal in '__crystal_main'
from /home/chao/git/personal/crystal/.build/crystal in 'main'
from /lib/x86_64-linux-gnu/libc.so.6 in '??'
from /lib/x86_64-linux-gnu/libc.so.6 in '__libc_start_main'
from /home/chao/git/personal/crystal/.build/crystal in '_start'
from ???
Sometimes it's frozen.
I spent some time digging into this issue, here's what I found:
Crystal::SignalChildHandler
setup and processing signals from child processes and I think Crystal::SignalChildHandler
is in the compiled binary.Crystal::SignalChildHandler
which I think is created by the interpreter and managed by the VM.Crystal::SignalChildHandler
, the one on binary has default handlers setup while the one on VM does not. That's why child processes created by interpreted code are defunct. Try printing out the hash of Crystal::SignalChildHandler
from compiled and interpreted code, they are different.Crystal::SignalChildHandler
from the binary since we have both of them processing signals from child processes thus crashing the interpreterSo we'll need:
Crystal::SignalChildHandler
that is shared between the binary and the VM managed codeldd
on start up which will be defunct and the process freezes due to waiting for signals. I guess it's coming from shell expansion in link annotationsI'm stopping here because I have no idea how to fix or work around this issue.
My findings might be wrong but I hope it could help other people to fix this issue if my findings are proved to be correct.
@asterite @straight-shoota
It seems like the GC is crashing the interpreter, below code works fine with compiler built from current master branch:
@asterite @straight-shoota , please try it out.
require "gc"
GC.disable
puts "loop.cr: #{Process.pid} Crystal::SignalChildHandler: #{Crystal::SignalChildHandler.hash}"
::Signal.setup_default_handlers
count = 0
loop do
puts count
puts `ls`
count = count + 1
end
System info:
$ cr -v
Using compiled compiler at .build/crystal
Crystal 1.6.0-dev [aab3574bf] (2022-07-13)
LLVM: 14.0.6
Default target: x86_64-pc-linux-gnu
OS: Ubuntu 22.04
The interpreter could run all tests from spec/std/process_spec.cr
successfully after adding ::Signal.setup_default_handlers
at the beginning of the file and disabling GC with GC_DONT_GC=1
.
GC_DONT_GC=1 bin/crystal i spec/std/process_spec.cr
@asterite I think the debugger works fine with forking, you can try below code out with the interpreter:
Example code:
require "gc"
GC.disable
puts "loop.cr: #{Process.pid} Crystal::SignalChildHandler: #{Crystal::SignalChildHandler.hash}"
::Signal.setup_default_handlers
pid = Crystal::System::Process.fork(will_exec: true)
if pid
puts "#{Process.pid}: from parent start"
puts "#{Process.pid}: from parent stdin #{STDIN}"
puts "#{Process.pid}: from parent stdout #{STDOUT}"
puts "#{Process.pid}: child #{Crystal::SignalChildHandler.wait(pid).receive}"
debugger
puts "#{Process.pid}: from parent end"
else
puts "#{Process.pid}: from child start"
puts "#{Process.pid}: from child stdin #{STDIN}"
puts "#{Process.pid}: from child stdout #{STDOUT}"
debugger
puts "#{Process.pid}: from child end"
end
You need to block to wait in the parent, otherwise parent exits and the stdin and stdout would be closed. Interactive session log:
$ cr i .crystal/bugs/fork.cr
Using compiled compiler at .build/crystal
loop.cr: 12316 Crystal::SignalChildHandler: 11479924585718654121
12316: from parent start
12376: from child start
12316: from parent stdin #<IO::FileDescriptor:0x7f71cbe90800>
12376: from child stdin #<IO::FileDescriptor:0x7f71cbe90800>
12316: from parent stdout #<IO::FileDescriptor:0x7f71cbe90700>
12376: from child stdout #<IO::FileDescriptor:0x7f71cbe90700>
From: .crystal/bugs/fork.cr:19:8 <Program>#.crystal/bugs/fork.cr:
14: else
15: puts "#{Process.pid}: from child start"
16: puts "#{Process.pid}: from child stdin #{STDIN}"
17: puts "#{Process.pid}: from child stdout #{STDOUT}"
18: debugger
=> 19: puts "#{Process.pid}: from child end"
20: end
pry> Process.pid
12376
pry> continue
12376: from child end
12316: child 0
From: .crystal/bugs/fork.cr:13:8 <Program>#.crystal/bugs/fork.cr:
8: puts "#{Process.pid}: from parent start"
9: puts "#{Process.pid}: from parent stdin #{STDIN}"
10: puts "#{Process.pid}: from parent stdout #{STDOUT}"
11: puts "#{Process.pid}: child #{Crystal::SignalChildHandler.wait(pid).receive}"
12: debugger
=> 13: puts "#{Process.pid}: from parent end"
14: else
15: puts "#{Process.pid}: from child start"
16: puts "#{Process.pid}: from child stdin #{STDIN}"
17: puts "#{Process.pid}: from child stdout #{STDOUT}"
18: debugger
pry> Process.pid
12316
pry> continue
12316: from parent end
Thanks for the investigation!
I don't think I'll have time to look at this. But the gist is: if it works with the GC disabled, then it definitely doesn't work. It should work with the GC enabled. If it crashes with the GC on it means we aren't tracking memory properly. The problem with that is that it's really hard to know where we are doing something wrong! So it will require a lot of investigation and narrowing down things.
For now, I'll keep using my workaround for this issue.
And here is my current workaround in case someone needs it.
{% if flag?(:interpreted) %}
GC.disable
Signal::CHLD.reset
Crystal::Signal.start_loop
{% end %}
The problem isn't with fork
or maybe it was but this has long been fixed.
What happens is that calling an external process (e.g. Process.run
) eventually waits for the process to terminate but we can't use waitpid
or another syscall that would block the current thread, so the crystal runtime waits for a SIGCHLD signal to keep other fibers going, and resume the pending fiber when the signal is received.
The problem is that the interpreted code will never receive the signal, because the crystal interpreter will be the one receiving it, but since it doesn't know anything about that child process, it likely discards it... and the interpreted code hangs forever.
I'm not sure how this can be fixed. I'm not sure the interpreted code taking over the signals is a good idea. Maybe the interpreter should forward signals to the interpreted code?
We probably need an interpreter intrinsic for the interpreted code to tell the interpreter that it wants to register a signal handler, have the interpreter register the signal and remember it, then have the signal handler notify the interpreted code when it receives the signal.
The interpreter could write directly to the signal loop file descriptor of the interpreted code, which may need another intrinsic to setup (or simply be an arg of the register intrinsic above). Then the signal loop would work just as normal in the interpreted code.
The interpreter can handle SIG_DFL
and SIG_IGN
directly, so I don't think we need an unregister signal intrinsic.
We could also consider putting signal handling on the event loop. That's an orthogonal issue, but might offer some alternative for interpreter integration.
Oh, it seems that once the interpreter starts interpreting, the signal loop fiber won't be scheduled anymore, and signals won't be handled anymore :thinking:
Oh my, the patch is incredibly small. Just add the following to src/kernel.cr
and it works:
+{% if flag?(:interpreted) && flag?(:unix) %}
+ Crystal::System::Signal.setup_default_handlers
+{% end %}
The interpreter won't receive signals anymore but it won't process them anyway, so it's better to let the interpreted code receive the signal, and handle it properly.
Start the repl, then run another shell command via
Process
, then it's frozen forever.The child process is defunct.