Making this to track an issue first seen in #4: some of the tests call `rmprocs()`, and after changing CI to run with `JULIA_NUM_THREADS=4` the workers can hang until `rmprocs()` times out and sends SIGQUIT.
Example backtrace:
```julia
From worker 21:
From worker 21: [2110] signal 3: Quit # Timeout, rmprocs() sends SIGQUIT
From worker 21: in expression starting at none:1
From worker 21: unknown function (ip: 0x7ff13a091115) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: pthread_cond_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
From worker 21: uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:822
From worker 21: jl_parallel_gc_threadfun at /cache/build/builder-amdci4-4/julialang/julia-master/src/gc-stock.c:3550
From worker 21: unknown function (ip: 0x7ff13a094ac2) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: unknown function (ip: 0x7ff13a12684f) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: unknown function (ip: (nil)) at (unknown file)
From worker 21: unknown function (ip: 0x7ff13a091115) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: pthread_cond_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
From worker 21: uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:822
From worker 21: jl_parallel_gc_threadfun at /cache/build/builder-amdci4-4/julialang/julia-master/src/gc-stock.c:3550
From worker 21: unknown function (ip: 0x7ff13a094ac2) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: unknown function (ip: 0x7ff13a12684f) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: unknown function (ip: (nil)) at (unknown file)
From worker 21: unknown function (ip: 0x7ff13a091115) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: pthread_cond_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
From worker 21: uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:822
From worker 21: jl_parallel_gc_threadfun at /cache/build/builder-amdci4-4/julialang/julia-master/src/gc-stock.c:3550
From worker 21: unknown function (ip: 0x7ff13a094ac2) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: unknown function (ip: 0x7ff13a12684f) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: unknown function (ip: (nil)) at (unknown file)
From worker 21: unknown function (ip: 0x7ff13a091115) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: pthread_cond_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
From worker 21: wait at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia_locks.h:130 [inlined]
From worker 21: operator() at /cache/build/builder-amdci4-4/julialang/julia-master/src/engine.cpp:97 [inlined]
From worker 21: jl_engine_reserve at /cache/build/builder-amdci4-4/julialang/julia-master/src/engine.cpp:100
From worker 21: engine_reserve at ./compiler/types.jl:408 [inlined]
From worker 21: engine_reserve at ./compiler/types.jl:407 [inlined]
From worker 21: typeinf_ext at ./compiler/typeinfer.jl:1080
From worker 21: typeinf_ext_toplevel at ./compiler/typeinfer.jl:1176 [inlined]
From worker 21: typeinf_ext_toplevel at ./compiler/typeinfer.jl:1174 # Start compilation and get stuck in the GC
From worker 21: jfptr_typeinf_ext_toplevel_48134.1 at /opt/hostedtoolcache/julia/nightly/x64/lib/julia/sys.so (unknown line)
From worker 21: jl_apply at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia.h:2243 [inlined]
From worker 21: jl_type_infer at /cache/build/builder-amdci4-4/julialang/julia-master/src/gf.c:394
From worker 21: jl_compile_method_internal at /cache/build/builder-amdci4-4/julialang/julia-master/src/gf.c:2820
From worker 21: _jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-master/src/gf.c:3299 [inlined]
From worker 21: ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-master/src/gf.c:3495
From worker 21: show_exception_stack at ./errorshow.jl:1015 # Something in an errormonitor fails and we try to print the exception
From worker 21: display_error at ./client.jl:117
From worker 21: #errormonitor##0 at ./task.jl:734
From worker 21: jfptr_YY.errormonitorYY.YY.0_74460.1 at /opt/hostedtoolcache/julia/nightly/x64/lib/julia/sys.so (unknown line)
From worker 21: jl_apply at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia.h:2243 [inlined]
From worker 21: start_task at /cache/build/builder-amdci4-4/julialang/julia-master/src/task.c:1263 # Switches to one of the remaining tasks
From worker 21: unknown function (ip: (nil)) at (unknown file)
From worker 21: unknown function (ip: 0x7ff13a091115) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: pthread_cond_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
From worker 21: uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:822
From worker 21: jl_safepoint_wait_thread_resume at /cache/build/builder-amdci4-4/julialang/julia-master/src/safepoint.c:271
From worker 21: segv_handler at /cache/build/builder-amdci4-4/julialang/julia-master/src/signals-unix.c:395 [inlined]
From worker 21: segv_handler at /cache/build/builder-amdci4-4/julialang/julia-master/src/signals-unix.c:381
From worker 21: unknown function (ip: 0x7ff13a04251f) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: jl_gc_state_set at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia_threads.h:275 [inlined]
From worker 21: maybe_collect at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia_threads.h:268 [inlined]
From worker 21: jl_gc_small_alloc_inner at /cache/build/builder-amdci4-4/julialang/julia-master/src/gc-stock.c:737 [inlined]
From worker 21: jl_gc_small_alloc_noinline at /cache/build/builder-amdci4-4/julialang/julia-master/src/gc-stock.c:795 [inlined]
From worker 21: jl_gc_alloc_ at /cache/build/builder-amdci4-4/julialang/julia-master/src/gc-stock.c:809
From worker 21: unknown function (ip: (nil)) at (unknown file)
From worker 21: unknown function (ip: 0x7ff13a091115) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: pthread_cond_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
From worker 21: uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:822
From worker 21: ijl_task_get_next at /cache/build/builder-amdci4-4/julialang/julia-master/src/scheduler.c:520
From worker 21: poptask at ./task.jl:1158
From worker 21: wait at ./task.jl:1167
From worker 21: task_done_hook at ./task.jl:839
From worker 21: jfptr_task_done_hook_74488.1 at /opt/hostedtoolcache/julia/nightly/x64/lib/julia/sys.so (unknown line)
From worker 21: jl_apply at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia.h:2243 [inlined]
From worker 21: jl_finish_task at /cache/build/builder-amdci4-4/julialang/julia-master/src/task.c:338
From worker 21: start_task at /cache/build/builder-amdci4-4/julialang/julia-master/src/task.c:1274
From worker 21: unknown function (ip: (nil)) at (unknown file)
From worker 21: pthread_cond_destroy at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
From worker 21: __cxa_finalize at /lib/x86_64-linux-gnu/libc.so.6 (unknown line) # Running finalizers and atexit() handlers?
From worker 21: __do_global_dtors_aux at /opt/hostedtoolcache/julia/nightly/x64/bin/../lib/julia/libjulia-internal.so.1.12 (unknown line)
From worker 21: _fini at /opt/hostedtoolcache/julia/nightly/x64/bin/../lib/julia/libjulia-internal.so.1.12 (unknown line)
From worker 21: unknown function (ip: 0x7ff13a045494) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: exit at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
From worker 21: ijl_exit at /cache/build/builder-amdci4-4/julialang/julia-master/src/init.c:199
From worker 21: jlplt_ijl_exit_77448.1 at /opt/hostedtoolcache/julia/nightly/x64/lib/julia/sys.so (unknown line)
From worker 21: exit at ./initdefs.jl:28
From worker 21: exit at ./initdefs.jl:29 # exit() is called
From worker 21: jfptr_exit_77443.1 at /opt/hostedtoolcache/julia/nightly/x64/lib/julia/sys.so (unknown line)
From worker 21: jl_apply at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia.h:2243 [inlined]
From worker 21: jl_f__call_latest at /cache/build/builder-amdci4-4/julialang/julia-master/src/builtins.c:883
From worker 21: #invokelatest#1 at ./essentials.jl:1049 [inlined]
From worker 21: invokelatest at ./essentials.jl:1046
From worker 21: jfptr_invokelatest_62384.1 at /opt/hostedtoolcache/julia/nightly/x64/lib/julia/sys.so (unknown line)
From worker 21: jl_apply at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia.h:2243 [inlined]
From worker 21: do_apply at /cache/build/builder-amdci4-4/julialang/julia-master/src/builtins.c:839
From worker 21: #handle_msg##12 at /home/runner/work/DistributedNext.jl/DistributedNext.jl/src/process_messages.jl:312 # Worker gets call to `exit()` from the master
From worker 21: run_work_thunk at /home/runner/work/DistributedNext.jl/DistributedNext.jl/src/process_messages.jl:72
From worker 21: #handle_msg##10 at /home/runner/work/DistributedNext.jl/DistributedNext.jl/src/process_messages.jl:312
From worker 21: unknown function (ip: 0x7ff0fb7455bf) at (unknown file)
From worker 21: jl_apply at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia.h:2243 [inlined]
From worker 21: start_task at /cache/build/builder-amdci4-4/julialang/julia-master/src/task.c:1263
From worker 21: unknown function (ip: (nil)) at (unknown file)
From worker 21: Allocations: 9179557 (Pool: 9179436; Big: 121); GC: 8
```
I've only observed this on nightly, almost always on Ubuntu/OSX, almost never on Windows. A couple of times the workers have segfaulted somewhere in LLVM, but I don't have a backtrace for that.
It doesn't happen every time `rmprocs()` is called. The most reliable trigger is the `topology.jl` tests, though once or twice I've seen other tests failing.
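For anyone who wants to poke at this outside of CI, something like the loop below might be a starting point. It's only a rough sketch and not taken from the test suite: the worker count, iteration count, and `waitfor` value are arbitrary, it assumes DistributedNext exposes the same `addprocs`/`rmprocs` keywords as the Distributed stdlib, and I haven't confirmed that it actually triggers the hang.

```julia
# Hypothetical stress loop, not part of the DistributedNext.jl tests.
# Assumption: DistributedNext exports the same addprocs/rmprocs API as Distributed.
using DistributedNext

for attempt in 1:20
    # Start workers with 4 threads each, mirroring JULIA_NUM_THREADS=4 on CI.
    pids = addprocs(2; exeflags=`--threads=4`)

    # Run a trivial remote call so each worker has done some compilation.
    for pid in pids
        remotecall_fetch(Threads.nthreads, pid)
    end

    # Time the shutdown; a hang should show up as an elapsed time close to
    # `waitfor`, or as an error from rmprocs once the timeout is reached.
    elapsed = @elapsed rmprocs(pids; waitfor=60)
    @info "rmprocs finished" attempt elapsed
end
```

Running it on nightly and comparing against a release build would help narrow down whether this is specific to nightly.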