JuliaParallel / DistributedNext.jl

Bleeding-edge fork of Distributed.jl
http://juliaparallel.org/DistributedNext.jl/
MIT License

Removing workers times out on nightly #6

Open JamesWrigley opened 2 days ago

JamesWrigley commented 2 days ago

Making this to track an issue first seen in #4: some of the tests call rmprocs(), and since CI was changed to run with JULIA_NUM_THREADS=4 the workers can hang until rmprocs() times out and sends SIGQUIT.

Example backtrace:

Backtrace

```julia
From worker 21:
From worker 21: [2110] signal 3: Quit  # Timeout, rmprocs() sends SIGQUIT
From worker 21: in expression starting at none:1
From worker 21: unknown function (ip: 0x7ff13a091115) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: pthread_cond_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
From worker 21: uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:822
From worker 21: jl_parallel_gc_threadfun at /cache/build/builder-amdci4-4/julialang/julia-master/src/gc-stock.c:3550
From worker 21: unknown function (ip: 0x7ff13a094ac2) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: unknown function (ip: 0x7ff13a12684f) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: unknown function (ip: (nil)) at (unknown file)
From worker 21: unknown function (ip: 0x7ff13a091115) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: pthread_cond_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
From worker 21: uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:822
From worker 21: jl_parallel_gc_threadfun at /cache/build/builder-amdci4-4/julialang/julia-master/src/gc-stock.c:3550
From worker 21: unknown function (ip: 0x7ff13a094ac2) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: unknown function (ip: 0x7ff13a12684f) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: unknown function (ip: (nil)) at (unknown file)
From worker 21: unknown function (ip: 0x7ff13a091115) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: pthread_cond_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
From worker 21: uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:822
From worker 21: jl_parallel_gc_threadfun at /cache/build/builder-amdci4-4/julialang/julia-master/src/gc-stock.c:3550
From worker 21: unknown function (ip: 0x7ff13a094ac2) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: unknown function (ip: 0x7ff13a12684f) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: unknown function (ip: (nil)) at (unknown file)
From worker 21: unknown function (ip: 0x7ff13a091115) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: pthread_cond_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
From worker 21: wait at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia_locks.h:130 [inlined]
From worker 21: operator() at /cache/build/builder-amdci4-4/julialang/julia-master/src/engine.cpp:97 [inlined]
From worker 21: jl_engine_reserve at /cache/build/builder-amdci4-4/julialang/julia-master/src/engine.cpp:100
From worker 21: engine_reserve at ./compiler/types.jl:408 [inlined]
From worker 21: engine_reserve at ./compiler/types.jl:407 [inlined]
From worker 21: typeinf_ext at ./compiler/typeinfer.jl:1080
From worker 21: typeinf_ext_toplevel at ./compiler/typeinfer.jl:1176 [inlined]
From worker 21: typeinf_ext_toplevel at ./compiler/typeinfer.jl:1174  # Start compilation and get stuck in the GC
From worker 21: jfptr_typeinf_ext_toplevel_48134.1 at /opt/hostedtoolcache/julia/nightly/x64/lib/julia/sys.so (unknown line)
From worker 21: jl_apply at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia.h:2243 [inlined]
From worker 21: jl_type_infer at /cache/build/builder-amdci4-4/julialang/julia-master/src/gf.c:394
From worker 21: jl_compile_method_internal at /cache/build/builder-amdci4-4/julialang/julia-master/src/gf.c:2820
From worker 21: _jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-master/src/gf.c:3299 [inlined]
From worker 21: ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-master/src/gf.c:3495
From worker 21: show_exception_stack at ./errorshow.jl:1015  # Something in an errormonitor fails and we try to print the exception
From worker 21: display_error at ./client.jl:117
From worker 21: #errormonitor##0 at ./task.jl:734
From worker 21: jfptr_YY.errormonitorYY.YY.0_74460.1 at /opt/hostedtoolcache/julia/nightly/x64/lib/julia/sys.so (unknown line)
From worker 21: jl_apply at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia.h:2243 [inlined]
From worker 21: start_task at /cache/build/builder-amdci4-4/julialang/julia-master/src/task.c:1263  # Switches to one of the remaining tasks
From worker 21: unknown function (ip: (nil)) at (unknown file)
From worker 21: unknown function (ip: 0x7ff13a091115) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: pthread_cond_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
From worker 21: uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:822
From worker 21: jl_safepoint_wait_thread_resume at /cache/build/builder-amdci4-4/julialang/julia-master/src/safepoint.c:271
From worker 21: segv_handler at /cache/build/builder-amdci4-4/julialang/julia-master/src/signals-unix.c:395 [inlined]
From worker 21: segv_handler at /cache/build/builder-amdci4-4/julialang/julia-master/src/signals-unix.c:381
From worker 21: unknown function (ip: 0x7ff13a04251f) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: jl_gc_state_set at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia_threads.h:275 [inlined]
From worker 21: maybe_collect at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia_threads.h:268 [inlined]
From worker 21: jl_gc_small_alloc_inner at /cache/build/builder-amdci4-4/julialang/julia-master/src/gc-stock.c:737 [inlined]
From worker 21: jl_gc_small_alloc_noinline at /cache/build/builder-amdci4-4/julialang/julia-master/src/gc-stock.c:795 [inlined]
From worker 21: jl_gc_alloc_ at /cache/build/builder-amdci4-4/julialang/julia-master/src/gc-stock.c:809
From worker 21: unknown function (ip: (nil)) at (unknown file)
From worker 21: unknown function (ip: 0x7ff13a091115) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: pthread_cond_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
From worker 21: uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:822
From worker 21: ijl_task_get_next at /cache/build/builder-amdci4-4/julialang/julia-master/src/scheduler.c:520
From worker 21: poptask at ./task.jl:1158
From worker 21: wait at ./task.jl:1167
From worker 21: task_done_hook at ./task.jl:839
From worker 21: jfptr_task_done_hook_74488.1 at /opt/hostedtoolcache/julia/nightly/x64/lib/julia/sys.so (unknown line)
From worker 21: jl_apply at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia.h:2243 [inlined]
From worker 21: jl_finish_task at /cache/build/builder-amdci4-4/julialang/julia-master/src/task.c:338
From worker 21: start_task at /cache/build/builder-amdci4-4/julialang/julia-master/src/task.c:1274
From worker 21: unknown function (ip: (nil)) at (unknown file)
From worker 21: pthread_cond_destroy at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
From worker 21: __cxa_finalize at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)  # Running finalizers and atexit() handlers?
From worker 21: __do_global_dtors_aux at /opt/hostedtoolcache/julia/nightly/x64/bin/../lib/julia/libjulia-internal.so.1.12 (unknown line)
From worker 21: _fini at /opt/hostedtoolcache/julia/nightly/x64/bin/../lib/julia/libjulia-internal.so.1.12 (unknown line)
From worker 21: unknown function (ip: 0x7ff13a045494) at /lib/x86_64-linux-gnu/libc.so.6
From worker 21: exit at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
From worker 21: ijl_exit at /cache/build/builder-amdci4-4/julialang/julia-master/src/init.c:199
From worker 21: jlplt_ijl_exit_77448.1 at /opt/hostedtoolcache/julia/nightly/x64/lib/julia/sys.so (unknown line)
From worker 21: exit at ./initdefs.jl:28
From worker 21: exit at ./initdefs.jl:29  # exit() is called
From worker 21: jfptr_exit_77443.1 at /opt/hostedtoolcache/julia/nightly/x64/lib/julia/sys.so (unknown line)
From worker 21: jl_apply at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia.h:2243 [inlined]
From worker 21: jl_f__call_latest at /cache/build/builder-amdci4-4/julialang/julia-master/src/builtins.c:883
From worker 21: #invokelatest#1 at ./essentials.jl:1049 [inlined]
From worker 21: invokelatest at ./essentials.jl:1046
From worker 21: jfptr_invokelatest_62384.1 at /opt/hostedtoolcache/julia/nightly/x64/lib/julia/sys.so (unknown line)
From worker 21: jl_apply at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia.h:2243 [inlined]
From worker 21: do_apply at /cache/build/builder-amdci4-4/julialang/julia-master/src/builtins.c:839
From worker 21: #handle_msg##12 at /home/runner/work/DistributedNext.jl/DistributedNext.jl/src/process_messages.jl:312  # Worker gets call to `exit()` from the master
From worker 21: run_work_thunk at /home/runner/work/DistributedNext.jl/DistributedNext.jl/src/process_messages.jl:72
From worker 21: #handle_msg##10 at /home/runner/work/DistributedNext.jl/DistributedNext.jl/src/process_messages.jl:312
From worker 21: unknown function (ip: 0x7ff0fb7455bf) at (unknown file)
From worker 21: jl_apply at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia.h:2243 [inlined]
From worker 21: start_task at /cache/build/builder-amdci4-4/julialang/julia-master/src/task.c:1263
From worker 21: unknown function (ip: (nil)) at (unknown file)
From worker 21: Allocations: 9179557 (Pool: 9179436; Big: 121); GC: 8
```

I've only observed this on nightly, almost always on Ubuntu/OSX, almost never on Windows. A couple of times the workers have segfaulted somewhere in LLVM, but I don't have a backtrace for that.

It doesn't happen every time rmprocs() is called. The most reliable trigger is the topology.jl tests, though once or twice I've seen other tests failing.
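
For anyone trying to reproduce this locally, the general pattern is roughly the sketch below. It is not taken from the test suite and is not known to trigger the hang reliably: the worker count, the `exeflags` value and the `waitfor` timeout are assumptions chosen for illustration, and it only mirrors the "multithreaded workers, then rmprocs()" sequence described above.

```julia
# Hypothetical reproduction sketch (assumed values, not the actual tests).
# Run on nightly, ideally with JULIA_NUM_THREADS=4 set for the master as in CI.
using DistributedNext

# Start a few local workers with 4 threads each (assumption: 4 workers).
pids = addprocs(4; exeflags="--threads=4")

# Touch each worker once so it is fully connected before being removed.
for pid in pids
    remotecall_fetch(Threads.nthreads, pid)
end

# Ask the workers to exit. An exception is thrown if they have not
# terminated within `waitfor` seconds (60 is an arbitrary choice here).
rmprocs(pids; waitfor=60)
```

Since the failure is intermittent, running this in a loop is probably necessary to see anything.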