JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Deadlock during Julia image generation #54200

Open vchuravy opened 5 months ago

vchuravy commented 5 months ago

I recently observed a deadlock that seems to occur when we attempt to JIT-compile a function during the emission of Julia code.

LLVM.jl installs an error handler that roughly looks like this:

function handle_error(reason::Cstring)
    throw(LLVMException(unsafe_string(reason)))
end

function _install_handlers()
    handler = @cfunction(handle_error, Cvoid, (Cstring,))
    ccall((:LLVMInstallFatalErrorHandler, libllvm), Cvoid, (Ptr{Cvoid},), handler)
end

Using the profiler to get a backtrace:

cmd: /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/julia 18641 running 2 of 2

signal (10): User defined signal 1
unknown function (ip: 0x7c1c1496f10e)
pthread_mutex_lock at /usr/lib/libc.so.6 (unknown line)
__gthread_mutex_lock at /usr/local/x86_64-linux-gnu/include/c++/9.1.0/x86_64-linux-gnu/bits/gthr-default.h:749 [inlined]
__gthread_recursive_mutex_lock at /usr/local/x86_64-linux-gnu/include/c++/9.1.0/x86_64-linux-gnu/bits/gthr-default.h:811 [inlined]
lock at /usr/local/x86_64-linux-gnu/include/c++/9.1.0/mutex:106 [inlined]
lock at /usr/local/x86_64-linux-gnu/include/c++/9.1.0/bits/unique_lock.h:141 [inlined]
unique_lock at /usr/local/x86_64-linux-gnu/include/c++/9.1.0/bits/unique_lock.h:71 [inlined]
Lock at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/usr/include/llvm/ExecutionEngine/Orc/ThreadSafeModule.h:42 [inlined]
getLock at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/usr/include/llvm/ExecutionEngine/Orc/ThreadSafeModule.h:69
jl_codegen_params_t at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/jitlayers.h:258 [inlined]
_jl_compile_codeinst at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/jitlayers.cpp:213
jl_generate_fptr_impl at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/jitlayers.cpp:528
jl_compile_method_internal at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/gf.c:2534 [inlined]
jl_compile_method_internal at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/gf.c:2421
_jl_invoke at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/gf.c:2938 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/gf.c:3123
handle_error at /home/vchuravy/.julia/packages/LLVM/bzSzE/src/core/context.jl:168
jfptr_handle_error_5213 at /home/vchuravy/.julia/compiled/v1.11/LLVM/e8NBy_INkA2.so (unknown line)
jlcapi_handle_error_5773 at /home/vchuravy/.julia/compiled/v1.11/LLVM/e8NBy_INkA2.so (unknown line)
_ZN4llvm18report_fatal_errorERKNS_5TwineEb at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm16SelectionDAGISel15CannotYetSelectEPNS_6SDNodeE at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm16SelectionDAGISel16SelectCodeCommonEPNS_6SDNodeEPKhj at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN12_GLOBAL__N_115X86DAGToDAGISel6SelectEPN4llvm6SDNodeE at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm16SelectionDAGISel22DoInstructionSelectionEv at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm16SelectionDAGISel17CodeGenAndEmitDAGEv at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm16SelectionDAGISel20SelectAllBasicBlocksERKNS_8FunctionE at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm16SelectionDAGISel20runOnMachineFunctionERNS_15MachineFunctionE.part.0 at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN12_GLOBAL__N_115X86DAGToDAGISel20runOnMachineFunctionERN4llvm15MachineFunctionE at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm19MachineFunctionPass13runOnFunctionERNS_8FunctionE.part.0 at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm13FPPassManager13runOnFunctionERNS_8FunctionE at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm13FPPassManager11runOnModuleERNS_6ModuleE at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm6legacy15PassManagerImpl3runERNS_6ModuleE at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
add_output_impl at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/aotcompile.cpp:1171
operator() at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/aotcompile.cpp:1477
operator() at /usr/local/x86_64-linux-gnu/include/c++/9.1.0/bits/std_function.h:690 [inlined]
lambda_trampoline at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/aotcompile.cpp:1347
unknown function (ip: 0x7c1c14972559)
unknown function (ip: 0x7c1c149efa3b)
unknown function (ip: (nil))
unknown function (ip: 0x7c1c1496eebc)
unknown function (ip: 0x7c1c149740e2)
uv_thread_join at /workspace/srcdir/libuv/src/unix/thread.c:294
add_output<jl_dump_native_impl(void*, char const*, char const*, char const*, char const*, ios_t*, ios_t*, jl_emission_params_t*)::<lambda(llvm::Module&)> > at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/aotcompile.cpp:1485
operator()<jl_dump_native_impl(void*, char const*, char const*, char const*, char const*, ios_t*, ios_t*, jl_emission_params_t*)::<lambda(llvm::Module&)> > at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/aotcompile.cpp:1645 [inlined]
jl_dump_native_impl at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/aotcompile.cpp:1790
ijl_write_compiler_output at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/precompile.c:168
ijl_atexit_hook at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/init.c:285
jl_repl_entrypoint at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/jlapi.c:1060
main at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/cli/loader_exe.c:58
unknown function (ip: 0x7c1c1490cccf)
__libc_start_main at /usr/lib/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
unknown function (ip: (nil))

My hypothesis is that the two locks involved are:

Lock at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/usr/include/llvm/ExecutionEngine/Orc/ThreadSafeModule.h:42 

and https://github.com/JuliaLang/julia/blob/08e1fc0abb959ce5bd4c75b05518a41b85e4aba1/src/aotcompile.cpp#L1785-L1786

and that we end up re-using the same context, and therefore the same lock, in both places.
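To make the suspected pattern concrete, here is a minimal sketch, assuming the hypothesis above is right: one thread holds a recursive mutex and joins a worker, and the worker tries to take the same mutex. `context_lock` and `worker_can_lock_while_parent_holds` are hypothetical names, not anything from the Julia sources. The sketch uses `try_lock` so it terminates; with a blocking `lock()` the worker would wait forever, since a recursive mutex only permits re-locking from the thread that already owns it.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the lock that is shared through the reused context.
static std::recursive_mutex context_lock;

// Returns true if a worker thread could take the lock while the calling
// thread holds it and waits for the worker to finish.
bool worker_can_lock_while_parent_holds() {
    bool acquired = false;
    // The parent takes the lock first, as the emission code is assumed to do.
    std::lock_guard<std::recursive_mutex> held(context_lock);
    std::thread worker([&acquired] {
        // A blocking lock() here would never return: the parent releases only
        // after join(), and join() returns only after this thread exits.
        acquired = context_lock.try_lock();
        if (acquired) {
            context_lock.unlock();
        }
    });
    worker.join();  // plays the role of uv_thread_join in the backtrace
    return acquired;
}
```

Calling `worker_can_lock_while_parent_holds()` returns `false`, which is the non-blocking equivalent of the hang seen in `pthread_mutex_lock`.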

@pchintalapudi any thoughts?

gbaraldi commented 5 months ago

The thing I'm a bit puzzled about is that we sometimes use getContext and sometimes acquireContext, and acquireContext seems more correct?

pchintalapudi commented 5 months ago

getContext will automatically return the context to the pool of contexts when the object it returns is destroyed, while acquireContext should be paired with a releaseContext.

Also, I think it's wrong to trigger additional compilation from within ORC itself; I'm pretty sure there are some assumptions made about not touching the runtime within the addModule/lookup calls, for thread-safety purposes.
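The difference between the getContext style and the acquireContext/releaseContext style can be sketched roughly as below. This is only an illustration of the two ownership patterns, not Julia's actual implementation: `Context`, `ContextPool`, `Handle`, `acquire`, `release` and `idle` are all hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct Context { int id; };  // hypothetical stand-in for an LLVM context

class ContextPool {
    std::vector<std::unique_ptr<Context>> free_;
    int next_id_ = 0;

public:
    // Manual style: every acquire() must be paired with a release().
    std::unique_ptr<Context> acquire() {
        if (free_.empty()) {
            return std::unique_ptr<Context>(new Context{next_id_++});
        }
        std::unique_ptr<Context> ctx = std::move(free_.back());
        free_.pop_back();
        return ctx;
    }

    void release(std::unique_ptr<Context> ctx) {
        assert(ctx != nullptr);
        free_.push_back(std::move(ctx));
    }

    // Number of contexts currently sitting idle in the pool.
    std::size_t idle() const { return free_.size(); }

    // Scoped style: the handle hands its context back in the destructor.
    class Handle {
        ContextPool &pool_;
        std::unique_ptr<Context> ctx_;

    public:
        explicit Handle(ContextPool &pool) : pool_(pool), ctx_(pool.acquire()) {}
        ~Handle() { pool_.release(std::move(ctx_)); }
        Handle(const Handle &) = delete;
        Handle &operator=(const Handle &) = delete;
        Context &operator*() { return *ctx_; }
    };
};
```

With the scoped form, `{ ContextPool::Handle h(pool); }` leaves the context back in the pool when the block ends, whereas a forgotten `release` in the manual form keeps the context out of the pool indefinitely.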