JuliaLang / Distributed.jl

Create and control multiple Julia processes remotely for distributed computing. Ships as a Julia stdlib.
https://docs.julialang.org/en/v1/stdlib/Distributed/
MIT License

access to undefined reference in Distributed tests #16

Open fredrikekre opened 3 years ago

fredrikekre commented 3 years ago

From https://github.com/JuliaLang/julia/pull/39591:

tester_linux64:

ERROR: LoadError: On worker 23:
UndefRefError: access to undefined reference
Stacktrace:
 [1] getproperty
   @ ./Base.jl:33
 [2] #447
   @ /buildworker/worker/tester_linux64/build/share/julia/stdlib/v1.7/Distributed/test/distributed_exec.jl:1607
 [3] #106
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278
 [4] run_work_thunk
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:63
 [5] macro expansion
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278 [inlined]
 [6] #105
   @ ./task.jl:406
Stacktrace:
 [1] (::Base.var"#876#878")(x::Task)
   @ Base ./asyncmap.jl:177
 [2] foreach(f::Base.var"#876#878", itr::Vector{Any})
   @ Base ./abstractarray.jl:2146
 [3] maptwice(wrapped_f::Function, chnl::Channel{Any}, worker_tasks::Vector{Any}, c::Vector{Int64})
   @ Base ./asyncmap.jl:177
 [4] wrap_n_exec_twice
   @ ./asyncmap.jl:153 [inlined]
 [5] async_usemap(f::var"#446#448", c::Vector{Int64}; ntasks::Int64, batch_size::Nothing)
   @ Base ./asyncmap.jl:103
 [6] #asyncmap#860
   @ ./asyncmap.jl:81 [inlined]
 [7] asyncmap
   @ ./asyncmap.jl:81 [inlined]
 [8] reuseport_tests()
   @ Main /buildworker/worker/tester_linux64/build/share/julia/stdlib/v1.7/Distributed/test/distributed_exec.jl:1601
 [9] top-level scope
   @ /buildworker/worker/tester_linux64/build/share/julia/stdlib/v1.7/Distributed/test/distributed_exec.jl:1637
in expression starting at /buildworker/worker/tester_linux64/build/share/julia/stdlib/v1.7/Distributed/test/distributed_exec.jl:1636

Log: https://build.julialang.org/#/builders/71/builds/566/steps/5/logs/stdio (uploaded rr trace: rr-run_566-gitsha_2da8c98287-2021-02-10_01_27_46.tar.zst)

vtjnash commented 3 years ago

on julia-28: (rr) p jl_(v)

Distributed.Worker(id=26, del_msgs=Array{Any, (0,)}[], add_msgs=Array{Any, (0,)}[], gcflag=false, state=Distributed.WorkerState(0x00000000), c_state=Base.GenericCondition{Base.AlwaysLockedST}(waitq=Base.InvasiveLinkedList{Task}(head=nothing, tail=nothing), lock=Base.AlwaysLockedST(ownertid=1)), ct_time=1.61292e+09, conn_func=nothing, r_stream=#<null>, w_stream=#<null>, w_serializer=#<null>, manager=#<null>, config=#<null>, version=#<null>, initialized=Base.Event(notify=Base.GenericCondition{Base.ReentrantLock}(waitq=Base.InvasiveLinkedList{Task}(head=nothing, tail=nothing), lock=Base.ReentrantLock(locked_by=nothing, cond_wait=Base.GenericCondition{Base.Threads.SpinLock}(waitq=Base.InvasiveLinkedList{Task}(head=nothing, tail=nothing), lock=Base.Threads.SpinLock(owned=0)), reentrancy_cnt=0)), set=false))

My reading of the code is that this state occurs when the asynchronous startup messages between workers don't happen in a precisely synchronized order. There's a sleep(1) in the addprocs code to help prevent that, but obviously that doesn't always work very well.
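
For illustration only, here is a minimal sketch of the failure class described above. It is not the Distributed code and not a proposed fix. `ToyWorker` is a hypothetical stand-in for `Distributed.Worker`, with one field left undefined at construction (like the `#<null>` fields in the dump) and an `initialized` event. Reading the field before setup completes throws `UndefRefError`. Waiting on the event, instead of sleeping for a fixed interval, avoids the race in this toy case.

```julia
# Hypothetical stand-in for Distributed.Worker; names and layout are assumptions.
mutable struct ToyWorker
    id::Int
    w_stream::IO              # left undefined until the "connection" is set up
    initialized::Base.Event   # signalled once w_stream has been assigned
    function ToyWorker(id::Int)
        w = new(id)           # incomplete initialization: w_stream is undefined
        w.initialized = Base.Event()
        return w
    end
end

w = ToyWorker(26)

# Reading the field too early raises the same error class as in the trace.
err = try
    w.w_stream
catch e
    e
end
@assert err isa UndefRefError

# Simulated asynchronous setup that completes at some later, unknown time.
@async begin
    sleep(0.1)
    w.w_stream = IOBuffer()
    notify(w.initialized)
end

# Block until setup has actually finished, rather than guessing with a sleep.
wait(w.initialized)
@assert isdefined(w, :w_stream)
```

Whether the real startup path can wait on `initialized` at the point where the test fails is a separate question that this sketch does not answer.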