fredrikekre opened 3 years ago
On julia-28:

```
(rr) p jl_(v)
Distributed.Worker(id=26, del_msgs=Array{Any, (0,)}[], add_msgs=Array{Any, (0,)}[], gcflag=false, state=Distributed.WorkerState(0x00000000), c_state=Base.GenericCondition{Base.AlwaysLockedST}(waitq=Base.InvasiveLinkedList{Task}(head=nothing, tail=nothing), lock=Base.AlwaysLockedST(ownertid=1)), ct_time=1.61292e+09, conn_func=nothing, r_stream=#<null>, w_stream=#<null>, w_serializer=#<null>, manager=#<null>, config=#<null>, version=#<null>, initialized=Base.Event(notify=Base.GenericCondition{Base.ReentrantLock}(waitq=Base.InvasiveLinkedList{Task}(head=nothing, tail=nothing), lock=Base.ReentrantLock(locked_by=nothing, cond_wait=Base.GenericCondition{Base.Threads.SpinLock}(waitq=Base.InvasiveLinkedList{Task}(head=nothing, tail=nothing), lock=Base.Threads.SpinLock(owned=0)), reentrancy_cnt=0)), set=false))
```
My reading of the code is that this state occurs when the asynchronous startup messages between workers don't arrive in a precisely synchronized order. There's a `sleep(1)` in the `addprocs` code meant to paper over that, but obviously it doesn't always work.
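To illustrate why a fixed sleep is fragile here, below is a minimal sketch (hypothetical names; not the actual `Distributed.jl` code) contrasting a sleep-based wait with blocking on a condition variable until the worker's state actually changes, mirroring the `state`/`c_state` fields visible in the dump above:

```julia
# Hypothetical stand-in for Distributed.Worker's state/c_state pair.
mutable struct WorkerStub
    state::Symbol                 # :starting or :connected
    c_state::Threads.Condition    # notified whenever `state` changes
    WorkerStub() = new(:starting, Threads.Condition())
end

# Racy: assumes the startup handshake always finishes within 1 second.
function wait_sleepy(w::WorkerStub)
    sleep(1)
    w.state === :connected || error("worker not connected (race lost)")
end

# Robust: block on the condition until the state transition happens,
# no matter how long the handshake takes.
function wait_connected(w::WorkerStub)
    lock(w.c_state) do
        while w.state !== :connected
            wait(w.c_state)
        end
    end
end

# Called by the task that completes the handshake.
function set_connected!(w::WorkerStub)
    lock(w.c_state) do
        w.state = :connected
        notify(w.c_state)
    end
end
```

The condition-based wait never loses the race: if `set_connected!` runs first, the `while` loop exits immediately; if it runs later, `wait` simply blocks until `notify` fires.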
From https://github.com/JuliaLang/julia/pull/39591:
tester_linux64:
- Log: https://build.julialang.org/#/builders/71/builds/566/steps/5/logs/stdio
- Uploaded rr trace: `rr-run_566-gitsha_2da8c98287-2021-02-10_01_27_46.tar.zst`