JuliaLang / Distributed.jl

Create and control multiple Julia processes remotely for distributed computing. Ships as a Julia stdlib.
https://docs.julialang.org/en/v1/stdlib/Distributed/
MIT License

rmprocs / addprocs racy #38

Open vtjnash opened 7 years ago

vtjnash commented 7 years ago

Node termination during node provisioning is not handled well: the new worker fails with `connect: connection refused (ECONNREFUSED)` in `connect_to_worker` when it tries to connect to the worker that is terminating.

For an example, see: https://travis-ci.org/JuliaLang/julia/jobs/186141590

julia> p = addprocs(2)

julia> begin # try this a couple times
         @spawnat p[1] sleep(5)
         @show rmprocs(p[1]; waitfor=0)
         @show workers()
         @show p = addprocs(1)
       end
rmprocs(p[1]; waitfor=0) = :ok
workers() = [3,4,5]
ERROR: connect: connection refused (ECONNREFUSED)
 in yieldto(::Task, ::ANY) at ./event.jl:153
 in wait() at ./event.jl:186
 in wait(::Condition) at ./event.jl:27
 in stream_wait(::TCPSocket, ::Condition, ::Vararg{Condition,N}) at ./stream.jl:42
 in wait_connected(::TCPSocket) at ./stream.jl:258
 in connect at ./stream.jl:957 [inlined]
 in connect_to_worker(::String, ::Int16) at ./managers.jl:490
 in connect_w2w(::Int64, ::WorkerConfig) at ./managers.jl:453
 in connect(::Base.DefaultClusterManager, ::Int64, ::WorkerConfig) at ./managers.jl:387
 in connect_to_peer(::Base.DefaultClusterManager, ::Int64, ::WorkerConfig) at ./multi.jl:1516
 in (::Base.##598#600{WorkerConfig,Int64})() at ./task.jl:404
Error [connect: connection refused (ECONNREFUSED)] on 6 while connecting to peer 4. Exiting.
Worker 6 terminated.
ERROR (unhandled task failure): Version read failed. Connection closed by peer.
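
One way to sidestep the overlap in the snippet above is to wait for the removal to finish before calling `addprocs` again. This is only a sketch of a caller-side workaround, not a fix for the underlying race (a worker that dies on its own during provisioning would still hit it). `replace_worker` is a hypothetical helper, not part of Distributed. It assumes a current Distributed version, where `rmprocs(...; waitfor=0)` returns the `Task` doing the removal instead of `:ok` as in the transcript above.

```julia
using Distributed

# Hypothetical helper (not part of Distributed): remove one worker, wait until
# the removal has completed, and only then provision a replacement.
function replace_worker(pid::Integer)
    # Assumption: on current versions this returns the Task performing the
    # removal; the older version in the transcript above returned :ok.
    t = rmprocs(pid; waitfor=0)
    t isa Task && wait(t)

    # Do not start provisioning while the old worker is still listed.
    while pid in procs()
        sleep(0.1)
    end

    return first(addprocs(1))
end

p = addprocs(2)
@spawnat p[1] sleep(5)        # keep the worker busy, as in the report
new_pid = replace_worker(p[1])
@show workers()
```

Passing a positive `waitfor` to `rmprocs` (or leaving it at its default) should have a similar effect, at the cost of blocking the caller inside `rmprocs` itself.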

Keno commented 4 years ago

Re-opened because the test that was added to fix this failed intermittently and has been disabled (#35677).