`IOError: connect: connection refused (ECONNREFUSED)`

omus commented 3 years ago

```
[ Info: test-multi-addprocs-x7wr9-worker-c55hc is up
[ Info: test-multi-addprocs-x7wr9-worker-sldrh is up
ERROR: TaskFailedException

    nested task error: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:1082
     [2] worker_from_id
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:1079 [inlined]
     [3] #remote_do#154
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:486 [inlined]
     [4] remote_do
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:486 [inlined]
     [5] kill
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:675 [inlined]
     [6] create_worker(manager::K8sClusterManager, wconfig::WorkerConfig)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:593
     [7] setup_launched_worker(manager::K8sClusterManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:534
     [8] (::Distributed.var"#41#44"{K8sClusterManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:406

    caused by: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] wait_connected(x::Sockets.TCPSocket)
       @ Sockets /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Sockets/src/Sockets.jl:532
     [2] connect
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Sockets/src/Sockets.jl:567 [inlined]
     [3] connect_to_worker(host::String, port::Int64)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:639
     [4] connect(manager::K8sClusterManager, pid::Int64, config::WorkerConfig)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:566
     [5] create_worker(manager::K8sClusterManager, wconfig::WorkerConfig)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:589
     [6] setup_launched_worker(manager::K8sClusterManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:534
     [7] (::Distributed.var"#41#44"{K8sClusterManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:406
Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:364
 [2] macro expansion
   @ ./task.jl:383 [inlined]
 [3] addprocs_locked(manager::K8sClusterManager; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:480
 [4] addprocs_locked
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:451 [inlined]
 [5] addprocs(manager::K8sClusterManager; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:444
 [6] addprocs(manager::K8sClusterManager)
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:438
 [7] top-level scope
   @ none:11
```

I originally noticed this error as part of another PR (https://github.com/beacon-biosignals/K8sClusterManagers.jl/pull/44#issuecomment-826976309) but have also observed this failure when removing this sleep call.

I believe what is happening is that we wait for the pod to be running but then attempt to connect to it before the worker has actually started listening. Since the original failure shown above occurred before this `sleep(2)` call was removed, there is probably some variability in how long Julia takes to start listening.
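To illustrate the race, here is a minimal sketch of replacing a fixed sleep with a retry loop that polls the worker's port until something accepts the connection. The helper name `wait_until_listening` and its `timeout`/`interval` keywords are hypothetical and not part of K8sClusterManagers.jl or Distributed; this only shows the general idea, not how the package handles it.

```julia
using Sockets

# Hypothetical helper (an assumption for illustration, not package API).
# Repeatedly tries to open a TCP connection to `host:port`. Returns `true` as
# soon as a connection is accepted, or `false` once `timeout` seconds elapse.
function wait_until_listening(host, port::Integer; timeout::Real=30, interval::Real=0.5)
    deadline = time() + timeout
    while true
        try
            sock = connect(host, port)
            close(sock)
            return true
        catch e
            # Retry only on the ECONNREFUSED case seen in the stack trace;
            # any other error is unexpected and is propagated.
            (e isa Base.IOError && e.code == Base.UV_ECONNREFUSED) || rethrow()
        end
        time() >= deadline && return false
        sleep(interval)
    end
end
```

A successful probe only shows that the port is open at that moment, so the real connection attempt can still fail if the worker exits in between. Polling also still requires knowing the port in advance.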

omus commented 3 years ago

One possible solution to this problem would be to use `kubectl logs` and wait for the `julia_worker:<port>#<ip>` message. A port would then no longer need to be specified, as the manager could just read it from the worker logs.
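A rough sketch of the log-scanning idea, assuming the worker prints a line of the form `julia_worker:<port>#<ip>` as described above. The function name `find_worker_address` is hypothetical, and the demo reads from an in-memory buffer rather than a real pod's logs.

```julia
# Hypothetical helper (an assumption for illustration, not package API).
# Reads `io` line by line and returns the address from the first line that
# contains a `julia_worker:<port>#<ip>` announcement, or `nothing` if the
# stream ends without one.
function find_worker_address(io::IO)
    for line in eachline(io)
        m = match(r"julia_worker:(\d+)#(\S+)", line)
        m === nothing && continue
        return (host=String(m[2]), port=parse(Int, m[1]))
    end
    return nothing
end

# Demo on canned log output; the address below is made up.
logs = IOBuffer("some startup noise\njulia_worker:9009#10.0.0.5\n")
find_worker_address(logs)  # returns (host = "10.0.0.5", port = 9009)
```

With a streaming source (for example the output of `kubectl logs --follow` for the worker pod), `eachline` would block until the announcement appears, so the manager would only attempt to connect once the worker has reported that it is listening.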

omus commented 3 years ago

Another CI example: https://github.com/beacon-biosignals/K8sClusterManagers.jl/runs/2442606710

omus commented 3 years ago

Another one: https://github.com/beacon-biosignals/K8sClusterManagers.jl/runs/2442682308. I'm increasing the wait from 2 seconds to 4 seconds in #47 to try and work around the problem for now.

omus commented 3 years ago

Another viable option would be to use: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-readiness-gate

omus commented 3 years ago

Should be fixed by #57 as the cluster manager will now wait for the output from the worker before trying to connect.

beacon-biosignals / K8sClusterManagers.jl

`IOError: connect: connection refused (ECONNREFUSED)` #46