beacon-biosignals / K8sClusterManagers.jl

A Julia cluster manager for Kubernetes

Unable to determine pod name #77

Open omus opened 3 years ago

omus commented 3 years ago

@kolia reported this issue with K8sClusterManagers@0.1.2:

julia> addprocs(K8sClusterManager(n_workers; pending_timeout=180, memory="1Gi"))
[ Info: driver-2021-05-18--20-31-35-wgssh-worker-z4jjs is up
[ Info: driver-2021-05-18--20-31-35-wgssh-worker-fvt5b is up
[ Info: driver-2021-05-18--20-31-35-wgssh-worker-jt2dv is up
[ Info: driver-2021-05-18--20-31-35-wgssh-worker-gwtsp is up
[ Info: driver-2021-05-18--20-31-35-wgssh-worker-pv5dw is up
[ Info: driver-2021-05-18--20-31-35-wgssh-worker-dlzxx is up
ERROR: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:322 [inlined]
 [2] addprocs_locked(manager::K8sClusterManager; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:497
 [3] addprocs_locked
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:451 [inlined]
 [4] addprocs(manager::K8sClusterManager; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:444
 [5] addprocs(manager::K8sClusterManager)
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:438
 [6] top-level scope
   @ REPL[16]:1
    nested task error: TaskFailedException
        nested task error: Unable to determine the pod name from: ""
        Stacktrace:
         [1] error(s::String)
           @ Base ./error.jl:33
         [2] create_pod(manifest::DataStructures.DefaultOrderedDict{String, Any, typeof(K8sClusterManagers.rdict)})
           @ K8sClusterManagers ~/.julia/packages/K8sClusterManagers/vRyNt/src/pod.jl:68
         [3] macro expansion
           @ ~/.julia/packages/K8sClusterManagers/vRyNt/src/native_driver.jl:107 [inlined]
         [4] (::K8sClusterManagers.var"#29#31"{K8sClusterManager, Vector{WorkerConfig}, Condition})()
           @ K8sClusterManagers ./task.jl:411
    ...and 25 more exceptions.
    Stacktrace:
     [1] sync_end(c::Channel{Any})
       @ Base ./task.jl:369
     [2] macro expansion
       @ ./task.jl:388 [inlined]
     [3] launch(manager::K8sClusterManager, params::Dict{Symbol, Any}, launched::Vector{WorkerConfig}, c::Condition)
       @ K8sClusterManagers ~/.julia/packages/K8sClusterManagers/vRyNt/src/native_driver.jl:105
     [4] (::Distributed.var"#39#42"{K8sClusterManager, Condition, Vector{WorkerConfig}, Dict{Symbol, Any}})()
       @ Distributed ./task.jl:411
omus commented 3 years ago

The nested task error, Unable to determine the pod name from: "", comes from create_pod. It shows that the external command call produced no stdout (hence the empty string in the message) and no stderr (otherwise a different exception would have been raised). I'll note we're using ignorestatus, so a non-zero return code would not raise an error; checking the return code could be useful here. One theory I have is that, since the launch call happens inside a task, output could be missed if Julia was busy with another task.
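
To make the return-code idea concrete, here is a minimal sketch of capturing stdout, stderr, and the exit code together. It is not the package's actual code: run_capture is a hypothetical helper, and the sh command is only a stand-in for whatever external command create_pod runs.

```julia
# Hypothetical helper (not part of K8sClusterManagers): run a command and
# return its stdout, stderr, and exit code without throwing on failure.
function run_capture(cmd::Cmd)
    out = IOBuffer()
    err = IOBuffer()
    # `ignorestatus` stops `run` from throwing on a non-zero exit code;
    # `pipeline` redirects both output streams into in-memory buffers.
    proc = run(pipeline(ignorestatus(cmd); stdout=out, stderr=err))
    return (stdout=String(take!(out)), stderr=String(take!(err)), exitcode=proc.exitcode)
end

# Stand-in command that prints nothing and exits with a non-zero status.
result = run_capture(`sh -c "exit 3"`)
if isempty(result.stdout)
    # Report the exit code and stderr instead of only the empty stdout.
    @warn "command produced no stdout" result.exitcode result.stderr
end
```

With something like this, the error message could include the exit code, which would help tell a silently failing command apart from output that was lost.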

Additionally, the trace ends with "...and 25 more exceptions", so there are another 25 errors we're not seeing which could be useful for determining the root cause.
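
One way to see the hidden exceptions is to catch the error and walk the nested wrappers by hand. The sketch below assumes the failure reaches the caller as a CompositeException of TaskFailedExceptions, as @sync produces; root_exceptions is a hypothetical helper, and the three throwing tasks are a stand-in for the failing launch.

```julia
# Hypothetical helper: collect the innermost exceptions hidden inside nested
# CompositeException / TaskFailedException wrappers.
function root_exceptions(ex, acc=Any[])
    if ex isa CompositeException
        # `@sync` wraps every failure from its child tasks in one of these.
        foreach(inner -> root_exceptions(inner, acc), ex.exceptions)
    elseif ex isa TaskFailedException
        # Assumption: a failed task keeps the thrown exception in its
        # `result` field (an internal detail of `Task`).
        root_exceptions(ex.task.result, acc)
    else
        push!(acc, ex)
    end
    return acc
end

# Stand-in for the failing launch: three tasks that each throw.
errors = try
    @sync for i in 1:3
        @async error("worker $i failed")
    end
    Any[]
catch ex
    root_exceptions(ex)
end

# Print every underlying error message, not just the first one.
foreach(e -> println(sprint(showerror, e)), errors)
```

Wrapping the addprocs call in a similar try/catch should list all of the underlying errors rather than only the first.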

ericphanson commented 2 years ago

I just ran into this too; I asked for 6 workers, and it seemed to happen on the 6th (since I got 5 "worker is up" log messages before it failed; no other log messages though). Partial stacktrace:

TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:334 [inlined]
 [2] addprocs_locked(manager::K8sClusterManager; kwargs::Base.Pairs{Symbol, String, Tuple{Symbol}, NamedTuple{(:exeflags,), Tuple{String}}})
   @ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:504
 [3] addprocs(manager::K8sClusterManager; kwargs::Base.Pairs{Symbol, String, Tuple{Symbol}, NamedTuple{(:exeflags,), Tuple{String}}})
   @ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:447
[truncated]
    nested task error: TaskFailedException

        nested task error: Unable to determine the pod name from: ""
        Stacktrace:
         [1] error(s::String)
           @ Base ./error.jl:33
         [2] create_pod(manifest::DataStructures.DefaultOrderedDict{String, Any, typeof(K8sClusterManagers.rdict)})
           @ K8sClusterManagers ~/.julia/packages/K8sClusterManagers/PIZ9P/src/pod.jl:66
         [3] macro expansion
           @ ~/.julia/packages/K8sClusterManagers/PIZ9P/src/native_driver.jl:103 [inlined]
         [4] (::K8sClusterManagers.var"#17#18"{K8sClusterManager, Vector{WorkerConfig}, Condition})()
           @ K8sClusterManagers ./task.jl:423
    Stacktrace:
     [1] sync_end(c::Channel{Any})
       @ Base ./task.jl:381
     [2] macro expansion
       @ ./task.jl:400 [inlined]
     [3] launch(manager::K8sClusterManager, params::Dict{Symbol, Any}, launched::Vector{WorkerConfig}, c::Condition)
       @ K8sClusterManagers ~/.julia/packages/K8sClusterManagers/PIZ9P/src/native_driver.jl:101
     [4] (::Distributed.var"#39#42"{K8sClusterManager, Condition, Vector{WorkerConfig}, Dict{Symbol, Any}})()
       @ Distributed ./task.jl:423

On K8sClusterManagers v0.1.3.