omus opened 3 years ago
The nested task error `Unable to determine the pod name from: ""` comes from `create_pod`. It shows that the external command produced no stdout (hence the empty string in the message) and no stderr (otherwise a different exception would have been raised). Note that we're using `ignorestatus`, so the process's exit code could be useful here. One theory is that since the `launch` call happens inside a task, output could be missed if Julia was busy with another task.
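Because `ignorestatus` suppresses the usual exception on a non-zero exit, the exit code has to be read explicitly. A minimal sketch of capturing stdout, stderr and the exit code together, assuming a hypothetical helper `run_capture` (not part of K8sClusterManagers) and that the underlying call can be expressed as a `Cmd`:

```julia
# Hypothetical helper (not part of K8sClusterManagers): run `cmd` without
# throwing on a non-zero exit status, capturing stdout, stderr and the exit code.
function run_capture(cmd::Cmd)
    out = IOBuffer()
    err = IOBuffer()
    # `ignorestatus` stops `run` from throwing a ProcessFailedException,
    # so the returned Process must be inspected for its exit code.
    proc = run(pipeline(ignorestatus(cmd); stdout=out, stderr=err))
    return (stdout = String(take!(out)),
            stderr = String(take!(err)),
            exitcode = proc.exitcode)
end

result = run_capture(`sh -c "echo out; echo err 1>&2; exit 3"`)
println(result.exitcode)
```

If the package's command call were wrapped this way, an empty-stdout error could report `exitcode` and `stderr` as well; how `create_pod` actually builds and runs its command is not shown in this thread.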
Additionally, 25 other error messages are not being shown, and they could help determine the root cause.
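The `sync_end` frame in the trace suggests the per-worker failures are gathered by `@sync`, which wraps each failed task in a `TaskFailedException` inside a `CompositeException`. Printing a `CompositeException` shows only its first entry followed by a count of the rest. A sketch, using a hypothetical `collect_task_errors` helper, of how every failure message could be listed instead:

```julia
# Hypothetical helper: run `f(i)` for i in 1:n as async tasks under `@sync`
# and return the message of every failure, not just the first one.
function collect_task_errors(f::Function, n::Integer)
    try
        @sync for i in 1:n
            @async f(i)
        end
    catch e
        e isa CompositeException || rethrow()
        # Each entry is a TaskFailedException; `.task.exception` is the
        # exception the failed task actually threw.
        return [sprint(showerror, ex.task.exception) for ex in e.exceptions]
    end
    return String[]
end

for msg in collect_task_errors(i -> iseven(i) ? error("worker $i failed") : nothing, 5)
    println(msg)
end
```

In the real failure the `CompositeException` is presumably nested one level deeper, inside the `TaskFailedException` thrown by `addprocs`, so it would first need to be unwrapped via that exception's `.task.exception`.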
I just ran into this too. I asked for 6 workers, and the failure seemed to happen on the 6th (I got 5 "worker is up" log messages before it failed, but no other log messages). Partial stack trace:
TaskFailedException
Stacktrace:
[1] wait
@ ./task.jl:334 [inlined]
[2] addprocs_locked(manager::K8sClusterManager; kwargs::Base.Pairs{Symbol, String, Tuple{Symbol}, NamedTuple{(:exeflags,), Tuple{String}}})
@ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:504
[3] addprocs(manager::K8sClusterManager; kwargs::Base.Pairs{Symbol, String, Tuple{Symbol}, NamedTuple{(:exeflags,), Tuple{String}}})
@ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:447
[truncated]
nested task error: TaskFailedException
nested task error: Unable to determine the pod name from: ""
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:33
[2] create_pod(manifest::DataStructures.DefaultOrderedDict{String, Any, typeof(K8sClusterManagers.rdict)})
@ K8sClusterManagers ~/.julia/packages/K8sClusterManagers/PIZ9P/src/pod.jl:66
[3] macro expansion
@ ~/.julia/packages/K8sClusterManagers/PIZ9P/src/native_driver.jl:103 [inlined]
[4] (::K8sClusterManagers.var"#17#18"{K8sClusterManager, Vector{WorkerConfig}, Condition})()
@ K8sClusterManagers ./task.jl:423
Stacktrace:
[1] sync_end(c::Channel{Any})
@ Base ./task.jl:381
[2] macro expansion
@ ./task.jl:400 [inlined]
[3] launch(manager::K8sClusterManager, params::Dict{Symbol, Any}, launched::Vector{WorkerConfig}, c::Condition)
@ K8sClusterManagers ~/.julia/packages/K8sClusterManagers/PIZ9P/src/native_driver.jl:101
[4] (::Distributed.var"#39#42"{K8sClusterManager, Condition, Vector{WorkerConfig}, Dict{Symbol, Any}})()
@ Distributed ./task.jl:423
On K8sClusterManagers v0.1.3.
@kolia reported this issue with K8sClusterManagers@0.1.2: