JuliaParallel / ClusterManagers.jl

Error during job creation, leaves stale jobs #114

Open jishnub opened 5 years ago

jishnub commented 5 years ago

I am encountering this error when jobs time out:

julia> addprocs_slurm(100);
srun: job 1218546 queued and waiting for resources
Error launching Slurm job:
ERROR: UndefVarError: warn not defined
Stacktrace:
 [1] wait(::Task) at ./task.jl:191
 [2] #addprocs_locked#44(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::SlurmManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:418
 [3] addprocs_locked at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:372 [inlined]
 [4] #addprocs#43(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::SlurmManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:365
 [5] #addprocs_slurm#15 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:359 [inlined]
 [6] addprocs_slurm(::Int64) at /home/jb6888/.julia/packages/ClusterManagers/7pPEP/src/slurm.jl:85
 [7] top-level scope at none:0
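
The UndefVarError itself appears to come from a leftover call to the old warn function, which was removed in Julia 1.0 in favour of the @warn logging macro, e.g.:

julia> warn("Error launching Slurm job:")
ERROR: UndefVarError: warn not defined

julia> @warn "Error launching Slurm job:"
┌ Warning: Error launching Slurm job:
└ @ Main REPL[2]:1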

The issue seems to be with @async_launch in cluster.jl. However, even after the error, the job is left pending in the queue and may be allocated resources later:

squeue -u jb6888
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
1218546   par_std julia-14  jb6888 PD       0:00      4 (Priority)

Shouldn't an error during launch also remove the job from the queue? Or is it still there because the warn error prevents the subsequent clean-up from taking place?
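
In the meantime the queued allocation has to be removed by hand (scancel 1218546 for the job above), or the call can be wrapped so that a failed launch cancels whatever was already submitted. A minimal sketch, assuming this user's pending jobs can safely be dropped (the wrapper name is illustrative, not part of ClusterManagers):

using Distributed, ClusterManagers

function addprocs_slurm_or_cancel(n; kwargs...)
    try
        return addprocs_slurm(n; kwargs...)
    catch
        # Cancel jobs that were already submitted but are still queued,
        # so they do not grab resources later. Tighten the filter if
        # other pending jobs should survive.
        run(`scancel --user=$(ENV["USER"]) --state=PENDING`)
        rethrow()
    end
end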

vchuravy commented 5 years ago

Cleanup is normally performed when a process shuts down on the compute node, so you are right: we could and should do a better job with error handling here.
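
A rough sketch of what that could look like, assuming the launch code has access to the submitted job ID (the function name and the jobid argument are illustrative, not the actual ClusterManagers internals):

function launch_with_cleanup(do_launch, jobid)
    try
        do_launch()
    catch err
        # `warn` no longer exists on Julia >= 1.0; logging through @warn
        # alone would fix the UndefVarError in the report.
        @warn "Error launching Slurm job" exception=(err, catch_backtrace())
        # Cancel the allocation that was already submitted so it does not
        # stay pending in the queue.
        jobid === nothing || run(`scancel $jobid`)
        rethrow()
    end
end

Cancelling eagerly on error, rather than relying on the worker process shutting down, would also cover the case where the allocation is never granted.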