JuliaParallel / ClusterManagers.jl

Hangs on `addprocs_sge()` #56

Closed tamasgal closed 4 years ago

tamasgal commented 7 years ago

I hope someone can help me. I am trying to do parallel computing with Julia on our SGE grid system, which so far I have only used through shell scripts.

When I run, for example, `addprocs_sge(5,res_list="ct=00:01:00")`, it immediately prints the job ID it received and waits for the job to start. A few seconds later, however, I get error messages indicating that it can't tail the log files in my home directory, even though they are present:

julia> using ClusterManagers

julia> addprocs_sge(5,res_list="ct=00:01:00")
job id is 5430281, waiting for job to start ........................
tail: /path/to/my/home/julia-59333.o5430281.1: No such file or directory
tail: no files remaining
tail: /path/to/my/home/julia-59333.o5430281.5: No such file or directory
tail: no files remaining

This is the content of one of the log files. Apparently the job fails to run the julia process, even though the binary is accessible (I am using the same one for the REPL):

***************************************************************
* Submitted on:            Tue Feb 07 09:30:16 2017           *
* Started on:              Tue Feb 07 09:31:18 2017           *
***************************************************************

/var/spool/sge/ccwsge0830/job_scripts/5430281:1: permission denied: /path/to/my/home/apps/julia/julia-0.5.0/bin/julia

***************************************************************
* Ended on:                Tue Feb 07 09:31:36 2017           *
* Exit status:             126                                *
* Consumed                                                    *
*   cpu (HS06):            00:00:00                           *
*   cpu scaling factor:    11.100000                          *
*   cpu time:              0 / 60                             *
*   efficiency:            00 %                               *
*   io:                    0.00000GB                          *
*   vmem:                  N/A                                *
*   maxvmem:               N/A                                *
*   maxrss:                N/A                                *
***************************************************************

Any ideas what's happening here?
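
For context on the log above: exit status 126 is what a POSIX shell reports when a command is found but cannot be executed, which matches the "permission denied" line. The sketch below reproduces that status with a hypothetical stand-in script (`fake_julia` is not the real binary, and the temporary directory is only for illustration). On a cluster, the same error can plausibly come from a missing execute bit as seen by the compute node or from a file system mounted `noexec` there; this is an assumption, not something the log confirms.

```shell
#!/bin/sh
# Hypothetical stand-in for the julia binary: a script WITHOUT the execute bit.
tmpdir=$(mktemp -d)
fake_julia="$tmpdir/julia"
printf '#!/bin/sh\necho hello\n' > "$fake_julia"
chmod 644 "$fake_julia"

# The file exists but is not executable, so the shell refuses to run it.
status=0
"$fake_julia" 2>/dev/null || status=$?
echo "exit status without execute bit: $status"

# After adding the execute bit, try again (this can still fail if the
# directory is mounted noexec).
chmod 755 "$fake_julia"
fixed_status=0
"$fake_julia" >/dev/null 2>&1 || fixed_status=$?
echo "exit status with execute bit: $fixed_status"

rm -rf "$tmpdir"
```

Running `ls -l` on the real binary from inside a submitted job script, rather than from the login node, would show whether the compute node sees the same permissions.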

bjarthur commented 7 years ago

If your SGE cluster supports `qrsh` as well as `qsub`, you might try the undocumented `QRSHManager` instead:

addprocs_qrsh(...

On high-performance file systems that aggressively buffer file I/O, interprocess communication via TCP is much more reliable than tailing output files.

tamasgal commented 7 years ago

Thanks, I tried but no success:

julia> addprocs_qrsh(5)
got no response from JSV script "/opt/sge/util/resources/jsv/corebinding.jsv"
got no response from JSV script "/opt/sge/util/resources/jsv/corebinding.jsv"
got no response from JSV script "/opt/sge/util/resources/jsv/corebinding.jsv"
got no response from JSV script "/opt/sge/util/resources/jsv/corebinding.jsv"
got no response from JSV script "/opt/sge/util/resources/jsv/corebinding.jsv"

juliohm commented 4 years ago

Too old to reproduce. We've released a new version of the package. Please report any issues there if they still apply.

tamasgal commented 4 years ago

Thanks, I'll try!