nathaliesoy opened 10 months ago
Thanks! This looks like it might be an issue in ClusterManagers.jl: https://github.com/JuliaParallel/ClusterManagers.jl/issues/179
What is your `qsub --version`?
pbs_version = 20.0.1
Okay this might take a bit longer to solve. It turns out to be really hard to set up a local version of PBS for testing things. But I'm working on it!
https://github.com/JuliaParallel/ClusterManagers.jl/pull/193
Basically what we need to do is modify these lines to fix ClusterManagers.jl:
```julia
qsub_cmd = pipeline(`echo $(Base.shell_escape(cmd))`, (isPBS ?
    `qsub -N $jobname -wd $wd -j oe -k o -t 1-$np $queue` :
    `qsub -N $jobname -wd $wd -terse -j y -R y -t 1-$np -V $queue`))
```
It sounds like they haven't yet updated this `qsub` call for PBS version 20.
If you are proficient with `qsub` and know which flags are needed here, you might be able to make a local modification of ClusterManagers.jl and then point PySR at that copy with:
```shell
cd ClusterManagers.jl
julia --project=@pysr-0.16.3 -e 'using Pkg; pkg"dev ."'
```
This will get the PySR environment for 0.16.3 to use the local copy of ClusterManagers.jl. Then, if you are able to update the `qsub` call in the `src/qsub.jl` file to the qsub version 20 syntax, it should work.
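For what it's worth, the main syntax difference I'm aware of is the job-array flag: the `-t 1-$np` in the snippet above is the older Torque/SGE spelling, while PBS Pro (which is what reports `pbs_version = 20.x`) uses `-J`. The sketch below is a hypothetical translation, not the verified fix — check the flags against `man qsub` on your cluster before using them (the `-wd $wd` option may also need attention, since it is an SGE flag that PBS Pro doesn't recognize):

```shell
# Hypothetical flag translation for the array-job part (verify locally):
#   Torque/SGE style:  -t 1-10   (what ClusterManagers currently emits)
#   PBS Pro style:     -J 1-10
np=10
jobname=pysr_workers
echo "qsub -N $jobname -j oe -J 1-$np"
```

The `echo` just prints the candidate command so you can inspect it before wiring anything into `src/qsub.jl`.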
Thank you Miles for investigating this! I think I figured out the new PBS 20 flags and changed it accordingly.
So I added these two lines to my submission shell script
```shell
cd ClusterManagers.jl
julia --project=@pysr-0.16.3 -e 'using Pkg; pkg"dev ."'
```
but it doesn't look like it is picking up the local package. The Julia version I am using is globally installed on the cluster. I can't recall — does ClusterManagers.jl need to be in a specific folder? Do I need to set a path somewhere?
Even if the Julia version is globally installed, the environments should appear in your local folder `~/.julia/environments`. There should be a `pysr-0.16.3` one in that folder (or whatever version of PySR you have installed).
If you open the file `~/.julia/environments/pysr-0.16.3/Manifest.toml` and go to the "ClusterManagers.jl" section, it should tell you whether it is a local version or not, and which folder it is using. Maybe the path name is a relative path rather than absolute? You could also try
```shell
julia --project=@pysr-0.16.3 -e 'using Pkg; Pkg.develop(path="/path/to/clustermanagers.jl")'
```
and give the full absolute path (to the location of your modified ClusterManagers.jl) there.
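To know what you're looking for in the manifest: Pkg records a `dev`-ed package with a `path` entry, whereas a registry-installed copy has a `git-tree-sha1` entry instead. The snippet below just illustrates that check on a mocked-up fragment — the `path`, `uuid`, and `version` values are placeholders, not the real ones:

```shell
# Write an illustrative Manifest.toml fragment (placeholder values),
# then check whether it points at a local dev checkout:
cat <<'EOF' > /tmp/manifest_snippet.toml
[[deps.ClusterManagers]]
path = "/absolute/path/to/ClusterManagers.jl"
uuid = "00000000-0000-0000-0000-000000000000"
version = "0.4.5"
EOF
if grep -q '^path = ' /tmp/manifest_snippet.toml; then
  echo "local dev copy"
else
  echo "registry copy"
fi
```

Running the same `grep` against your real `~/.julia/environments/pysr-0.16.3/Manifest.toml` should tell you at a glance whether the `dev` took effect.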
Oh wait, sorry. I just realized you said in the original post that you are using PySR 0.14.1. So either (1) update to PySR 0.16.3 and go through the normal installation with `python -m pysr install` before implementing these changes, or (2) use `--project=@pysr-0.14.1` instead of `-0.16.3`.
Okay, that part seems fine now, thanks! Now the issue is that when submitting, it can't connect to the server (errno=15010) — it seems like a permission thing... Should I take it up with our system administrator?
Hm, yeah the sysadmin might know best for that type of issue. How are you running things?
You could also try running a parallel Julia command manually, just to see if it gives a more helpful error message.
First, create an interactive job on the cluster that you can ssh into. SSH into it and start Julia with `julia --project=@pysr-0.16.3`. Then, execute the following (copy-paste):
```julia
import Distributed: pmap
import ClusterManagers: addprocs_pbs

num_workers = 10

# Create the workers:
procs = addprocs_pbs(num_workers)

# Run a computation on each worker:
pmap(worker_id -> worker_id^2, procs)
```
If successful, it should return a vector like `[4, 9, 16, ...]`, and each of those computations will have run on a different worker across the PBS allocation.
What happened?
When using the cluster manager on PBS, the code breaks. It seems to fail to start the workers due to wrong qsub flags.
Version
0.14.1
Operating System
Linux
Package Manager
pip
Interface
Script (i.e., `python my_script.py`)
Relevant log output
Extra Info
Setting multithreading to False doesn't change anything.