rgavazzi closed this issue 3 years ago
Condor might need a similar fix to https://github.com/JuliaParallel/MPI.jl/pull/222
Too old to reproduce. Please retry with the current stable release and reopen the issue if needed.
As far as I can tell, the problem is still present. Launching workers with HTCondor keeps failing, and the symptom is unchanged. telnet keeps complaining:
telnet: connect to address 192.168.1.3: Connection refused
If I run "nc -l 8200" directly on a machine mmm in the cluster and then run "telnet mmm 8200", the telnet connection succeeds. It seems to me that the equivalent of the nc -l command is the listen(portnum) call at line 45 of the condor.jl script...
Anyhow, I'd be interested to hear from anyone using ClusterManagers with an HTCondor scheduler, whether or not they face the same issue.
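One difference worth checking in the nc -l versus listen(portnum) comparison above: Julia's Sockets documentation states that listen(port) with no address listens on localhost only, whereas nc -l typically accepts connections on every interface. Whether this is what causes the refused connections here is an assumption, not something confirmed in this thread. The sketch below is a minimal, self-contained check; open_listener and can_connect are hypothetical helper names and are not part of condor.jl or ClusterManagers.

```julia
using Sockets

# Hypothetical helper: open a TCP listener near `port_hint`.
# With all_interfaces=true it binds to 0.0.0.0 (reachable from other hosts,
# similar to `nc -l`); otherwise it binds to the loopback address only.
function open_listener(port_hint::Integer; all_interfaces::Bool=true)
    host = all_interfaces ? IPv4(0) : ip"127.0.0.1"
    port, server = listenany(host, port_hint)  # picks the first free port >= port_hint
    return port, server
end

# Hypothetical helper: return true if a TCP connection to host:port succeeds,
# false if it fails (for example with "connection refused").
function can_connect(host, port::Integer)
    try
        sock = connect(host, port)
        close(sock)
        return true
    catch
        return false
    end
end

# Demo on the loopback address; on a cluster, run can_connect from a worker
# node against the submit machine's address instead.
port, server = open_listener(8200)
println("listening on port ", port)
println("loopback reachable: ", can_connect(ip"127.0.0.1", port))
close(server)
```

If a listener opened with all_interfaces=false is reachable on 127.0.0.1 but refused when a worker node connects to 192.168.1.3, that would point to the bind address rather than to HTCondor itself.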
I get the following error on my local cluster with the HTCondor scheduler (Julia version 1.1.0-DEV).
The created condor script file seems OK:
The temporary shell script file /home/dir/.julia-htc/julia-1195449.sh seems OK:
All output *.o files look like: Trying 192.168.1.3...
All output *.e files look like: telnet: connect to address 192.168.1.3: Connection refused
(machinenode.from_which_I_ran.julia has the local IP address 192.168.1.3)
Other issue: the method "addprocs_htc(np::Integer) = addprocs(HTCManager(np))" does not seem to allow specifying a different working directory. In many cases, HTCondor places julia-1195449.sh and its associated files in a temporary scratch working directory, where one may want to stay for the worker's lifetime. Couldn't we make the cd optional with a
(dir!=nothing) && println(scriptf, "cd $(Base.shell_escape(dir))")
and addprocs_htc(np::Integer; dir=nothing) = addprocs(HTCManager(np), dir=dir)
change in condor.jl?
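To make the proposal concrete, here is a minimal sketch of the conditional cd in isolation. It does not reproduce condor.jl: write_script_header is a hypothetical stand-in for the part of the manager that writes the worker shell script, and the "#!/bin/sh" line is an assumption about that script's first line. Only Base.shell_escape and the dir keyword come from the lines above.

```julia
# Hypothetical stand-in for the script-writing step in condor.jl.
# When `dir` is nothing, no `cd` line is written, so the worker stays in
# whatever directory HTCondor starts it in (for example its scratch directory).
function write_script_header(io::IO; dir::Union{Nothing,AbstractString}=nothing)
    println(io, "#!/bin/sh")  # assumed first line of the generated script
    dir !== nothing && println(io, "cd ", Base.shell_escape(dir))
    return nothing
end

# Usage: render both variants to strings for inspection.
render(; kwargs...) = sprint(io -> write_script_header(io; kwargs...))

print(render())                       # no cd line
print(render(dir="/home/dir/work"))   # includes a cd line
```

Using `dir !== nothing` rather than `dir != nothing` is the usual Julia idiom for this test, since it compares identity and cannot be affected by a custom `==` method.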