JuliaParallel / ClusterManagers.jl

workers launched with htcondor cluster manager cannot connect back with master? #107

Closed rgavazzi closed 3 years ago

rgavazzi commented 5 years ago

I get the following error on my local cluster with the HTCondor scheduler (Julia version 1.1.0-DEV):

julia> addprocs_htc(4)
Error launching condor
MethodError(iterate, (Process(`condor_submit /raid/gavazzi/.julia-htc/julia-1195449.sub`, ProcessExited(0)),), 0x00000000000061f6)
0-element Array{Int64,1}
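For context, `MethodError(iterate, (Process(...),))` is what Julia 1.x raises when code tries to iterate or destructure a `Process` object. The sketch below reproduces that error in isolation. It assumes (without having checked condor.jl) that the manager uses the older `out, proc = open(cmd)` pattern; the `echo` command is only a stand-in for `condor_submit`.

```julia
# Stand-in command; the real manager would run condor_submit (assumption).
cmd = `echo submitted`

# Older pattern: destructuring the result of open(cmd).
# On Julia 1.x open(cmd) returns a single Process, so this throws.
err = try
    out, proc = open(cmd)
    nothing
catch e
    e
end
println("destructuring open(cmd) gave: ", typeof(err))

# A pattern that works on Julia 1.x: read the command's output directly.
output = read(cmd, String)
println("command output: ", strip(output))
```

If this is indeed the cause, the `condor_submit` call itself succeeded (the process exited with status 0) and only the handling of its output failed.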

The created condor script file seems OK:

executable = /bin/bash
arguments = ./julia-1195449.sh
universe = vanilla
should_transfer_files = yes
transfer_input_files = /home/dir/.julia-htc/julia-1195449.sh
Notification = Error
output = /home/dir/.julia-htc/julia-1195449-1.o
error= /home/dir/.julia-htc/julia-1195449-1.e
queue
output = /home/dir/.julia-htc/julia-1195449-2.o
error= /home/dir/.julia-htc/julia-1195449-2.e
queue
output = /home/dir/.julia-htc/julia-1195449-3.o
error= /home/dir/.julia-htc/julia-1195449-3.e
queue
output = /home/dir/.julia-htc/julia-1195449-4.o
error= /home/dir/.julia-htc/julia-1195449-4.e
queue

The temporary shell script file /home/dir/.julia-htc/julia-1195449.sh seems OK:

#!/bin/sh
cd /tmp
/usr/bin/julia --worker=o7tjjc9VsZGKA8qn | /usr/bin/telnet  machinenode.from_which_I_ran.julia 8848

All output *.o files look like: Trying 192.168.1.3...

All output *.e files look like: telnet: connect to address 192.168.1.3: Connection refused

(machinenode.from_which_I_ran.julia has the local IP address 192.168.1.3.)

Other issue: the method "addprocs_htc(np::Integer) = addprocs(HTCManager(np))" does not seem to allow specifying a different working directory. In many cases, HTCondor places julia-1195449.sh and the associated files in a temporary scratch working directory, where one may want to stay for the worker's lifetime. Couldn't we avoid the hard-coded "cd /tmp" with a

(dir!=nothing) && println(scriptf, "cd $(Base.shell_escape(dir))")

and addprocs_htc(np::Integer ; dir=nothing ) = addprocs(HTCManager(np) , dir=dir)

change in condor.jl
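A minimal sketch of how the proposed `dir` option could shape the generated shell script. The function `write_worker_script` and all of its parameters are hypothetical names for illustration, not the actual condor.jl code; only the `cd` line mirrors the proposal above.

```julia
# Hypothetical helper: writes a worker launch script to `io`.
# `dir === nothing` means "stay in whatever directory HTCondor starts the job in".
function write_worker_script(io::IO; dir=nothing, exename="/usr/bin/julia",
                             cookie="COOKIE", host="localhost", port=8848)
    println(io, "#!/bin/sh")
    if dir !== nothing
        # Quote the directory so that spaces or shell metacharacters are safe.
        println(io, "cd $(Base.shell_escape(dir))")
    end
    println(io, "$(exename) --worker=$(cookie) | /usr/bin/telnet $(host) $(port)")
end

# Usage: render the script into a buffer and inspect it.
buf = IOBuffer()
write_worker_script(buf; dir="/scratch/my dir")
script = String(take!(buf))
print(script)
```

With `dir` left at `nothing`, no `cd` line is written at all, so the worker stays in the scratch directory that HTCondor chose.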

vchuravy commented 5 years ago

Condor might need a similar fix to https://github.com/JuliaParallel/MPI.jl/pull/222

juliohm commented 3 years ago

Too old to reproduce. Please retry with the current stable release and reopen the issue if needed.

rgavazzi commented 3 years ago

As far as I can tell, the problem is still present! I keep failing to launch workers with HTCondor. The problem remains the same; telnet keeps complaining:

telnet: connect to address 192.168.1.3: Connection refused

If I run "nc -l 8200" directly on a machine mmm in the cluster and then run "telnet mmm 8200", the telnet connection succeeds! It seems to me that the equivalent of the nc -l command is the listen(portnum) call at line 45 of the condor.jl script...
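One difference worth checking, offered as a possibility rather than a diagnosis: in Julia's Sockets standard library, `listen(port)` binds to localhost only by default, whereas `nc -l` typically listens on all interfaces. A server bound to the loopback address refuses connections arriving on 192.168.1.3. The sketch below binds one server each way and prints the addresses; the port hint 8200 is arbitrary, and whether condor.jl binds this way is an assumption.

```julia
using Sockets

# Arbitrary port hint; listenany picks the first free port at or above it.
port_hint = 8200

# Loopback only: the same reach as the default listen(port).
lo_port, lo_server = listenany(ip"127.0.0.1", port_hint)

# All IPv4 interfaces: reachable from other machines, firewall permitting.
any_port, any_server = listenany(IPv4(0), port_hint)

lo_addr = getsockname(lo_server)
any_addr = getsockname(any_server)
println("loopback-only server bound to ", lo_addr[1], ":", Int(lo_addr[2]))
println("all-interfaces server bound to ", any_addr[1], ":", Int(any_addr[2]))

close(lo_server)
close(any_server)
```

Running `ss -ltn` (or `netstat -ltn`) on the master while it waits for workers would show which of the two addresses the listening socket actually uses.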

Anyhow, I'd be interested to hear from anyone who uses ClusterManagers with an HTCondor scheduler, whether or not they face the same issue!