JuliaParallel / ClusterManagers.jl


htcondor manager: failure when listening to a telnet commu #150

Open rgavazzi opened 3 years ago

rgavazzi commented 3 years ago

See the recent comment on the unduly closed issue #107!

In a nutshell: the telnet connection between the worker node and the master node fails:

telnet: connect to address 192.168.1.3: Connection refused
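For anyone trying to reproduce this outside of HTCondor, here is a minimal sketch of the same reachability check that telnet performs, run from a worker node against the master. It is written in Python only so that it runs without telnet or nc being installed. The helper name `can_connect` and the port number in the usage comment are hypothetical and not part of ClusterManagers.jl; substitute whatever port the master actually listens on.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused", timeouts, and unreachable hosts.
        return False


# Example usage from a worker node (the port 9000 is an assumption):
# print(can_connect("192.168.1.3", 9000))
```

If this returns False for the master address while it returns True when run on the master itself against 127.0.0.1, the listener is probably not accepting remote connections, or a firewall sits between the nodes.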

Is anyone able to run addprocs_htc() on a cluster running the HTCondor scheduler? The issue was posted when I was running Julia version <= 1.1, but it is still present with v1.4 and v1.5.

aminnj commented 3 years ago

Hi, I ran into this issue too. Based on the MPI change mentioned in https://github.com/JuliaParallel/ClusterManagers.jl/issues/107, I made a modification that allows connections from remote machines:

https://github.com/aminnj/ClusterManagers.jl/commit/f91789be45336b0c4ca949ffd9853ba283cbccdf#diff-54c957b90c04bed63e172caa4efa42b072b2e0aef85562ece656d68f8bc8337bL45-R57

In my case, I switched to nc since telnet wasn't available in my worker node environment. If it works out for you too, I can clean this up and make a PR.

rgavazzi commented 3 years ago

Finally managed to test it. It seems to work on my cluster! I still get some erratic connection issues with a few particular nodes on the cluster, but this may not be related to ClusterManagers. I like the additional options, too! Thanks!

aminnj commented 3 years ago

Glad it works for you! :)

Moelf commented 3 years ago

@tanmaykm this can probably be closed? And maybe make a new breaking release for both the HTCondor and qsub-related overhaul in #153?