cBio / cbio-cluster

MSKCC cBio cluster documentation

Connectivity issues for interactive jobs #393

Closed by akahles 8 years ago

akahles commented 8 years ago

I am trying to get an interactive session on hal, which intermittently produces the following error:

qsub: waiting for job 7048764.hal-sched1.local to start
qsub: job 7048764.hal-sched1.local ready

Error in connection to trqauthd (15137)-[could not connect to unix socket /tmp/trqauthd-unix: 111]

Unable to communicate with hal-sched1.local(IP)
Communication failure.
Error in connection to trqauthd (15137)-[could not connect to unix socket /tmp/trqauthd-unix: 111]

Unable to communicate with hal-sched1.local(IP)
Communication failure.
Error in connection to trqauthd (15137)-[could not connect to unix socket /tmp/trqauthd-unix: 111]

Unable to communicate with hal-sched1.local(IP)
Communication failure.
qstat: cannot connect to server hal-sched1.local (errno=15137) could not connect to trqauthd

After a few tries, I got an interactive session. However, I wanted to report the observation. I removed the IP address from the message; I assume you have it anyway. Let me know if not.
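For anyone who hits the same message: if the trailing 111 is a Linux errno, it corresponds to ECONNREFUSED, which would mean the socket file exists on the node but nothing is accepting connections on it. A minimal sketch for checking this on a given node follows. The socket path is taken from the error above; the helper name `probe_unix_socket` is made up for illustration and is not part of Torque.

```python
import errno
import socket


def probe_unix_socket(path, timeout=2.0):
    """Try to connect to a unix socket and return (ok, detail).

    Hypothetical helper for illustration only; not a Torque tool.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return True, "connected"
    except OSError as exc:
        # Map the numeric errno (e.g. 111) to its symbolic name (e.g. ECONNREFUSED).
        name = errno.errorcode.get(exc.errno, "unknown")
        return False, f"{name} ({exc.errno}): {exc.strerror}"
    finally:
        sock.close()


if __name__ == "__main__":
    # Path copied from the error message in this report.
    ok, detail = probe_unix_socket("/tmp/trqauthd-unix")
    print("trqauthd socket:", "OK" if ok else "FAILED", "-", detail)
```

Run on the node where the job landed, this only shows whether something is listening on the socket. It does not say why the daemon is down.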

tatarsky commented 8 years ago

I believe something is wrong with gpu-2-14. I am removing it from the pool.

tatarsky commented 8 years ago

Yep. gpu-2-14 is in a bad way and I believe your interactive jobs tried to go there.

akahles commented 8 years ago

Ok, thanks for looking into this. Interactive jobs on the other nodes are running along happily. I think this solves this problem. Thanks!