TACC / launcher

A simple utility for executing multiple sequential or multi-threaded applications in a single multi-node batch job
MIT License

Issue with/question about dynamic scheduling method #46

Open ironbars opened 6 years ago

ironbars commented 6 years ago

Hello,

I'm having a bit of trouble getting dynamic scheduling to work properly on my cluster. Using a single node, it works fine. However, every time I attempt to use more than one node, it appears as if the launcher process hangs on the second node, and the first node (the node that has the task server running on it) completes all of the jobs alone.

I have verified that the correct ports are open between the compute nodes. For instance, I can nc -l localhost 9471 on node1, connect to that process using nc -4 node1 9471 on node2, and successfully pass arbitrary text back and forth. When I try to run the tskserver manually, however (i.e. ./tskserver 5 localhost 9471) on node1, the above nc command on node2 fails with connection refused.

I have also verified that the launcher script is actually getting started on node2 (via top), but it doesn't appear to be doing any of the work. When I look at the job output, I just see tasks being executed on node1. When they're done, the output is just a bunch of "connection refused" messages from netcat.

I'm on CentOS 7.3, if that has any bearing. Please let me know if you need any additional information from me. Any help would be appreciated.

Thank you, Marc

ironbars commented 6 years ago

I forgot to add that the "block" and "interleaved" scheduling methods work just fine with multi-node jobs. Are there any advantages to using any one of the three over the other two?

lwilson commented 6 years ago

Hi Marc,

It sounds like tskserver is not binding to the port on node1, which is why the connection refused error is occurring. What version of Python are you running?

The other two scheduling methods are static, so they do not have to communicate with tskserver (which doesn't even run in that case). These two methods work really well if you know the runtimes of all jobs are approximately the same. They are faster than dynamic and more scalable, but if the runtimes of your individual jobs vary, these two methods can leave cores idle.
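To illustrate the trade-off described above, here is a rough Python sketch that compares the three strategies on a toy workload. It is not Launcher's actual code: the function names, the worker model, and the runtimes are all made up for illustration, and it ignores the communication overhead that makes dynamic scheduling slower in practice.

```python
import heapq

def block_schedule(num_tasks, num_workers):
    """Static: give each worker one contiguous chunk of task indices."""
    chunk = -(-num_tasks // num_workers)  # ceiling division
    return [list(range(w * chunk, min((w + 1) * chunk, num_tasks)))
            for w in range(num_workers)]

def interleaved_schedule(num_tasks, num_workers):
    """Static: deal tasks out round-robin (task i goes to worker i % num_workers)."""
    return [list(range(w, num_tasks, num_workers)) for w in range(num_workers)]

def static_makespan(assignment, runtimes):
    """Total wall time is set by the worker with the most work."""
    return max(sum(runtimes[t] for t in tasks) for tasks in assignment)

def dynamic_makespan(runtimes, num_workers):
    """Dynamic: whichever worker becomes free first takes the next task."""
    free_at = [0.0] * num_workers  # time at which each worker is next idle
    heapq.heapify(free_at)
    for runtime in runtimes:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + runtime)
    return max(free_at)

# Hypothetical workload: one long task followed by seven short ones.
runtimes = [8, 1, 1, 1, 1, 1, 1, 1]
workers = 2

print(static_makespan(block_schedule(len(runtimes), workers), runtimes))        # 11
print(static_makespan(interleaved_schedule(len(runtimes), workers), runtimes))  # 11
print(dynamic_makespan(runtimes, workers))                                      # 8.0
```

In this toy case both static methods leave one worker idle for most of the run, while the dynamic model keeps both busy. With uniform runtimes all three would finish at the same time, and the static methods would avoid the task-server round trips entirely.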

ironbars commented 6 years ago

Hi Lucas,

Thank you for the prompt response!

I'm running Python 2.7.5. I think that tskserver is binding to the port; if I run it, I can see it show up in the output of netstat -nlp. The state is LISTEN and it is tied to the Python process that tskserver starts. Also, if I run nc localhost 9471 on node1 it will respond with a number as expected (which would be a Launcher job ID, if I understand the code correctly). This would indicate that tskserver is binding to the port correctly, right? It's perfectly possible that I'm misunderstanding how such things work.

Thank you!

ironbars commented 6 years ago

Hi Lucas,

I've made a terrible error. When I run ./tskserver 5 localhost 9471, it is binding and listening on the loopback interface (so only connections coming from 127.0.0.0/8 will be accepted!). When I run it on the actual network interface (using ./tskserver 5 $HOSTNAME 9471) it will serve the integers as normal, even to a remote host.

However, it still doesn't appear to work within a job, and now I'm really at a loss to figure out why that is.
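For anyone who lands here with the same symptom, the loopback behavior described above can be reproduced with a few lines of Python. This is a generic socket sketch, not tskserver's code: the helper names listen_on and is_loopback_only are hypothetical, it is written for Python 3 (the thread uses 2.7.5), and port 0 just asks the OS for any free port.

```python
import ipaddress
import socket

def listen_on(host, port=0):
    """Open a TCP listening socket bound to the given host address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(5)
    return sock

def is_loopback_only(sock):
    """True if the socket is bound to a 127.0.0.0/8 address,
    in which case other nodes cannot connect to it."""
    bound_addr = sock.getsockname()[0]
    return ipaddress.ip_address(bound_addr).is_loopback

# "127.0.0.1" is what "localhost" normally resolves to over IPv4;
# "0.0.0.0" listens on every interface, including the cluster network.
for host in ("127.0.0.1", "0.0.0.0"):
    server = listen_on(host)
    addr, port = server.getsockname()
    print(host, "-> bound to", addr, "port", port,
          "| loopback only:", is_loopback_only(server))
    server.close()
```

A server bound to the loopback address shows up as LISTEN in netstat and answers local connections, yet remote hosts get "connection refused", which matches what was observed when tskserver was started with localhost.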