Open bcdarwin opened 10 years ago
Another option is to use the job dependencies built into Torque/MOAB: submit the server job, then submit the executors with a dependency so that they start only once the server has started. This might waste less time (I don't know whether that's true; it has to be tried) than having the server submit new jobs itself, since those new jobs might wait in the queue for hours or days.
On Oct 14, 2014, at 12:04 PM, Ben Darwin notifications@github.com wrote:
Currently the executors sleep for 1000 s to ensure the server has started, but this wastes SciNet resources. One partial fix is to wait only until the uri file is created; see the implementations in Bash [cc10439] and in Python [9410da6]. (We poll every 10 s, though being notified would be nicer.)
A much more polite solution would be to submit executors only once the server has started. This could be done either via the server submitting jobs (using SSH into the login nodes) or by submitting an empty job (with the same wall time limit as the server) along with the server and making the executor jobs depend on this job's completion.
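The partial fix mentioned above (waiting for the uri file instead of sleeping a fixed 1000 s) can be sketched roughly as follows. This is not the code from the referenced commits; the function name `wait_for_file` and its parameters are illustrative assumptions, with defaults that mirror the 10 s poll and the 1000 s upper bound from the description.

```python
import os
import time


def wait_for_file(path, poll_interval=10.0, timeout=1000.0):
    """Return True as soon as `path` exists, or False once `timeout` seconds pass.

    Hypothetical helper: `path` would be the server's uri file.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_interval, remaining))
```

An executor would call something like `wait_for_file("uri")` (the file name here is an assumption) before connecting, and give up if it returns `False`. The executor job still occupies its allocation while polling, which is why this is only a partial fix.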
I didn't realize from the SciNet docs that Torque allows this, but it is probably the best idea (it is what I was proposing to simulate via trivial jobs).
We could also eliminate the complexity of starting a new server over SSH by immediately submitting several servers (how many would depend on our non-existent time estimates), each depending on the previous one's completion, and similarly for clients.
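As a rough illustration of the dependency approach, the sketch below builds `qsub` invocations using Torque's `-W depend=after:<jobid>` attribute, which makes a job eligible to run only after the named job has started. The helper names and the script names `server.sh` and `executor.sh` are hypothetical, and whether the scheduler on SciNet honors this dependency type as expected would still have to be tried.

```python
import subprocess


def qsub_command(script, depend_on=None, dep_type="after"):
    """Build a qsub argument list, optionally with a job dependency.

    dep_type="after" asks Torque to start this job only once the
    jobs listed in depend_on have started executing.
    """
    cmd = ["qsub"]
    if depend_on:
        cmd += ["-W", "depend={}:{}".format(dep_type, ":".join(depend_on))]
    cmd.append(script)
    return cmd


def run_qsub(cmd):
    """Run qsub and return the job id it prints on stdout."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()


def submit_server_and_executors(num_executors, submit=run_qsub):
    """Submit one server job, then executors that wait for it to start.

    The script names are placeholders, not the project's real scripts.
    """
    server_id = submit(qsub_command("server.sh"))
    executor_ids = [
        submit(qsub_command("executor.sh", depend_on=[server_id]))
        for _ in range(num_executors)
    ]
    return server_id, executor_ids
```

Passing `submit` as a parameter keeps the submission logic testable without a cluster; on a login node the default `run_qsub` would be used.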
See #122.
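The idea of submitting a chain of servers up front, each depending on the previous one's completion, might look like the sketch below. It uses Torque's `afterany` dependency type (run after the named job has terminated, with or without errors). The function name, the `submit` callable, and the choice of `afterany` rather than `afterok` are assumptions made for illustration; the same function could be called once for servers and once for clients.

```python
def chain_submissions(script, count, submit, dep_type="afterany"):
    """Submit `count` copies of `script`, each depending on the previous one finishing.

    `submit` is any callable that takes a qsub argument list and returns
    the new job's id (for example, a thin wrapper around subprocess.run).
    """
    job_ids = []
    previous = None
    for _ in range(count):
        args = ["qsub"]
        if previous is not None:
            # Each job after the first waits for its predecessor to terminate.
            args += ["-W", "depend={}:{}".format(dep_type, previous)]
        args.append(script)
        previous = submit(args)
        job_ids.append(previous)
    return job_ids
```

Since there are no time estimates yet, `count` would have to be a guess; surplus jobs at the end of the chain would need to be cancelled (or exit immediately) once the pipeline has finished.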