I've noticed that when running single-CPU jobs on many nodes, the `tskserver` sometimes gets overwhelmed by `ncat` requests. As a result, some of these `ncat` requests hang indefinitely, causing the job to run into its walltime. I reproduced this on lonestar5 with only 3 nodes, but it can probably be replicated elsewhere.
Anyway, I have a crude fix, but I figured I'd open an issue rather than a pull request, since there might be a cleaner way to fix it.
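As one possible direction for a cleaner approach, a hung request could be bounded with a per-attempt time limit and retried a few times. The sketch below is only an illustration of that idea, not the fix I applied and not code from the launcher itself: `retry_with_timeout` is a hypothetical helper name, and I'm assuming GNU coreutils `timeout` is available on the compute nodes.

```shell
#!/bin/bash
# Hypothetical helper (not part of launcher): run a command with a
# per-attempt time limit, retrying a fixed number of times.
# Usage: retry_with_timeout <seconds> <attempts> <command> [args...]
retry_with_timeout() {
    local limit=$1 attempts=$2
    shift 2
    local i
    for ((i = 1; i <= attempts; i++)); do
        # timeout(1) kills the command and exits with status 124
        # if it runs longer than $limit seconds.
        if timeout "$limit" "$@"; then
            return 0
        fi
        # Brief pause before the next attempt so retries from many
        # nodes do not hit the server at the same instant.
        if ((i < attempts)); then
            sleep 1
        fi
    done
    return 1
}
```

The command that talks to the server would then be passed as the arguments after the attempt count, so a request that never returns costs at most a few seconds per attempt instead of the rest of the walltime. How the result of a failed request should be handled is a separate question that this sketch does not address.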
Adding this line at the top of `launcher` fixed the problem for me: