TACC / launcher

A simple utility for executing multiple sequential or multi-threaded applications in a single multi-node batch job
MIT License

tskserver and the thundering herd #55

Closed bcov77 closed 4 years ago

bcov77 commented 5 years ago

Hi,

I've noticed that when running single-CPU jobs on many nodes, tskserver sometimes gets overwhelmed by ncat requests. As a result, some of these requests hang indefinitely, and the job keeps running until it hits its walltime limit. I reproduced this on Lonestar5 with only 3 nodes, but it can probably be replicated elsewhere.

Anyways, I have a crude fix, but I figured I'd make an issue rather than a pull request as there might be a cleaner way to fix the issue.

Adding this line at the top of launcher fixed the problem for me:

sleep $(( (RANDOM % 50) + 1 ))s  # random sleep of 1 to 50 seconds

siliu-tacc commented 4 years ago

I can update it in the next release. Thx

siliu-tacc commented 4 years ago

50s is too long

sleep $(( (RANDOM % 24) + 1 ))s
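
For anyone reading along, the random-delay idea in this thread can be sketched as a small bash helper with the maximum delay as a parameter. This is only an illustration, not what launcher actually ships: the function name `jitter_seconds` and its default of 24 are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of launcher): print a random whole number
# of seconds in the range 1..max, where max is the first argument.
jitter_seconds() {
    local max=${1:-24}              # assumed default, mirrors the 24 s cap suggested above
    echo $(( (RANDOM % max) + 1 ))  # RANDOM is a bash built-in yielding 0..32767
}
```

Under these assumptions, each process would call something like `sleep "$(jitter_seconds 24)"` once before its first request to tskserver, so that the initial requests are spread over up to 24 seconds instead of arriving together. A smaller maximum shortens the startup delay but spreads the requests less.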