tskserver and the thundering herd

bcov77 commented 5 years ago

Hi,

I've noticed that when running single-CPU jobs on lots of nodes, the tskserver sometimes gets overwhelmed by ncat requests. Some of those requests then hang indefinitely, so the job keeps running until it hits its walltime limit. I reproduced this on lonestar5 with only 3 nodes, and it can probably be replicated elsewhere.

Anyway, I have a crude fix, but I figured I'd open an issue rather than a pull request, as there might be a cleaner way to solve this.

Adding this line at the top of launcher fixed the problem for me:

sleep $(( RANDOM % 50 + 1 ))s  # sleep a random 1-50 s
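
To spell out the idea a bit more, here is a minimal sketch of the same random-delay trick with the upper bound pulled out into a variable. The name max_delay and the echo line are just for illustration and are not part of launcher; the actual sleep is commented out so the snippet can be run without waiting.

```shell
#!/bin/bash
# Upper bound on the startup delay, in seconds (hypothetical tunable).
max_delay=50

# In bash, $RANDOM is an integer from 0 to 32767, so the expression
# below yields a whole number from 1 to max_delay.
delay=$(( RANDOM % max_delay + 1 ))

# Each launcher instance draws its own delay, so the first requests
# reach the tskserver spread over the window instead of all at once.
echo "would sleep ${delay}s before contacting tskserver"
# sleep "${delay}s"   # enable this in a real script
```

A smaller max_delay wastes less time per task but packs the initial requests closer together, so the right value probably depends on how many tasks start at once.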

siliu-tacc commented 4 years ago

I can update it in the next release. Thx

siliu-tacc commented 4 years ago

50s is too long

sleep $(( RANDOM % 24 + 1 ))s

TACC / launcher

tskserver and the thundering herd #55