TACC / launcher

A simple utility for executing multiple sequential or multi-threaded applications in a single multi-node batch job
MIT License

bash-related problems #44

Closed jklynch closed 6 years ago

jklynch commented 7 years ago

joblist.txt launcher.job.txt

We have seen two problems recently while using dynamic scheduling on Stampede when requesting more than one node. Minimal files to reproduce the issues are attached.

Both problems appear to be resolved by replacing single-bracket tests ([ ... ]) with double-bracket tests ([[ ... ]]) and quoting variable expansions in the launcher and paramrun scripts.

The first problem looks like this:

/home1/04658/jklynch/launcher/launcher: line 82: [: -gt: unary operator expected
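For readers unfamiliar with this message, here is a minimal sketch of how bash produces it. It assumes that the variable tested at launcher line 82 is empty or unset on the affected tasks; the name `count` below is hypothetical and is not taken from the launcher script.

```shell
#!/usr/bin/env bash
# `count` is a hypothetical stand-in for whichever variable is empty at
# launcher line 82; the real variable name is not shown in this report.
count=""

# Unquoted in single brackets: the empty expansion vanishes, so `[` is
# handed `-gt 0` with no left operand and bash reports an error (status 2).
single_err=$( { [ $count -gt 0 ]; } 2>&1 )
single_status=$?

# Double brackets do no word splitting, and bash evaluates the empty
# operand as 0 in an arithmetic comparison, so the test is simply false.
double_err=$( { [[ $count -gt 0 ]]; } 2>&1 )
double_status=$?

echo "single brackets: status=$single_status message=$single_err"
echo "double brackets: status=$double_status message=$double_err"
```

Note that quoting alone inside single brackets, as in `[ "$count" -gt 0 ]`, still fails in bash with "integer expression expected" when the value is empty, which may be why the combination of double brackets and quoting was needed. Supplying a default such as `${count:-0}` would be another option.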

The second problem looks like this:

WARNING: No response from dynamic task server. Retrying...
WARNING: No response from dynamic task server. Retrying...
WARNING: No response from dynamic task server. Retrying...
WARNING: No response from dynamic task server. Retrying...
...

This warning is repeated until the job times out.

Attached are a job script for SLURM called 'launcher.job' and a launcher job file called 'joblist' that demonstrate these issues when submitted like this:

sbatch -N 2 launcher.job

We see output like this:

LAUNCHER_WORKDIR: /work/04658/jklynch/test_launcher
Launcher: Setup complete.

------------- SUMMARY ---------------
   Number of hosts:    2
   Working directory:  /work/04658/jklynch/test_launcher
   Processes per host: 4
   Total processes:    8
   Total jobs:         8
   Scheduling method:  dynamic

-------------------------------------
Launcher: Starting parallel tasks...
Launcher: Task 1 running job 4 on c517-101.stampede.tacc.utexas.edu (sleep 1 && echo $LAUNCHER_JID)
Launcher: Task 0 running job 1 on c517-101.stampede.tacc.utexas.edu (sleep 1 && echo $LAUNCHER_JID)
Launcher: Task 3 running job 3 on c517-101.stampede.tacc.utexas.edu (sleep 1 && echo $LAUNCHER_JID)
Launcher: Task 2 running job 2 on c517-101.stampede.tacc.utexas.edu (sleep 1 && echo $LAUNCHER_JID)
/home1/04658/jklynch/launcher/launcher: line 82: [: -gt: unary operator expected
/home1/04658/jklynch/launcher/launcher: line 82: [: -gt: unary operator expected
/home1/04658/jklynch/launcher/launcher: line 82: [: -gt: unary operator expected
/home1/04658/jklynch/launcher/launcher: line 82: [: -gt: unary operator expected
WARNING: No response from dynamic task server. Retrying...
WARNING: No response from dynamic task server. Retrying...
WARNING: No response from dynamic task server. Retrying...
...

We will make a pull request with our changes to demonstrate how we resolved the problems.