We have seen two problems recently while using dynamic scheduling on Stampede when requesting more than one node. Minimal files to reproduce the issues are attached.
Both problems appear to be resolved by replacing single-bracket tests with double-bracket tests and quoting the variables in the launcher and paramrun scripts.
The first problem looks like this:
/home1/04658/jklynch/launcher/launcher: line 82: [: -gt: unary operator expected
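Bash prints this message when an unquoted variable inside a single-bracket test expands to nothing. Here is a minimal sketch of that behavior, assuming the variable tested on line 82 is empty or unset at that point. The name `remaining` is hypothetical and is not taken from the launcher script.

```shell
#!/bin/bash
# Hypothetical variable standing in for one that is unexpectedly empty.
remaining=""

# Unquoted inside single brackets: the shell actually runs `[ -gt 0 ]`,
# which fails with "unary operator expected" (message silenced here).
single_bracket_status() {
    local status=0
    [ $remaining -gt 0 ] 2>/dev/null || status=$?
    echo "$status"
}

# Quoted inside double brackets: the empty string is evaluated as 0 in the
# arithmetic comparison, so the test is simply false and no error is raised.
double_bracket_status() {
    local status=0
    [[ "$remaining" -gt 0 ]] || status=$?
    echo "$status"
}

single_bracket_status   # prints 2 (test reported an error)
double_bracket_status   # prints 1 (test evaluated to false)
```

The double-bracket form only hides the symptom if the variable is empty for some other reason, so it may still be worth checking why it is empty when more than one node is requested.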
The second problem looks like this:
WARNING: No response from dynamic task server. Retrying...
WARNING: No response from dynamic task server. Retrying...
WARNING: No response from dynamic task server. Retrying...
WARNING: No response from dynamic task server. Retrying...
...
This warning is repeated until the job times out.
Attached are a job script for SLURM called 'launcher.job' and a launcher job file called 'joblist' that demonstrate these issues when submitted like this:
sbatch -N 2 launcher.job
We see output like this:
LAUNCHER_WORKDIR: /work/04658/jklynch/test_launcher
Launcher: Setup complete.
------------- SUMMARY ---------------
Number of hosts: 2
Working directory: /work/04658/jklynch/test_launcher
Processes per host: 4
Total processes: 8
Total jobs: 8
Scheduling method: dynamic
-------------------------------------
Launcher: Starting parallel tasks...
Launcher: Task 1 running job 4 on c517-101.stampede.tacc.utexas.edu (sleep 1 && echo $LAUNCHER_JID)
Launcher: Task 0 running job 1 on c517-101.stampede.tacc.utexas.edu (sleep 1 && echo $LAUNCHER_JID)
Launcher: Task 3 running job 3 on c517-101.stampede.tacc.utexas.edu (sleep 1 && echo $LAUNCHER_JID)
Launcher: Task 2 running job 2 on c517-101.stampede.tacc.utexas.edu (sleep 1 && echo $LAUNCHER_JID)
/home1/04658/jklynch/launcher/launcher: line 82: [: -gt: unary operator expected
/home1/04658/jklynch/launcher/launcher: line 82: [: -gt: unary operator expected
/home1/04658/jklynch/launcher/launcher: line 82: [: -gt: unary operator expected
/home1/04658/jklynch/launcher/launcher: line 82: [: -gt: unary operator expected
WARNING: No response from dynamic task server. Retrying...
WARNING: No response from dynamic task server. Retrying...
WARNING: No response from dynamic task server. Retrying...
...
We will make a pull request with our changes to demonstrate how we resolved the problems.
joblist.txt launcher.job.txt