TACC / launcher

A simple utility for executing multiple sequential or multi-threaded applications in a single multi-node batch job
MIT License

launcher runs too many jobs when using multiple nodes #48

Open schristley opened 6 years ago

schristley commented 6 years ago

I filed a TACC ticket (Ticket #43327) as I don't know if this is a launcher bug, or just an issue with TACC's current version of launcher.

I'm using the launcher module to run multiple processes across multiple nodes. When the number of commands is less than the total number of processes available, launcher runs the last command multiple times.

A test job that reproduces the bug is available here:

/scratch/01114/vdj/vdj/launcher-test

I create a simple joblist with 12 echo commands. LAUNCHER_PPN=4 and the job requests 4 nodes, which means that a total of 16 concurrent processes could be run, though only 12 are needed. Here you can see the summary printed by launcher.

------------- SUMMARY ---------------
   Number of hosts:    4
   Working directory:  /scratch/01114/vdj/vdj/launcher-test
   Processes per host: 4
   Total processes:    16
   Total jobs:         12
   Scheduling method:  interleaved


The last command in joblist is "echo 12", and this command is actually run 5 times. If you look at job.out, even though there are 12 total jobs, 16 jobs are actually run, with the last one being run multiple times.
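To make the expected behavior concrete, here is a minimal sketch of an interleaved (round-robin) assignment that never hands out a job number beyond the end of the joblist. It is not launcher's actual code: `jobs_for_task` is a hypothetical helper, and the assumption that task N starts at job N+1 and then steps by the total process count is inferred only from the task/job numbering in the log.

```shell
#!/bin/sh
# Hypothetical helper, not part of launcher: print the 1-based job numbers
# that one task should run when nprocs tasks share njobs jobs round-robin.
# Usage: jobs_for_task TASK_ID NPROCS NJOBS
jobs_for_task() (
    task=$1
    nprocs=$2
    njobs=$3
    job=$((task + 1))
    # A task whose first job number exceeds the job count gets no work at all.
    while [ "$job" -le "$njobs" ]; do
        echo "$job"
        job=$((job + nprocs))
    done
)

# The case from this report: 16 processes, 12 jobs.
task=0
while [ "$task" -lt 16 ]; do
    echo "task $task: $(jobs_for_task "$task" 16 12 | tr '\n' ' ')"
    task=$((task + 1))
done
```

Under these assumptions, tasks 0 through 11 each receive exactly one job and tasks 12 through 15 receive none, which is the behavior I expected instead of four extra runs of the last line.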

schristley commented 6 years ago

Since you probably don't have access to TACC, here are the test files. This is job.sh:

#!/bin/bash
#SBATCH -J repcalc_bcr4_test
#SBATCH -o job.out
#SBATCH -e job.err
#SBATCH -t 01:00:00
#SBATCH -p skx-normal
#SBATCH -N 4 -n 48
#SBATCH -A RepServer

module purge
module load TACC
module load launcher
module load python

rm -f joblist
touch joblist
echo "echo 1" >> joblist
echo "echo 2" >> joblist
echo "echo 3" >> joblist
echo "echo 4" >> joblist
echo "echo 5" >> joblist
echo "echo 6" >> joblist
echo "echo 7" >> joblist
echo "echo 8" >> joblist
echo "echo 9" >> joblist
echo "echo 10" >> joblist
echo "echo 11" >> joblist
echo "echo 12" >> joblist

# Launcher to use multicores on node
export LAUNCHER_WORKDIR=$PWD
export LAUNCHER_PPN=4
export LAUNCHER_JOB_FILE=joblist
export LAUNCHER_SCHED=interleaved

$LAUNCHER_DIR/paramrun

schristley commented 6 years ago

Here is the output from running the job:

Launcher: Setup complete.

------------- SUMMARY ---------------
   Number of hosts:    4
   Working directory:  /scratch/01114/vdj/vdj/launcher-test
   Processes per host: 4
   Total processes:    16
   Total jobs:         12
   Scheduling method:  interleaved

-------------------------------------
Launcher: Starting parallel tasks...
Launcher: Task 0 running job 1 on c479-111.stampede2.tacc.utexas.edu (echo 1)
Launcher: Task 3 running job 4 on c479-111.stampede2.tacc.utexas.edu (echo 4)
1
4
Launcher: Task 1 running job 2 on c479-111.stampede2.tacc.utexas.edu (echo 2)
2
Launcher: Task 2 running job 3 on c479-111.stampede2.tacc.utexas.edu (echo 3)
3
Launcher: Job 2 completed in 0 seconds.
Launcher: Job 1 completed in 0 seconds.
Launcher: Job 3 completed in 0 seconds.
Launcher: Job 4 completed in 0 seconds.
Launcher: Task 1 done. Exiting.
Launcher: Task 0 done. Exiting.
Launcher: Task 3 done. Exiting.
Launcher: Task 2 done. Exiting.
Launcher: Task 5 running job 6 on c479-112.stampede2.tacc.utexas.edu (echo 6)
Launcher: Task 6 running job 7 on c479-112.stampede2.tacc.utexas.edu (echo 7)
Launcher: Task 7 running job 8 on c479-112.stampede2.tacc.utexas.edu (echo 8)
Launcher: Task 4 running job 5 on c479-112.stampede2.tacc.utexas.edu (echo 5)
6
7
8
5
Launcher: Task 10 running job 11 on c490-084.stampede2.tacc.utexas.edu (echo 11)
11
Launcher: Task 8 running job 9 on c490-084.stampede2.tacc.utexas.edu (echo 9)
9
Launcher: Task 13 running job 14 on c490-091.stampede2.tacc.utexas.edu (echo 12)
Launcher: Task 15 running job 16 on c490-091.stampede2.tacc.utexas.edu (echo 12)
12
12
Launcher: Task 14 running job 15 on c490-091.stampede2.tacc.utexas.edu (echo 12)
12
Launcher: Task 11 running job 12 on c490-084.stampede2.tacc.utexas.edu (echo 12)
12
Launcher: Task 9 running job 10 on c490-084.stampede2.tacc.utexas.edu (echo 10)
10
Launcher: Task 12 running job 13 on c490-091.stampede2.tacc.utexas.edu (echo 12)
12
Launcher: Job 5 completed in 0 seconds.
Launcher: Job 7 completed in 0 seconds.
Launcher: Job 8 completed in 0 seconds.
Launcher: Job 11 completed in 0 seconds.
Launcher: Job 6 completed in 0 seconds.
Launcher: Job 9 completed in 0 seconds.
Launcher: Task 7 done. Exiting.
Launcher: Job 14 completed in 0 seconds.
Launcher: Task 6 done. Exiting.
Launcher: Task 4 done. Exiting.
Launcher: Job 12 completed in 0 seconds.
Launcher: Job 16 completed in 0 seconds.
Launcher: Job 10 completed in 0 seconds.
Launcher: Task 10 done. Exiting.
Launcher: Task 5 done. Exiting.
Launcher: Job 15 completed in 0 seconds.
Launcher: Task 8 done. Exiting.
Launcher: Job 13 completed in 0 seconds.
Launcher: Task 13 done. Exiting.
Launcher: Task 15 done. Exiting.
Launcher: Task 11 done. Exiting.
Launcher: Task 9 done. Exiting.
Launcher: Task 14 done. Exiting.
Launcher: Task 12 done. Exiting.
Launcher: Done. Job exited without errors
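
A quick way to spot the duplicates in a log like the one above is to count how often each command shows up in the "running job" lines. This is only a sketch: `count_duplicates` is a hypothetical helper, and it assumes the log lines have exactly the `Launcher: Task N running job M on HOST (COMMAND)` shape shown here.

```shell
#!/bin/sh
# Hypothetical helper: list every command that launcher reports running
# more than once, in the form "<count>x <command>".
# Usage: count_duplicates LOGFILE
count_duplicates() {
    # Pull the text inside the trailing parentheses from each "running job" line,
    # then count identical commands and keep only those seen more than once.
    sed -n 's/^Launcher: Task [0-9]* running job [0-9]* on [^ ]* (\(.*\))$/\1/p' "$1" |
        sort | uniq -c |
        awk '$1 > 1 { n = $1; sub(/^ *[0-9]+ /, ""); print n "x " $0 }'
}
```

Running `count_duplicates job.out` on the output above should report only `echo 12`, with a count of 5.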

schristley commented 6 years ago

I guess this is a duplicate of #16