TACC / launcher

A simple utility for executing multiple sequential or multi-threaded applications in a single multi-node batch job
MIT License

Adding PBS support. #53

Closed chrisblanton closed 4 years ago

chrisblanton commented 5 years ago

We have added an RMI for PBS-based schedulers, as well as an example script for PBS to correspond to the Slurm and SGE examples.

AJVincelli commented 3 years ago

Hi there, I was just looking through the PBS.rmi file, and I'm not quite understanding it. Specifically, lines 12-13 look like they're getting the total number of nodes (not the number of processes per node):

# Set the number of processes per node
export LAUNCHER_RMI_PPN=$(uniq -c $PBS_NODEFILE | awk '{print $1}' | uniq)

I'm not totally sure though? I haven't tried Launcher with PBS, so it could work just fine.

chrisblanton commented 3 years ago

@AJVincelli This was a small hack to work around the way PBS handles PPN. There should be an environment variable that contains this information, but it is incorrect in our site's implementation (and probably in others), whereas the nodefile has to be correct for MPI to function. Also, unless directed otherwise, our implementation tends to put multiple sets of processes on one node (e.g. nodes=2:ppn=4 may effectively become nodes=1:ppn=8).

Here's how that calculation works:

[cblanton7@atl1-1-02-014-20-r ~]$ uniq -c $PBS_NODEFILE
      4 atl1-1-02-014-20-r.pace.gatech.edu
      4 atl1-1-02-010-29-r.pace.gatech.edu
[cblanton7@atl1-1-02-014-20-r ~]$ uniq -c $PBS_NODEFILE | awk '{print $1}'
4
4
[cblanton7@atl1-1-02-014-20-r ~]$ uniq -c $PBS_NODEFILE | awk '{print $1}' | uniq
4
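
The same calculation can be tried without a PBS allocation. The following is a minimal sketch, assuming a temporary file stands in for $PBS_NODEFILE and that the hostnames node-a and node-b are made up for illustration. It relies on the nodefile listing each hostname once per allocated slot, which is what makes the count per hostname equal to the PPN:

```shell
# Hypothetical stand-in for $PBS_NODEFILE: one line per allocated slot.
nodefile=$(mktemp)
printf '%s\n' node-a node-a node-a node-a node-b node-b node-b node-b > "$nodefile"

# Same pipeline as above: count slots per node, keep only the counts,
# then collapse identical adjacent counts into a single value.
ppn=$(uniq -c "$nodefile" | awk '{print $1}' | uniq)

# For comparison, the node count is the number of distinct hostnames.
nnodes=$(uniq "$nodefile" | wc -l | tr -d ' ')

echo "ppn=$ppn nodes=$nnodes"
rm -f "$nodefile"
```

With this mock file the pipeline reports 4 processes per node across 2 nodes, matching the session above.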

Here's an example of the second thing I was talking about (a nodes=2:ppn=4 request ending up on a single node):

[cblanton7@login-phoenix-2 ~]$ qsub -l nodes=2:ppn=4 -A pace-admins -q inferno -I
qsub: waiting for job 261399.sched-torque.pace.gatech.edu to start
qsub: job 261399.sched-torque.pace.gatech.edu ready

---------------------------------------
Begin PBS Prologue Mon Jan  4 09:28:36 EST 2021
Job ID:     261399.sched-torque.pace.gatech.edu
User ID:    cblanton7
Job name:   STDIN
Queue:      inferno
End PBS Prologue Mon Jan  4 09:28:36 EST 2021
---------------------------------------
[cblanton7@atl1-1-02-014-17-r ~]$ uniq -c $PBS_NODEFILE
      8 atl1-1-02-014-17-r.pace.gatech.edu
[cblanton7@atl1-1-02-014-17-r ~]$ uniq -c $PBS_NODEFILE | awk '{print $1}'
8
[cblanton7@atl1-1-02-014-17-r ~]$ uniq -c $PBS_NODEFILE | awk '{print $1}' | uniq
8
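
One caveat with this pipeline: it yields a single number only when every node holds the same number of slots. If the scheduler hands out an uneven allocation, the output contains one line per distinct count. The sketch below is one possible guard, not what PBS.rmi does; the mock nodefile, the hostnames, and the choice to fall back to the smallest count are all assumptions made for illustration:

```shell
# Hypothetical uneven allocation: 4 slots on node-a, 2 slots on node-b.
nodefile=$(mktemp)
printf '%s\n' node-a node-a node-a node-a node-b node-b > "$nodefile"

# Per-node slot counts, sorted numerically with duplicates removed.
counts=$(uniq -c "$nodefile" | awk '{print $1}' | sort -nu)
distinct=$(printf '%s\n' "$counts" | wc -l | tr -d ' ')

if [ "$distinct" -eq 1 ]; then
    ppn=$counts
else
    # Assumption: use the smallest count so that no node is oversubscribed.
    ppn=$(printf '%s\n' "$counts" | head -n 1)
    echo "warning: uneven allocation, using ppn=$ppn" >&2
fi

echo "ppn=$ppn"
rm -f "$nodefile"
```

Whether the smallest count is the right fallback depends on the site and the workload; the point is only that the result should be checked for being a single value before it is exported.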

As the proof of the pudding, it does seem to work for our users who are using it. For what it's worth, I prefer PyLauncher now, although it does have a slightly steeper learning curve. I think combining what you can do in a Python script with the PyLauncher framework can make for more efficient HTC. It also supports running parallel programs in an HTC manner: HTPC, as I like to call it.

AJVincelli commented 3 years ago

Hi @chrisblanton, interesting! Yes, this makes sense. My misunderstanding was that I thought the PBS_NODEFILE only contained the node names (not the number of processors for each node too). Thank you for explaining! Your explanation restores my confidence that I did the LSF plugin correctly.

And thanks for mentioning PyLauncher, I didn't know it existed... I will check it out.