TACC / launcher

A simple utility for executing multiple sequential or multi-threaded applications in a single multi-node batch job
MIT License

Here are the files for LSF #62

Open AJVincelli opened 3 years ago

Hello,

I recently got Launcher to work on an LSF cluster! In case it's helpful, here are the files I used. These files include excessive comments and echo statements. Since I'm not an expert, I welcome any review and corrections.

This code was inspired by Issue #57. Thank you, @siliu-tacc and @lwilson!

The "LSF.rmi" file, in the "plugins" folder:

#Launcher Resource Manager Integration (RMI) file for LSF

#Create a hostfile and set LAUNCHER_RMI_HOSTFILE
export LAUNCHER_RMI_HOSTFILE=`mktemp -t launcher.$LSB_JOBID.hostlist.XXXXXXXX`
echo "The hostfile name is:" $LAUNCHER_RMI_HOSTFILE

#Populate the hostfile
echo "The LSB_MCPU_HOSTS variable is:" $LSB_MCPU_HOSTS
echo "$LSB_MCPU_HOSTS" | awk '{for (j=1; j <= NF; j+=2) { print $j }}' > "$LAUNCHER_RMI_HOSTFILE" # Extracts every other word (the host names) from LSB_MCPU_HOSTS and writes each one as a new line in LAUNCHER_RMI_HOSTFILE; piping the variable straight into awk avoids leaving a scratch file in the working directory
echo "The hostfile contents are:"
echo "$(<$LAUNCHER_RMI_HOSTFILE)"
#cp $LAUNCHER_RMI_HOSTFILE $HOME/Launcher_Test # Useful if you need to see the contents of the hostfile for troubleshooting
#echo "The LAUNCHER_RMI_HOSTFILE variable is: " $LAUNCHER_RMI_HOSTFILE # Useful for troubleshooting

#Set the number of hosts/nodes
export LAUNCHER_RMI_NHOSTS=$(( `echo $LSB_MCPU_HOSTS | wc -w` / 2))  # The actual number of nodes/hosts assigned to run your job, should be the same as -n / ptile
echo "The number of hosts/nodes is:" $LAUNCHER_RMI_NHOSTS

#Set the number of processes/tasks per node
export LAUNCHER_RMI_PPN=`echo $LSB_MCPU_HOSTS | awk '{print $2}'` # Assumes that each node/host has been assigned the same number of processes; not the same as ptile if > 1 task/process per core
echo "The number of processes/tasks per node is:" $LAUNCHER_RMI_PPN
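To see what the parsing above does outside of a job, here is a minimal sketch that runs the same awk and wc steps on a made-up LSB_MCPU_HOSTS value. The host names and slot counts are hypothetical; on a real cluster LSF sets the variable itself. It also counts the distinct slot values, as one possible way to check the "same number of processes on every host" assumption that LAUNCHER_RMI_PPN relies on.

```shell
#!/bin/bash
# Hypothetical sample value in the "host slots host slots ..." layout.
LSB_MCPU_HOSTS="node01 15 node02 15 node03 15"

# Odd-numbered fields are the host names, one per line (as in the hostfile).
hosts=$(echo "$LSB_MCPU_HOSTS" | awk '{for (j = 1; j <= NF; j += 2) print $j}')

# Half the word count is the number of hosts.
nhosts=$(( $(echo "$LSB_MCPU_HOSTS" | wc -w) / 2 ))

# The second field is the slot count of the first host.
ppn=$(echo "$LSB_MCPU_HOSTS" | awk '{print $2}')

# Number of distinct slot counts; any value other than 1 means the hosts
# were not all given the same number of slots.
distinct=$(( $(echo "$LSB_MCPU_HOSTS" | awk '{for (j = 2; j <= NF; j += 2) print $j}' | sort -u | wc -l) ))

echo "$hosts"
echo "nhosts=$nhosts ppn=$ppn distinct_slot_counts=$distinct"
```

If distinct_slot_counts comes out greater than 1, the PPN value taken from the first host would not describe the other hosts, so the job request may need adjusting.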

The "launcher.lsf" file, in the "extras/batch-scripts" folder:

#!/bin/bash
#
# Launcher batch script file for LSF systems
# January 2nd, 2021
#
# Simple LSF script for submitting multiple serial
# jobs (e.g. parametric studies); this script wrapper
# launches the jobs.
#
#-------------------------------------------------------
# 
#         <------ Setup Parameters ------>
#
#BSUB -J launcher                        # Job name
#BSUB -n 300                             # Total number of cores/processors requested, set this to be the same as the total # tasks to avoid running multiple tasks sequentially on the same core
#BSUB -R span[ptile=15]                  # Number of cores/processors per node requested, calculate the total # nodes/hosts requested by dividing n (total # cores) by this ptile (# cores/processors per node), so 300/15=20 nodes requested, don't exceed the max of the queue
#BSUB -R rusage[mem=32]                  # Memory required per core/slot, default is 1 GB (1024), the total memory requested is rusage multiplied by the # cores, don't exceed the max of the queue
#BSUB -q short                           # Queue name
#BSUB -oo HelloWorld300LSFLauncher.log   # Name of the job log file, will say if any errors occurred, overwrites old log file
#BSUB -eo HelloWorld300LSFError.log      # Name of the error log file, and describes errors if any, overwrites old error file
#BSUB -W 0:01                            # Run time (h:mm), don't exceed the max of the queue
#
#------------------------------------------------------
# On LSF systems, "blaunch" is preferred over "ssh" (and sometimes ssh is not allowed).
# If you get a "Host key verification failed" error, replace the word "ssh" in line 308 of the "paramrun" file with the word "blaunch".
#------------------------------------------------------

module purge # Start with a clean environment
#module load python3/3.5.0 # Uncomment and customize this line if the cluster's default Python version is older than 2.7 

export LAUNCHER_DIR=$HOME/launcher
export LAUNCHER_PLUGIN_DIR=$LAUNCHER_DIR/plugins
export LAUNCHER_RMI=LSF
export LAUNCHER_WORKDIR=`pwd`
export LAUNCHER_JOB_FILE=helloworld_multi_output   # Should point to your job file, BE SURE TO INCLUDE AN EMPTY LINE AT THE END OF THE COMMAND SCRIPT (line 301) or Launcher will not push the last command in the list!

$LAUNCHER_DIR/paramrun
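
The note on LAUNCHER_JOB_FILE asks for an empty line at the end of the command script. Below is a minimal sketch of one way to generate a job file and confirm that its last command is terminated by a newline. It assumes that "an empty line at the end" means a trailing newline after the final command; the file name and the three echo commands are hypothetical stand-ins for a real task list.

```shell
#!/bin/bash
# Hypothetical job file; a real one would hold one task command per line.
job_file=$(mktemp)

# printf '%s\n' ends every line, including the last one, with a newline.
for i in 1 2 3; do
    printf '%s\n' "echo \"Hello from task $i\""
done > "$job_file"

# Command substitution strips a trailing newline, so an empty result from
# tail -c 1 on a non-empty file means the file ends with a newline.
if [ -s "$job_file" ] && [ -z "$(tail -c 1 "$job_file")" ]; then
    ends_with_newline=yes
else
    ends_with_newline=no
fi

# wc -l counts newline characters, i.e. properly terminated commands.
ntasks=$(( $(wc -l < "$job_file") ))
echo "tasks=$ntasks ends_with_newline=$ends_with_newline"
rm -f "$job_file"
```

Running the same tail -c 1 check on an existing job file before submitting may catch a missing final newline early.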