dwelter / pestpp

PEST++ inverse model project

Running pestpp on multiple nodes is slower #35

Open hamiddashti opened 5 years ago

hamiddashti commented 5 years ago

This issue might not really be related to pestpp itself; it may be more about my bash scripting or our cluster setup. I'm using a cluster with 16 nodes, and each node carries 28 cores. I can run pestpp-gsa in parallel using the worker/slave setup on one node with the slurm script below:

#!/bin/bash
#SBATCH -n 1               # total number of tasks requested
#SBATCH --cpus-per-task=1  # cpus to allocate per task
#SBATCH -p shortq          # queue (partition) -- defq, eduq, gpuq.
#SBATCH -t 12:00:00        # run time (hh:mm:ss) - 12.0 hours in this.

cd /home/hdashti/scratch/ED_BSU/old_ed2/26jan17/ED/working_morris_ws/master
pestpp-gsa gsa_karun /h :4004 &
MASTER_PID=$!
cd /home/hdashti/scratch/ED_BSU/old_ed2/26jan17/ED/working_morris_ws
parallel -i bash -c "cd {} ; pestpp-gsa gsa_karun /h 127.0.0.1:4004" -- wrk1 wrk2 wrk3 wrk4 wrk5 wrk6 wrk7 wrk8 wrk9 wrk10 wrk11 wrk12 wrk13 wrk14 wrk15 wrk16 wrk17 wrk18 wrk19 wrk20
kill ${MASTER_PID}

The above script, which uses 20 cores of one node, works fine. I then tried to use multiple nodes with more workers via the following script:

#!/bin/bash
#SBATCH -N 4
#SBATCH --tasks-per-node=28
#SBATCH -p defq
#SBATCH -t 120:00:00

ulimit -u 9999
ulimit -s unlimited
ulimit -v unlimited

cd /home/hdashti/scratch/ED_BSU/old_ed2/26jan17/ED/working_morris_ws/master
pestpp-gsa gsa_karun /h :4004 &
MASTER_PID=$!

LEADER=$SLURMD_NODENAME
NODELIST=($(scontrol show hostname $SLURM_JOB_NODELIST))
FOLDERS=($(seq 1 112))
for i in $(seq 0 111); do
    ssh -f ${NODELIST[$(echo "$i % 4" | bc)]} "cd /home/hdashti/scratch/ED_BSU/old_ed2/26jan17/ED/working_morris_ws/wrk${FOLDERS[$i]} ; nohup pestpp-gsa gsa_karun /h ${LEADER}:4004 > worker.log &"
done
wait ${MASTER_PID}

Although I'm now using 112 cores, it takes much longer for pestpp to finish. I was wondering whether anyone else has run into the same problem, or am I missing something here? I'm posting this here because I'm not sure if it's a pestpp problem or our cluster setup. Thanks

jtwhite79 commented 5 years ago

I've only run pestpp on a slurm cluster a few times, but it scaled well (up to 1000 workers). @mwtoews provided me with the slurm scripts - maybe he has some insights about your scripts?

mwtoews commented 5 years ago

I haven't seen any scaling issues with slurm, but perhaps we're using different paradigms.

My pest[pp*|_hp] runs with Slurm consist of one master job, and one or more worker jobs. The master normally requests more RAM for inversions, and the workers only request the amount of RAM needed to run the simulations, which is often different (usually smaller). The workers are submitted as a multiple program configuration with srun --multi-prog multi.conf.


Here are some partial bits of the four files used to orchestrate Slurm runs.

master.sl

#!/bin/bash
#SBATCH --job-name=master
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=4G
#...
# Kick off master
srun master.sh pestpp-ies /path/to/master file.pst

master.sh

#!/bin/bash
set -e
pstbin=$1
pstdir=$2
pstfile=$3
pstflg=$4

cd "$pstdir"

# Get available port from host, write master.txt for workers
masterport=`python - <<EOF
import socket
s = socket.socket()
s.bind(('', 0))
print(s.getsockname()[1])
s.close()
EOF`
echo `hostname -s`:$masterport > master.txt

echo "Starting master: $pstbin $pstfile $pstflg /H :$masterport"
$pstbin $pstfile $pstflg /H :$masterport

workers.sl

#!/bin/bash
#SBATCH --job-name=workers-1
#SBATCH --ntasks=100
#SBATCH --mem-per-cpu=300M
#...
export PST_BIN=pestpp-ies
export MASTER_DIR=/path/to/master
export PST_FILE=file.pst
export PUT_DIR=/path/to/source/files

# Create a file for multi-prog srun
cd /scratch/workers/1  # this worker directory needs to be adjusted for each submission
touch multi.conf
for (( N=0; N<$SLURM_NTASKS; N++)); do
    echo "$N workers.sh $PST_BIN $MASTER_DIR $PST_FILE $PUT_DIR $N" >> multi.conf
done

# Kick off workers
srun --multi-prog multi.conf
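
For reference, with the placeholder values exported above and ntasks=100, the generated multi.conf should look roughly like this; each line pairs a task rank with the command that rank runs (the paths are the example placeholders, not real ones):

0 workers.sh pestpp-ies /path/to/master file.pst /path/to/source/files 0
1 workers.sh pestpp-ies /path/to/master file.pst /path/to/source/files 1
...
99 workers.sh pestpp-ies /path/to/master file.pst /path/to/source/files 99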

workers.sh

#!/bin/bash
set -e
pstbin=$1
pstdir=$2
pstfile=$3
putdir=$4
instance=$5

cp -rp $putdir $instance
cd $instance
echo "Worker `hostname -s` running in `pwd`"

# Get master hostname and port
masterpath=$pstdir/master.txt
masterhostport=$(cat "$masterpath")

echo "Starting worker: $pstbin $pstfile /H $masterhostport"
$pstbin $pstfile /H $masterhostport

So I'd usually do a sbatch master.sl to start the master, then do sbatch workers.sl one or more times after the master has started, each time with a separate suite of worker directories and often a different ntasks, depending on how busy the HPC is. Hope this helps!
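
For anyone following along, a rough sketch of that submission sequence, assuming the file names above (the extra ntasks value and the second worker submission are illustrative, not prescribed):

sbatch master.sl                 # start the master; master.sh writes master.txt with host:port
# wait until the master job is running (check squeue) and master.txt exists
sbatch workers.sl                # first suite of workers (ntasks=100 in the script)
# to add more workers later, point workers.sl at a fresh worker directory, then e.g.:
sbatch --ntasks=200 workers.sl   # command-line options override the #SBATCH directives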

hamiddashti commented 5 years ago

Thank you @jtwhite79 and @mwtoews. We are working on it and hopefully it will be resolved. I'll keep you posted.

jkennedy-usgs commented 5 years ago

@hamiddashti did you get the pestpp/SLURM setup working well? I'm running into possibly the same issue, where it runs slower as a SLURM job than at an interactive prompt. I will try @mwtoews's solution, but I thought maybe you'd found something simpler. Thanks-