run code on HPC - GitHub Issues

ndangtt commented 4 months ago

This is an example .slurm script that I use for my job arrays. Suppose I have a list of hundreds of command lines that can be run in parallel (i.e., they are independent of each other). All of them are put in a cmds.txt file, one command per line. The following script uses Slurm's job array feature to launch those commands in parallel (Slurm does the scheduling); each command gets 1 core and a maximum run time of 15 hours. Note that --array=0-9 covers only the first 10 lines of cmds.txt, so adjust the range to the number of commands you have.

(In my experiments, each of those commands usually corresponds to one RL agent training run.)

Important notes:

Please replace my project code sc122-nguyen with your project code, sc122-dimitri. (I have currently allocated 40k CPU hours to it; if you need more, please let me know.)
Please do not use the --exclusive flag in this context. It is a Slurm flag that reserves a whole compute node for each job, so we would be charged for the whole node for every job even though each one uses only 1 core.

run.slurm

#!/bin/bash
# Slurm job options (job time, compute nodes, tasks per node, array range)

#SBATCH --time=15:00:00
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
# One array task per command: task IDs 0-9 run lines 1-10 of cmds.txt
#SBATCH --array=0-9

# Replace sc122-nguyen below with your own budget code
#SBATCH --account=sc122-nguyen
# We use the "standard" partition as we are running on CPU nodes
#SBATCH --partition=standard
# We use the "standard" QoS as our runtime is less than 4 days
#SBATCH --qos=standard

module load gcc
module load anaconda

# Change to the submission directory
cd "$SLURM_SUBMIT_DIR"

# Array task IDs start at 0 here, but lines in cmds.txt are numbered from 1
id=$((SLURM_ARRAY_TASK_ID+1))
./run.sh "$id"
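
The --array=0-9 range above is hard-coded, so it has to be kept in sync with cmds.txt by hand. As a minimal sketch, assuming cmds.txt has one command per line and no blank lines, the range could be derived from the file instead. The helper name array_range is hypothetical; an --array value passed on the sbatch command line takes precedence over the #SBATCH --array line in the script.

```shell
#!/bin/bash
# array_range FILE: print the 0-based Slurm array range that covers every
# line of FILE (line N is handled by array task N-1, as in run.slurm).
array_range() {
    local n
    # grep -c '' counts lines, including a last line with no trailing newline
    n=$(grep -c '' "$1" 2>/dev/null) || true
    if [ "${n:-0}" -eq 0 ]; then
        echo "array_range: $1 is empty or missing" >&2
        return 1
    fi
    echo "0-$((n - 1))"
}

# Example submission (commented out so that nothing is submitted by accident):
#   sbatch --array="$(array_range cmds.txt)" run.slurm
```

For a cmds.txt with 300 lines this would print 0-299, which matches the +1 shift from task ID to line number in run.slurm.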

run.sh

#!/bin/bash

# activate the conda environment (this needs to be done every time the job is run)
. /mnt/lustre/indy2lfs/work/sc122/sc122/nttd-sc122/setup-conda.sh

# run line number $id of cmds.txt (sed quits at that line and prints only it)
id=$1
eval "$(sed "${id}q;d" cmds.txt)"
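
Before submitting, it can help to check locally which command each array task would pick up. The following is only a sketch and not part of the original scripts: the helper names nth_command and dry_run and the demo commands are made up for illustration. It relies on sed "Nq;d" printing only line N of a file, and on the same +1 shift from task ID to line number that run.slurm uses.

```shell
#!/bin/bash
# nth_command N FILE: print line N (1-based) of FILE, the way run.sh selects it.
nth_command() {
    sed "${1}q;d" "$2"
}

# dry_run FIRST LAST FILE: for each array task ID in FIRST..LAST, print the
# command that task would run, without executing it.
dry_run() {
    local task id
    for task in $(seq "$1" "$2"); do
        id=$((task + 1))    # same shift as in run.slurm
        printf 'task %s -> %s\n' "$task" "$(nth_command "$id" "$3")"
    done
}

# Demo on a throwaway file with hypothetical commands
demo=$(mktemp)
printf 'echo first\necho second\necho third\n' > "$demo"
dry_run 0 2 "$demo"
rm -f "$demo"
```

A task ID that points past the end of the file shows up as an empty command, which is an easy way to spot an --array range that is too large.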

Documentation of Cirrus can be found at: https://docs.cirrus.ac.uk/user-guide/introduction/

dimitri-rusin commented 4 months ago

Thank you so much!

Gonna try soon!

ndangtt commented 3 months ago

Useful Slurm command:

List your current jobs: squeue -u <username> -o "%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R"
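
With hundreds of array tasks the full listing gets long. One possible way to condense it, sketched under the assumption that squeue is on the PATH and $USER is your cluster username, is to print only the compact state code (%t, e.g. R for running, PD for pending) without the header (-h) and count the occurrences. The function name summarize_states is hypothetical.

```shell
#!/bin/bash
# summarize_states: read one job state code per line on stdin (e.g. R, PD)
# and print "STATE COUNT" pairs, sorted by state code.
summarize_states() {
    awk 'NF { count[$1]++ } END { for (s in count) print s, count[s] }' | sort
}

# Only query the scheduler when squeue is actually available.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER" -h -o "%t" | summarize_states
fi
```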

dimitri-rusin commented 3 months ago

@ndangtt Could you please specify a useful summary plot for the experiments? You were talking about having either hitting-time or area-under-the-curve plots across all experiment settings. Could you elaborate a bit more in writing?

Thanks!

ndangtt commented 3 months ago

@dimitri-rusin: sure! I've made a note here: https://github.com/dimitri-rusin/oll_onemax/issues/11

dimitri-rusin / oll_onemax

run code on HPC #10