environmental-forecasting / model-ensembler

Model Ensemble tool for batch workflows on HPCs
https://pypi.org/project/model-ensembler/
MIT License

Running multiple jobs on one node on archer 2 #39

Open CRosieWilliams opened 2 years ago

CRosieWilliams commented 2 years ago

For running large ensembles on ARCHER2, we need a way to submit multiple jobs to the same node. We have this documentation from ARCHER2:

https://docs.archer2.ac.uk/user-guide/scheduler/#running-multiple-subjobs-that-each-use-a-fraction-of-a-node
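My reading of that page is that the subjobs get packed into a single batch script with backgrounded srun calls, roughly like this (a minimal sketch only; the run_member scripts are placeholders, not our actual WAVIhpc case scripts):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --time=00:23:30
#SBATCH --partition=standard
#SBATCH --qos=standard
#SBATCH --account=n02-bas

# Each subjob asks for an explicit share of the node; --exact stops srun
# taking the whole allocation, and & backgrounds it so the two run side by side.
srun --ntasks=1 --exact --mem=12500M ./run_member_0 &
srun --ntasks=1 --exact --mem=12500M ./run_member_1 &
wait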

But with the same settings as we use on the BAS HPC, each job on ARCHER2 ends up on a separate node.

I've tried a bit of fiddling around with the SBATCH options but it still submits each job to a separate node, which is prohibitively expensive.

Example of the SBATCH directives from run_ensemble_member:

#SBATCH --output=/mnt/lustre/a2fs-work2/work/n02/n02/chll1/WAVIhpc/cases/PIGTHW3km_test_runs-0/job.%j.%N.out
#SBATCH --error=/mnt/lustre/a2fs-work2/work/n02/n02/chll1/WAVIhpc/cases/PIGTHW3km_test_runs-0/job.%j.%N.err
#SBATCH --chdir=/mnt/lustre/a2fs-work2/work/n02/n02/chll1/WAVIhpc/cases/PIGTHW3km_test_runs-0
#SBATCH --mail-type=begin,end,fail,requeue
#SBATCH --mail-user=chll1@bas.ac.uk
#SBATCH --job-name=PIGTHW3km_test_runs-0
#SBATCH --nodes=1
#SBATCH --partition=standard
#SBATCH --qos=standard
#SBATCH --account=n02-bas
#SBATCH --cpus-per-task=1
#SBATCH --tasks-per-node=2
#SBATCH --time=00:23:30
#SBATCH --signal=B:SIGINT@60

I also can't get the ensemble running if I use the SBATCH --mem option:

#SBATCH --mem=12500M

JimCircadian commented 2 years ago

The solution to this will be job arrays. There are other mechanisms by which we could run jobs in parallel, but they wouldn't be portable. The option to use job arrays is necessary, and perhaps even preferable as a default; it shouldn't need to be a wholesale change, since which approach works best depends on the cluster configuration.

This is a big change but might actually make more sense in the long run, as semantically things will match up. That said, it's not without its problems, so I'll draft it up as a feature in the first instance and work out the best way forward once things are sorted for WAVI.
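To illustrate the sort of thing I mean (a rough sketch only; the array size, directory naming and the assumption that run_ensemble_member is directly executable per member are illustrative, not what model-ensembler currently generates):

#!/bin/bash
#SBATCH --job-name=PIGTHW3km_test_runs
#SBATCH --array=0-9%4
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --time=00:23:30
#SBATCH --partition=standard
#SBATCH --qos=standard
#SBATCH --account=n02-bas

# One submission covers ten members (%4 caps how many run concurrently);
# SLURM_ARRAY_TASK_ID selects which member's run directory this element uses.
cd "PIGTHW3km_test_runs-${SLURM_ARRAY_TASK_ID}"
srun ./run_ensemble_member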

JimCircadian commented 2 years ago

Discussed with Rosie: this is definitely on the cards, but because the ordered list of jobs has to be recombined on repeat batch runs for WAVIhpc to benefit from node sharing, run directory generation has to be detached from the list of runs being considered. The skip functionality means we can symlink from, say, a batchrun-2 directory to a dynamically generated run directory, so that the array's srun submissions only pick up the runs that still require execution and those runs are packed onto the minimum number of nodes.
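Something like the following is what I have in mind for that indirection (entirely a sketch; run_is_complete, the array_slot naming and submit_slot.sh are hypothetical, nothing like this exists in model-ensembler yet):

# Collect the runs the skip check says still need executing and give each
# one a contiguous array slot via a symlink.
i=0
for run in batchrun-*; do
    if ! run_is_complete "$run"; then          # hypothetical skip check
        ln -sfn "$run" "array_slot-${i}"       # slot i -> pending run directory
        i=$((i + 1))
    fi
done

# One array element per pending run; submit_slot.sh would cd into
# array_slot-$SLURM_ARRAY_TASK_ID and launch the member there.
sbatch --array=0-$((i - 1)) submit_slot.sh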

As this really isn't a small change and requires some good testing, it's not happening immediately, especially as the current workflow is fine on the BAS HPC. Additionally, the need for this is somewhat reduced if we distribute computation, which solves other problems. However, the option to leverage Slurm arrays should be available, so it will get implemented for 0.6.0.