Urban-M4 / snakemake-wrf

Random notes, scripts, et cetera
Apache License 2.0

Parallelization of WRF #9

Open Peter9192 opened 2 months ago

Peter9192 commented 2 months ago

I was looking into the different options for parallelizing WRF; I always find it quite confusing.

From what I understand now, it works like this: the domain is decomposed into patches, one per MPI task (DMPAR), and each patch can be further split into tiles that are processed by OpenMP threads (SMPAR).

The reason to use MPI is that each patch has fewer grid cells to process, i.e. shorter loops, i.e. faster execution. However, the overhead for communication between patches increases with the number of patches.

From this, I constructed the following test script:

```bash
#!/bin/bash
#SBATCH --job-name=wrf_experiment     # Job name
#SBATCH --partition=thin              # Request thin partition. Up to one hour goes to the fast queue
#SBATCH --time=1:00:00                # Maximum runtime (D-HH:MM:SS)
#SBATCH --nodes=1                     # Number of nodes (one thin node has 128 cpu cores)
#SBATCH --ntasks=64                   # Total number of MPI tasks / number of patches in the domain - parallelized with MPI / DMPAR / multiprocessing
#SBATCH --cpus-per-task=2             # Number of CPU cores per task / number of tiles within each patch - parallelized with OpenMP / SMPAR / multithreading

# Note: ntasks * cpus-per-task should not exceed the total available cores on the requested nodes
# 64*2 = 128 exactly fits on one thin node.

# Load dependencies
module load 2023
module load netCDF-Fortran/4.6.1-gompi-2023a  # also loads gcc and gompi
module load Python/3.11.3-GCCcore-12.3.0

export NETCDF=$(nf-config --prefix)

# Each process can do multithreading, but limited to the number of cpu cores allocated to each process
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

mpiexec $HOME/wrf-model/WRF/run/wrf.exe
```
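
For reference, the per-timestep timings quoted below come from WRF's rsl log files. A minimal sketch of how to submit and check them (assuming the job script is saved as `run_wrf.sh`, a hypothetical name, and that wrf.exe runs from the run directory):

```bash
# Submit the job script above (hypothetical file name)
sbatch run_wrf.sh

# WRF writes per-timestep wall-clock times to its rsl logs in the run directory
# (rsl.out.0000 or rsl.error.0000, depending on the build)
grep "Timing for main" rsl.out.0000
```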

Then, I did a few small sensitivity tests with my current test case. Here are the results:

On one thin node of 128 cores:

| n_tasks | cpus_per_task | Timing for main on D01 after completing 1 minute | Comment |
|---|---|---|---|
| 16 | 8 | 128 s | First try, 'evenly' |
| 8 | 16 | 162 s | Less DMPAR, more SMPAR |
| 128 | 1 | - | Crashed: fewer than 10 grid cells per patch in the y-direction (7) |
| 64 | 2 | 43 s | Apparently DMPAR scales better than SMPAR for this case |

Peter9192 commented 2 months ago

Some more statistics (after finishing the job, i.e. 1 hour simulated):

Nodes: 1
Cores per node: 128
CPU Utilized: 1-11:25:40
CPU Efficiency: 49.18% of 3-00:02:08 core-walltime
Job Wall-clock time: 00:33:46
Memory Utilized: 51.87 GB
Memory Efficiency: 23.16% of 224.00 GB

So there's still room for improvement :-)
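
These numbers look like the output of Slurm's `seff` tool; for reference, a sketch of how to reproduce them for a finished job:

```bash
# Efficiency summary for a completed job (job id as reported by sbatch/squeue)
seff <jobid>
```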

Peter9192 commented 1 month ago

I just realized that I performed the above tests with WRF compiled for dmpar only, so it makes sense that DMPAR scaled better. I should try again with compile option 35 (dm+sm) instead of 34 (dmpar).
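
For reference, a sketch of the reconfigure/recompile step, assuming the standard WRF build workflow and the GNU toolchain (where 34 is the dmpar option and 35 the dm+sm option):

```bash
cd $HOME/wrf-model/WRF
./clean -a                          # remove objects from the previous (dmpar-only) build
./configure                         # interactively select option 35 (GNU, dm+sm) instead of 34 (GNU, dmpar)
./compile em_real >& compile.log    # rebuild wrf.exe and real.exe
```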

Peter9192 commented 1 month ago

New set of tests with WRF compiled for DMPAR + SMPAR and the reference setup for high-res Amsterdam.

Using the rome partition (128 cores per node):

| nodes | n_tasks | cpus_per_task | Timing for main on D01 after completing 10 seconds | Comment |
|---|---|---|---|---|
| 1 | 16 | 8 | 145 s | |
| 1 | 8 | 16 | 193 s / 144 s | |
| 1 | 128 | 1 | - | crashed due to too small domain |
| 1 | 1 | 128 | - | terribly slow: after 5 minutes it had not yet solved 2 seconds of simulation time |
| 1 | 64 | 2 | 48 s | |
| 1/2 | 64 | 1 | 46 s | faster without multithreading than with only two threads |
| 4 | 64 | 8 | 68 s / 56 s | more threads don't seem to scale at all |
| 8 | 64 | 16 | 91 s | indeed, more threads don't help |
| 1/4 | 24 | 1 | 95 s / 86 s | sharing node; upper range of sweet spot according to https://forum.mmm.ucar.edu/threads/choosing-an-appropriate-number-of-processors.5082/ |
| 1 | 4 | 32 | 66 s | using `--map-by node:PE=$OMP_NUM_THREADS --rank-by core` |
| 1 | 32 | 4 | 21 s | same mapping options |
| 1 | 64 | 2 | 25 s | same mapping options |
| 2 | 64 | 4 | 18 s | same mapping options |
| 2 | 32 | 8 | 19 s | same mapping options |
| 4 | 32 | 16 | 18 s | same mapping options |
| 4 | 64 | 8 | 21 s | same mapping options |
| 8 | 64 | 16 | 18 s | same mapping options |

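For reference, a sketch of the adjusted launch line for the mapped/ranked runs above (Open MPI syntax, same wrf.exe path as in the job script):

```bash
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
# Give each MPI rank PE=$OMP_NUM_THREADS cores to spread its OpenMP threads over,
# and assign ranks consecutively by core
mpiexec --map-by node:PE=$OMP_NUM_THREADS --rank-by core $HOME/wrf-model/WRF/run/wrf.exe
```
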
Peter9192 commented 1 month ago

Note: the above timings are for the first timestep. After that, the simulation speeds up, and subsequent timings for main are roughly two times faster. So we're at roughly 1:1 simulated time vs. run time.
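
A quick way to look beyond the first timestep (a sketch; the awk field index assumes the "Timing for main" line format that WRF prints, with the elapsed seconds as the third-to-last field):

```bash
# Average the elapsed seconds over all reported timesteps in the lead rank's log
grep "Timing for main" rsl.out.0000 | awk '{sum += $(NF-2); n++} END {print sum/n, "s per step"}'
```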

Peter9192 commented 1 month ago

New benchmarks, now also comparing with the Intel compilers, and run on Genoa nodes for a change.

GNU compilers (like above) - Genoa with 24 tasks / 8 cpus per task (1 node) - no mapping/binding/ranking
Timing for main: time 2019-07-23_06:00:10 on domain   1:   64.65104 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:   49.12214 elapsed seconds

GNU compilers (like above) - Genoa with 24 tasks / 8 cpus per task (1 node) - `--map-by node:PE=$OMP_NUM_THREADS --rank-by core`
Timing for main: time 2019-07-23_06:00:10 on domain   1:   14.50725 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    6.32626 elapsed seconds

GNU compilers (like above) - Genoa with 24 tasks / 16 cpus per task (2 nodes) - `--map-by node:PE=$OMP_NUM_THREADS --rank-by core`
Timing for main: time 2019-07-23_06:00:10 on domain   1:   21.65084 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:   10.15081 elapsed seconds

Intel compilers Genoa with 24 tasks / 8 cpus per task (1 node) - no mapping/binding/ranking
Timing for main: time 2019-07-23_06:00:10 on domain   1:   16.87352 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    7.99886 elapsed seconds

Intel compilers Genoa with 24 tasks / 16 cpus per task (2 nodes) - no extra specifiers
Timing for main: time 2019-07-23_06:00:10 on domain   1:   40.71106 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:   25.24836 elapsed seconds

Intel compilers Genoa with 24 tasks / 16 cpus per task (2 nodes) - `--ppn $((SLURM_NTASKS / SLURM_JOB_NUM_NODES))`
Timing for main: time 2019-07-23_06:00:10 on domain   1:   40.94793 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:   28.25645 elapsed seconds

Intel compilers Genoa with 24 tasks / 16 cpus per task (2 nodes) - `export I_MPI_PIN_DOMAIN=omp`
Timing for main: time 2019-07-23_06:00:10 on domain   1:   33.41033 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:   34.02773 elapsed seconds

Intel compilers Genoa with 24 tasks / 8 cpus per task (1 node) - `export I_MPI_PIN_DOMAIN=omp`
Timing for main: time 2019-07-23_06:00:10 on domain   1:   16.24445 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    7.98943 elapsed seconds

The Intel version doesn't seem to be much faster out of the box, nor does it scale better without further tuning. It also seems to respond less well to my tweaking attempts with `-ppn` and domain pinning.
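
For reference, a sketch of the Intel MPI variants tried above (hypothetical job-script excerpt; `-ppn` and `I_MPI_PIN_DOMAIN` are Intel MPI's process-placement and pinning knobs):

```bash
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
# Pin each MPI rank to a domain sized to its OpenMP team
export I_MPI_PIN_DOMAIN=omp
# Explicitly set the number of ranks per node
mpiexec -ppn $((SLURM_NTASKS / SLURM_JOB_NUM_NODES)) $HOME/wrf-model/WRF/run/wrf.exe
```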

Peter9192 commented 1 month ago

Running with srun instead of mpirun/mpiexec might be able to map to the hardware automatically and/or better: https://nrel.github.io/HPC/blog/2021-06-18-srun/#3-why-not-just-use-mpiexecmpirun
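
For reference, a sketch of the srun-based launch used in the runs below (the binding variables are standard OpenMP; the values were varied per run as noted):

```bash
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PLACES=cores        # also tried: threads, sockets
export OMP_PROC_BIND=spread    # also tried: true, close, master
# srun starts one MPI rank per Slurm task and inherits the job's CPU allocation
srun $HOME/wrf-model/WRF/run/wrf.exe
```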

GNU compilers - Genoa with 24 tasks / 16 cpus per task (2 nodes) - srun
Timing for main: time 2019-07-23_06:00:10 on domain   1:   22.77393 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:   10.48143 elapsed seconds

GNU compilers - Genoa with 24 tasks / 16 cpus per task (2 nodes) - srun - OMP_PLACES=cores and OMP_PROC_BIND=spread
Timing for main: time 2019-07-23_06:00:10 on domain   1:   15.32052 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    6.33800 elapsed seconds

GNU compilers - Genoa with 24 tasks / 8 cpus per task (1 node) - srun - OMP_PLACES=cores and OMP_PROC_BIND=spread
Timing for main: time 2019-07-23_06:00:10 on domain   1:   14.41277 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    6.80804 elapsed seconds

GNU compilers - Genoa with 24 tasks / 16 cpus per task (2 nodes) - srun - OMP_PLACES=cores and OMP_PROC_BIND=true
Timing for main: time 2019-07-23_06:00:10 on domain   1:   14.27320 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    6.14655 elapsed seconds

GNU compilers - Genoa with 24 tasks / 16 cpus per task (2 nodes) - srun - OMP_PLACES=cores and OMP_PROC_BIND=close
Timing for main: time 2019-07-23_06:00:10 on domain   1:   14.33385 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    6.20183 elapsed seconds

OMP_PLACES=sockets gives bad performance; `threads` is similar to `cores`. OMP_PROC_BIND=master is terribly slow.

GNU compilers - Genoa with 24 tasks / 16 cpus per task (2 nodes) - srun - OMP_PLACES=cores and OMP_PROC_BIND=true and --ntasks-per-core=1 -n $SLURM_NTASKS
Timing for main: time 2019-07-23_06:00:10 on domain   1:   14.58144 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    6.84186 elapsed seconds

GNU compilers - Genoa with 24 tasks / 4 cpus per task (1/2 node) - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain   1:   18.84828 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:   10.00387 elapsed seconds
--> So OpenMP scaling from 4 to 8 cores per task does improve performance (but within a node).

GNU compilers - Genoa with 48 tasks / 8 cpus per task (2 nodes) - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain   1:   12.61810 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    4.67543 elapsed seconds

GNU compilers - Genoa with 2 nodes / 24 tasks per node / 8 cpus per task - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain   1:   12.06588 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    4.35846 elapsed seconds

GNU compilers - Rome with 3 nodes / 16 tasks per node / 8 cpus per task - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain   1:   17.40340 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    7.67457 elapsed seconds
--> Same number of cores but spread over more nodes is slower.

Intel compilers - Genoa with 2 nodes / 24 tasks per node / 8 cpus per task - srun
Failed to launch
Fixed by `export I_MPI_OFI_PROVIDER=mlx` as per [this suggestion](https://community.intel.com/t5/Intel-MPI-Library/Unable-to-run-with-Intel-MPI-on-any-fabric-setting-except-TCP/m-p/1408609)
Very slow; no timing line appeared within the first minute of the job

Intel compilers - Genoa with 2 nodes / 24 tasks per node / 8 cpus per task - srun with KMP_AFFINITY=compact
Very slow; no timing line appeared within the first minute of the job
Peter9192 commented 1 month ago
GNU compilers - Genoa with 3 nodes / 24 tasks per node / 8 cpus per task - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain   1:   11.59579 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    4.05039 elapsed seconds

GNU compilers - Genoa with 4 nodes / 24 tasks per node / 8 cpus per task - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain   1:   11.19707 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    3.87675 elapsed seconds

GNU compilers - Genoa with 2 nodes / 48 tasks per node / 4 cpus per task - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain   1:   12.77392 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    4.69613 elapsed seconds

GNU compilers - Genoa with 1 node / 96 tasks per node / 2 cpus per task - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain   1:   15.96459 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain   1:    7.38751 elapsed seconds
Peter9192 commented 1 month ago

Conclusions so far