Peter9192, 2 months ago:
Some more statistics (after finishing the job, i.e. 1 hour simulated):
Nodes: 1
Cores per node: 128
CPU Utilized: 1-11:25:40
CPU Efficiency: 49.18% of 3-00:02:08 core-walltime
Job Wall-clock time: 00:33:46
Memory Utilized: 51.87 GB
Memory Efficiency: 23.16% of 224.00 GB
So there's still room for improvement :-)
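For reference, these numbers look like the report of Slurm's `seff` utility; a minimal way to reproduce such a report after a job finishes (assuming `seff` is available on the cluster, and with a placeholder job ID) is:

```bash
# Print the CPU/memory efficiency report for a finished job
# (the job ID is a placeholder)
seff 1234567

# Roughly the same information via sacct, in case seff is not installed
sacct -j 1234567 --format=JobID,Elapsed,TotalCPU,NCPUS,MaxRSS,ReqMem
```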
I just realized that I performed the above tests with WRF compiled for dmpar only, so it makes sense that it scaled better. I should try again with compile option 35 (dm+sm) instead of 34 (dmpar).
New set of tests with WRF compiled for DMPAR + SMPAR and the reference setup for high-res Amsterdam
Using the rome partition, with 128 cores per node.
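For context, a minimal sketch of how the `nodes` / `n_tasks` / `cpus_per_task` columns in the table below translate into Slurm directives (partition, walltime and launcher line are placeholders, not the exact script used):

```bash
#!/bin/bash
#SBATCH --partition=rome        # 128 cores per node
#SBATCH --nodes=1               # "nodes" column
#SBATCH --ntasks=16             # "n_tasks" column (number of MPI tasks)
#SBATCH --cpus-per-task=8       # "cpus_per_task" column (OpenMP threads per task)
#SBATCH --time=00:10:00         # placeholder walltime

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpiexec ./wrf.exe               # launcher and extra flags vary per run; see below
```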
nodes | n_tasks | cpus_per_task | Timing for main on D01 after completing 10 seconds | comment |
---|---|---|---|---|
1 | 16 | 8 | 145 s. | |
1 | 8 | 16 | 193 s. / 144 s. | |
1 | 128 | 1 | - | crashed due to too small domain |
1 | 1 | 128 | - | terribly slow - after 5 minutes it had not yet solved 2 seconds of simulation time |
1 | 64 | 2 | 48 s. | |
1/2 | 64 | 1 | 46 s. | with only two threads per task, it's faster to skip multithreading altogether |
4 | 64 | 8 | 68 s. / 56 s. | adding more threads doesn't seem to scale at all |
8 | 64 | 16 | 91 s. | indeed, more threads don't help |
1/4 | 24 | 1 | 95 s. / 86 s. | sharing node; upper range of sweet spot according to https://forum.mmm.ucar.edu/threads/choosing-an-appropriate-number-of-processors.5082/ |
1 | 4 | 32 | 66 s. | using --map-by node:PE=$OMP_NUM_THREADS --rank-by core |
1 | 32 | 4 | 21 s. | ,, |
1 | 64 | 2 | 25 s. | ,, |
2 | 64 | 4 | 18 s. | ,, |
2 | 32 | 8 | 19 s. | ,, |
4 | 32 | 16 | 18 s. | ,, |
4 | 64 | 8 | 21 s. | ,, |
8 | 64 | 16 | 18 s. | ,, |
Note: the above timings are for the first timestep. After that, the simulation speeds up and subsequent timings for main are roughly two times faster, so we're at roughly 1:1 simulated time to run time.
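As a cross-check on the crashed 128-task run and the "sweet spot" mentioned in the table: my reading of the linked forum guidance is that each MPI patch should end up somewhere between roughly 25x25 and 100x100 grid points, so too many tasks make the patches too small to run. A small helper sketch along those lines (the default e_we/e_sn values are placeholders, not the Amsterdam setup):

```bash
#!/bin/bash
# Rough MPI task-count range from the domain size, following the rule of thumb in
# the UCAR forum post linked above (each patch roughly 25x25 to 100x100 grid points).
e_we=${1:-300}   # placeholder west-east dimension
e_sn=${2:-300}   # placeholder south-north dimension

min_tasks=$(( (e_we / 100) * (e_sn / 100) ))
max_tasks=$(( (e_we / 25)  * (e_sn / 25)  ))

echo "Suggested MPI task range for ${e_we}x${e_sn}: ${min_tasks} to ${max_tasks}"
```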
New benchmarks, now also comparing with the Intel compilers and running on Genoa nodes for a change.
GNU compilers (like above) - Genoa with 24 tasks / 8 cpus per task (1 node) - no mapping/binding/ranking
Timing for main: time 2019-07-23_06:00:10 on domain 1: 64.65104 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain 1: 49.12214 elapsed seconds
GNU compilers (like above) Genoa with 24 tasks / 8 cpus per task (1 node) - `--map-by node:PE=$OMP_NUM_THREADS --rank-by core`
Timing for main: time 2019-07-23_06:00:10 on domain 1: 14.50725 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain 1: 6.32626 elapsed seconds
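For completeness, a sketch of how these flags fit on the launcher line inside the job script (only the `--map-by`/`--rank-by` options themselves come from the runs above; the rest is assumed):

```bash
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Give each MPI rank a block of OMP_NUM_THREADS cores and rank by core,
# so the OpenMP threads of a task end up on neighbouring cores.
mpiexec -n $SLURM_NTASKS \
        --map-by node:PE=$OMP_NUM_THREADS \
        --rank-by core \
        ./wrf.exe
```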
GNU compilers (like above) Genoa with 24 tasks / 16 cpus per task (2 nodes) - `--map-by node:PE=$OMP_NUM_THREADS --rank-by core`
Timing for main: time 2019-07-23_06:00:10 on domain 1: 21.65084 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain 1: 10.15081 elapsed seconds
Intel compilers Genoa with 24 tasks / 8 cpus per task (1 node) - no mapping/binding/ranking
Timing for main: time 2019-07-23_06:00:10 on domain 1: 16.87352 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain 1: 7.99886 elapsed seconds
Intel compilers Genoa with 24 tasks / 16 cpus per task (2 nodes) - no extra specifiers
Timing for main: time 2019-07-23_06:00:10 on domain 1: 40.71106 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain 1: 25.24836 elapsed seconds
Intel compilers Genoa with 24 tasks / 16 cpus per task (2 nodes) - `--ppn $((SLURM_NTASKS / SLURM_JOB_NUM_NODES))`
Timing for main: time 2019-07-23_06:00:10 on domain 1: 40.94793 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain 1: 28.25645 elapsed seconds
Intel compilers Genoa with 24 tasks / 16 cpus per task (2 nodes) - `export I_MPI_PIN_DOMAIN=omp`
Timing for main: time 2019-07-23_06:00:10 on domain 1: 33.41033 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain 1: 34.02773 elapsed seconds
Intel compilers Genoa with 24 tasks / 8 cpus per task (1 node) - `export I_MPI_PIN_DOMAIN=omp`
Timing for main: time 2019-07-23_06:00:10 on domain 1: 16.24445 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain 1: 7.98943 elapsed seconds
The Intel build doesn't seem to be much faster out of the box, nor does it scale better without further tuning. It also seems to respond less well to my tweaking attempts with `-ppn` and domain pinning.
Running with srun instead of mpirun/mpiexec might map the job to the hardware automatically and/or better... https://nrel.github.io/HPC/blog/2021-06-18-srun/#3-why-not-just-use-mpiexecmpirun
GNU compilers - Genoa with 24 tasks / 16 cpus per task (2 nodes) - srun
Timing for main: time 2019-07-23_06:00:10 on domain 1: 22.77393 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain 1: 10.48143 elapsed seconds
GNU compilers - Genoa with 24 tasks / 16 cpus per task (2 nodes) - srun - OMP_PLACES=cores and OMP_PROC_BIND=spread
Timing for main: time 2019-07-23_06:00:10 on domain 1: 15.32052 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain 1: 6.33800 elapsed seconds
GNU compilers - Genoa with 24 tasks / 8 cpus per task (1 node) - srun - OMP_PLACES=cores and OMP_PROC_BIND=spread
Timing for main: time 2019-07-23_06:00:10 on domain 1: 14.41277 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain 1: 6.80804 elapsed seconds
GNU compilers - Genoa with 24 tasks / 16 cpus per task (2 nodes) - srun - OMP_PLACES=cores and OMP_PROC_BIND=true
Timing for main: time 2019-07-23_06:00:10 on domain 1: 14.27320 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain 1: 6.14655 elapsed seconds
GNU compilers - Genoa with 24 tasks / 16 cpus per task (2 nodes) - srun - OMP_PLACES=cores and OMP_PROC_BIND=close
Timing for main: time 2019-07-23_06:00:10 on domain 1: 14.33385 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain 1: 6.20183 elapsed seconds
`OMP_PLACES=sockets` gives bad performance; `threads` is similar to `cores`. `OMP_PROC_BIND=master` is terribly slow.
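In script form, the environment for these srun runs looks roughly like this (a sketch; only the OMP_* settings and the use of srun come from the runs above):

```bash
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PLACES=cores      # pin threads to physical cores
export OMP_PROC_BIND=true    # 'spread' and 'close' gave nearly identical timings

srun ./wrf.exe
```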
GNU compilers - Genoa with 24 tasks / 16 cpus per task (2 nodes) - srun - OMP_PLACES=cores and OMP_PROC_BIND=true and --ntasks-per-core=1 -n $SLURM_NTASKS
Timing for main: time 2019-07-23_06:00:10 on domain 1: 14.58144 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain 1: 6.84186 elapsed seconds
GNU compilers - Genoa with 24 tasks / 4 cpus per task (1/2 node) - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain 1: 18.84828 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain 1: 10.00387 elapsed seconds
--> So going from 4 to 8 OpenMP threads per task does improve performance (at least within a node).
GNU compilers - Genoa with 48 tasks / 8 cpus per task (2 nodes) - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain 1: 12.61810 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain 1: 4.67543 elapsed seconds
GNU compilers - Genoa with 2 nodes / 24 tasks per node / 8 cpus per task - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain 1: 12.06588 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain 1: 4.35846 elapsed seconds
GNU compilers - Rome with 3 nodes / 16 tasks per node / 8 cpus per task - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain 1: 17.40340 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain 1: 7.67457 elapsed seconds
--> Same number of cores but spread over more nodes is slower.
Intel compilers - Genoa with 2 nodes / 24 tasks per node / 8 cpus per task - srun
Failed to launch
Fixed by `export I_MPI_OFI_PROVIDER=mlx` as per [this suggestion](https://community.intel.com/t5/Intel-MPI-Library/Unable-to-run-with-Intel-MPI-on-any-fabric-setting-except-TCP/m-p/1408609)
Very slow; no timing output within the one-minute job.
Intel compilers - Genoa with 2 nodes / 24 tasks per node / 8 cpus per task - srun with KMP_AFFINITY=compact
Very slow; no timing output within the one-minute job.
GNU compilers - Genoa with 3 nodes / 24 tasks per node / 8 cpus per task - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain 1: 11.59579 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain 1: 4.05039 elapsed seconds
GNU compilers - Genoa with 4 nodes / 24 tasks per node / 8 cpus per task - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain 1: 11.19707 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain 1: 3.87675 elapsed seconds
GNU compilers - Genoa with 2 nodes / 48 tasks per node / 4 cpus per task - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain 1: 12.77392 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain 1: 4.69613 elapsed seconds
GNU compilers - Genoa with 1 node / 96 tasks per node / 2 cpus per task - srun - OMP_PLACES=cores
Timing for main: time 2019-07-23_06:00:10 on domain 1: 15.96459 elapsed seconds
Timing for main: time 2019-07-23_06:00:20 on domain 1: 7.38751 elapsed seconds
Conclusions so far:
- mpiexec: use `--map-by node:PE=$OMP_NUM_THREADS --rank-by core`
- srun: use `export OMP_PLACES=cores`
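Putting one of the best-performing combinations from the runs above into a single job script, as a sketch (partition, walltime and the path to wrf.exe are placeholders):

```bash
#!/bin/bash
#SBATCH --partition=genoa
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
#SBATCH --cpus-per-task=8
#SBATCH --time=01:00:00

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PLACES=cores
export OMP_PROC_BIND=true

srun ./wrf.exe
```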
I was looking into the different options for parallelizing WRF; I always find it quite confusing. From what I understand now, it works like this: each MPI task gets one patch of the domain, and with smpar on top each patch is further split into tiles that are processed by OpenMP threads. The reason to use MPI is that each patch has fewer grid cells to process, i.e. shorter loops, i.e. faster execution. However, the overhead for communication between patches increases with the number of patches: compute per patch scales with its area while halo exchange scales with its perimeter, so smaller patches spend relatively more time communicating.
From this, I constructed the following test script:
Then, I did a few small sensitivity tests with my current test case. Here are the results: