Open MartinDix opened 1 month ago
Similarly for the 20240823-dev-historical-concentrations-run
This seems somehow system related because both got faster at about the same time ~ Aug 31.
From the internal UM timer diagnostics, e.g.
Maximum Elapsed Wallclock Time: 4347.89442100020
Speedup: 191.985828730169
--------------------------------------------
Non-Inclusive Timer Summary for PE 0
ROUTINE CALLS TOT CPU AVERAGE TOT WALL AVERAGE % CPU % WALL SPEED-UP
1 ATM_STEP **** 836.89 0.05 836.93 0.05 19.25 19.25 1.00
2 Convect **** 523.78 0.03 523.78 0.03 12.05 12.05 1.00
3 SL_tracer2 **** 469.82 0.03 469.83 0.03 10.81 10.81 1.00
4 SL_tracer1 **** 445.61 0.03 445.62 0.03 10.25 10.25 1.00
...
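For comparing many runs, the rows of this summary can be pulled out programmatically. The following is a minimal sketch, assuming the rows always have the layout shown above (rank, routine name, call count, then seven numeric columns). The function name `parse_timer_rows` and the dictionary keys are hypothetical, not part of the UM.

```python
import re

# One row of the non-inclusive timer summary: rank, routine name (may contain
# spaces, e.g. "SW Rad"), call count (shown as asterisks in the output above),
# then the seven numeric columns.
ROW = re.compile(
    r"^\s*(\d+)\s+(.+?)\s+(\*+|\d+)"
    r"\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)"
    r"\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$"
)

def parse_timer_rows(text):
    """Return {routine: {"tot_cpu", "tot_wall", "pct_wall"}} for matching rows."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m is None:
            continue  # header, separator, or elided ("...") line
        values = [float(v) for v in m.groups()[3:]]
        rows[m.group(2)] = {
            "tot_cpu": values[0],
            "tot_wall": values[2],
            "pct_wall": values[5],
        }
    return rows

sample = """\
ROUTINE CALLS TOT CPU AVERAGE TOT WALL AVERAGE % CPU % WALL SPEED-UP
1 ATM_STEP **** 836.89 0.05 836.93 0.05 19.25 19.25 1.00
2 Convect **** 523.78 0.03 523.78 0.03 12.05 12.05 1.00
3 SL_tracer2 **** 469.82 0.03 469.83 0.03 10.81 10.81 1.00
4 SL_tracer1 **** 445.61 0.03 445.62 0.03 10.25 10.25 1.00
...
"""
rows = parse_timer_rows(sample)
print(rows["ATM_STEP"]["tot_wall"])
```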
Calculated standard deviations of the routines. Sorted values are:

| Routine | SD | Total time |
|---|---|---|
| TOTAL | 173.8 | 4452 |
| SFEXCH | 94.6 | 182 |
| Atmos_Physics2 | 46.3 | 146 |
| PUTO2A_COMM | 15.1 | 31 |
| ATM_STEP | 12.9 | 829 |
| U_MODEL | 11.4 | 18 |
| GETO2A_COMM | 7.4 | 55 |
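A ranking like this can be produced from per-run routine times roughly as follows. This is only a sketch: it assumes the sample standard deviation and reports the mean time per routine, which may not be exactly how the table above was built. The function name `rank_by_sd` and the input numbers are made up for illustration and are not from these runs.

```python
from statistics import mean, stdev

def rank_by_sd(runs):
    """runs: list of {routine: wall time in s}, one dict per run.

    Returns (routine, sample SD, mean time) tuples, largest SD first.
    """
    common = set.intersection(*(set(r) for r in runs))
    table = []
    for name in common:
        times = [r[name] for r in runs]
        table.append((name, stdev(times), mean(times)))
    return sorted(table, key=lambda row: row[1], reverse=True)

# Hypothetical per-run times, for illustration only.
runs = [
    {"SFEXCH": 100.0, "ATM_STEP": 830.0},
    {"SFEXCH": 300.0, "ATM_STEP": 820.0},
    {"SFEXCH": 200.0, "ATM_STEP": 840.0},
]
ranked = rank_by_sd(runs)
for name, sd, avg in ranked:
    print(f"{name:10s} SD={sd:6.1f} mean={avg:6.1f}")
```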
Notable that most of the standard deviation comes from just two relatively inexpensive routines. Excluding these removes most of the variation.
SFEXCH includes boundary layer calculations w/o any communications so it's a mystery why it should be so variable.
The routines `U_MODEL` and `PUTO2A_COMM` also contribute to the variation, though less so:
In the pre-industrial simulation, however, `U_MODEL` had a larger contribution to the total walltime variation:
Other typically expensive routines show much less variation:
Slowdowns of the high variability routines are fairly consistent across all the processors, according to the `MPP : None Inclusive timer summary WALLCLOCK TIMES` table, where there's typically a ~4s difference between the min and max time for `SFEXCH`:
`U_MODEL` is less consistent (~30s difference between max and min), though the max and min times follow similar variations:
Repeating a simulation of a single year, with runs broken down into month long segments, also has large variation across the different simulations and different months. Some individual months take almost double the time of others.
The same four routines contribute significantly to the variations, especially `U_MODEL`. At this smaller scale, the four routines' times don't correlate as clearly with each other.
In this case, there's still significant variation after subtracting these routines' times from the total walltime:
Occasional extreme cases can occur. The following results were for a month long simulation configured to write separate UM output each day, run on 2024-10-04. The exact same simulation was then repeated on 2024-10-08:
2024-10-04:
Walltime Used: 00:21:16
2024-10-08:
Walltime Used: 00:07:14
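To put the two PBS walltimes on a common scale, here is a small sketch that converts the `HH:MM:SS` strings above to seconds and compares them. The helper name `walltime_seconds` is hypothetical.

```python
def walltime_seconds(hms):
    """Convert a PBS-style 'HH:MM:SS' walltime string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return 3600 * hours + 60 * minutes + seconds

slow = walltime_seconds("00:21:16")  # 2024-10-04 run
fast = walltime_seconds("00:07:14")  # 2024-10-08 run
print(slow, fast, slow - fast, round(slow / fast, 2))
# prints: 1276 434 842 2.94
```

So the 2024-10-04 run took 842 s longer, nearly three times the walltime of the repeat.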
Differences in mean routine times across the PEs are:

| Routine | Time difference (s) |
|---|---|
| SFEXCH | 251.73 |
| U_MODEL | 198.00 |
| Atmos_Physics2 | 160.68 |
| ATM_STEP | 83.77 |
| PUTO2A_COMM | 69.58 |
| Q_Pos_Ctl | 39.90 |
| GETO2A_COMM | 13.09 |
| SW Rad | 7.97 |
| INCRTIME | 3.01 |
| Diags | 2.27 |
| INITIAL | 2.27 |
| PPCTL_REINIT | 1.06 |
| MICROPHYS_CTL | 0.16 |
| DUMPCTL | 0.13 |
| SF_IMPL | 0.13 |
| INITDUMP | 0.07 |
| PHY_DIAG | 0.05 |
| PE_Helmholtz | 0.03 |
| UP_ANCIL | 0.02 |
| KMKHZ | 0.02 |
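As a rough consistency check, the listed differences can be summed and compared with the 842 s gap between the two PBS walltimes (00:21:16 vs 00:07:14). This sketch assumes the tabulated times are non-inclusive, so that summing them is meaningful. The variable names are illustrative only.

```python
# Time differences (s) copied from the table above.
diffs = {
    "SFEXCH": 251.73, "U_MODEL": 198.00, "Atmos_Physics2": 160.68,
    "ATM_STEP": 83.77, "PUTO2A_COMM": 69.58, "Q_Pos_Ctl": 39.90,
    "GETO2A_COMM": 13.09, "SW Rad": 7.97, "INCRTIME": 3.01,
    "Diags": 2.27, "INITIAL": 2.27, "PPCTL_REINIT": 1.06,
    "MICROPHYS_CTL": 0.16, "DUMPCTL": 0.13, "SF_IMPL": 0.13,
    "INITDUMP": 0.07, "PHY_DIAG": 0.05, "PE_Helmholtz": 0.03,
    "UP_ANCIL": 0.02, "KMKHZ": 0.02,
}

# PBS walltime gap: 00:21:16 minus 00:07:14, in seconds.
pbs_gap = (21 * 60 + 16) - (7 * 60 + 14)

listed = sum(diffs.values())
top3 = sum(sorted(diffs.values(), reverse=True)[:3])

print(f"listed routines: {listed:.2f} s of {pbs_gap} s ({listed / pbs_gap:.1%})")
print(f"top three routines: {top3:.2f} s ({top3 / pbs_gap:.1%})")
```

Under that assumption, the listed routines account for almost all of the walltime gap, and the top three alone for more than 70% of it.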
Plots of the same thing:
It's mostly the same routines from before causing the slowdown. However, `ATM_STEP` and `Q_Pos_Ctl` are both much slower in the 2024-10-04 run, while they were both pretty stable in the longer historical simulation.
These variations make it hard to draw strong conclusions from scaling tests. In 5 year simulations with various atmosphere decompositions, providing an extra node to the atmosphere and using the decomposition `atm.240.X16.Y15` appears to decrease the run length by ~10% with very little impact on the SU usage:
and cuts down the significant ocean coupling wait time:
However, when extending the `atm.240.X16.Y15-ocn.180-ice.12` run to 20 years, there were massive variations in the walltimes, with the worst years being slower than the slowest times for the default `atm.192.X16.Y12-ocn.180-ice.12` decomposition.
The same routines were again implicated in the slowdowns:
It's hard to tell whether adding more processors to the atmosphere somehow made the timing inherently more unstable, or if that run was unlucky and was impacted by unknown external factors. There are some simultaneous runs of each decomposition currently going, which will hopefully help us understand this better.
Running amip simulations with 192 (standard) and 240 processors, the runs appeared much more stable, with only a single jump occurring for the 192 pe simulation.
The jump again corresponded to increased time in the `Atmos_Physics2` and `SFEXCH` routines, however the other routines usually implicated seemed fairly stable.
It's unclear whether the amip configuration is inherently more stable than the coupled configuration, or if system conditions at the time happened to make these runs more stable.
To test this, we tried running two amip and two coupled simulations at the same time (192 and 240 atm pes in each case, with 16 pes in the x direction and 15 in the y direction for the 240 pe case). Each configuration was run one year at a time, with each year set off at the same time for every configuration. Due to a mistake, the first three years of the 240 pe configurations were out of sync, and one year's PBS logs of the 192 coupled run went missing somehow. Apart from that, the runs were in sync.
Very few instabilities occurred in any of the runs:
Only one year of the 240pe coupled run jumped up, however this was due to the atmosphere waiting for the ocean, and appears to have a different cause than the instabilities we'd seen so far:
Based on these simulations alone, using 240 processors with the x:16, y:15 decomposition looks favourable for the default configuration. We get an ~10% drop in walltime for practically no difference in SUs. However it's again hard to make any concrete recommendations due to the instabilities that have appeared in our other tests.
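The SU trade-off can be checked with simple arithmetic. This sketch assumes SU cost scales with total PE count times walltime (same queue and charge rate for both runs, which is an assumption), and uses the PE counts from the two decomposition names quoted in this thread.

```python
def total_pes(atm, ocn, ice):
    """Total PEs requested for a coupled run."""
    return atm + ocn + ice

default = total_pes(192, 180, 12)  # atm.192.X16.Y12-ocn.180-ice.12
larger = total_pes(240, 180, 12)   # atm.240.X16.Y15-ocn.180-ice.12

walltime_ratio = 0.90  # the ~10% walltime drop reported above

# Assumption: SUs are proportional to PEs * walltime.
su_ratio = (larger / default) * walltime_ratio
break_even = default / larger  # walltime ratio at which SUs are unchanged

print(default, larger)
print(f"SU change: {(su_ratio - 1) * 100:+.2f}%")
print(f"break-even walltime reduction: {(1 - break_even) * 100:.1f}%")
```

With 12.5% more PEs, the walltime has to drop by about 11% for SUs to break even, so a ~10% drop costs only about 1% more SUs.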
From Spencer's 20240827-release-preindustrial+concentrations-run with 192 atmosphere, 180 ocean, 16 ice PEs
Walltime from UM `atm.fort6.pe0` files:
The difference between the PBS time and the UM time is around 60 s. Variation here will make scaling analysis trickier.
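A sketch of how the PBS/UM gap could be tracked per run is shown below. It assumes the UM log contains a `Maximum Elapsed Wallclock Time:` line like the one quoted at the top of this issue. The function name `um_elapsed_seconds` is hypothetical, and the PBS walltime used here is a made-up value for illustration, not taken from a real log.

```python
import re

def um_elapsed_seconds(fort6_text):
    """Return the UM 'Maximum Elapsed Wallclock Time' in seconds, or None."""
    m = re.search(r"Maximum Elapsed Wallclock Time:\s*([\d.]+)", fort6_text)
    return float(m.group(1)) if m else None

# Line format as quoted earlier in this issue.
fort6_text = "Maximum Elapsed Wallclock Time: 4347.89442100020\n"

# Hypothetical PBS walltime in seconds, for illustration only.
pbs_seconds = 4408

um_seconds = um_elapsed_seconds(fort6_text)
overhead = pbs_seconds - um_seconds
print(f"UM: {um_seconds:.1f} s, PBS: {pbs_seconds} s, overhead: {overhead:.1f} s")
```

Recording this overhead for each run would show how much of the walltime variation sits outside the UM timers.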