ACCESS-NRI / access-esm1.5-configs

Standard ACCESS-ESM1.5 configurations released and supported by ACCESS-NRI

Performance scaling #94

Open MartinDix opened 1 month ago

MartinDix commented 1 month ago

From Spencer's 20240827-release-preindustrial+concentrations-run with 192 atmosphere, 180 ocean, 16 ice PEs

Walltime from UM atm.fort6.pe0 files

[Image: esm_time]

The difference between the PBS walltime and the UM walltime is around 60 s. Variation in this overhead will make the scaling analysis trickier.
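As a rough sketch of how the two times can be compared (assuming atm.fort6.pe0 contains a "Maximum Elapsed Wallclock Time:" line and the PBS log a "Walltime Used: HH:MM:SS" line; the file paths below are placeholders):

```python
# Sketch: compare the UM-reported walltime with the PBS walltime for one run.
import re
from pathlib import Path

def um_walltime(fort6_path):
    """Return the UM 'Maximum Elapsed Wallclock Time' in seconds."""
    text = Path(fort6_path).read_text(errors="ignore")
    return float(re.search(r"Maximum Elapsed Wallclock Time:\s+([\d.]+)", text).group(1))

def pbs_walltime(pbs_log_path):
    """Return the PBS 'Walltime Used' in seconds."""
    text = Path(pbs_log_path).read_text(errors="ignore")
    h, m, s = re.search(r"Walltime Used:\s+(\d+):(\d+):(\d+)", text).groups()
    return 3600 * int(h) + 60 * int(m) + int(s)

pbs = pbs_walltime("run001/job.o12345")    # hypothetical PBS log path
um = um_walltime("run001/atm.fort6.pe0")   # hypothetical UM output path
print(f"PBS {pbs:.0f} s, UM {um:.0f} s, overhead {pbs - um:.0f} s")
```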

MartinDix commented 1 month ago

Similarly for the 20240823-dev-historical-concentrations-run:

[Image: esm_time]

This seems to be system related, because both runs got faster at about the same time (around 31 Aug).

MartinDix commented 3 weeks ago

From the internal UM timer diagnostics, e.g.

 Maximum Elapsed Wallclock Time:    4347.89442100020
 Speedup:    191.985828730169
 --------------------------------------------
                              Non-Inclusive Timer Summary for PE    0
   ROUTINE              CALLS  TOT CPU    AVERAGE   TOT WALL  AVERAGE  % CPU    % WALL    SPEED-UP
  1 ATM_STEP             ****    836.89      0.05    836.93      0.05   19.25   19.25      1.00
  2 Convect              ****    523.78      0.03    523.78      0.03   12.05   12.05      1.00
  3 SL_tracer2           ****    469.82      0.03    469.83      0.03   10.81   10.81      1.00
  4 SL_tracer1           ****    445.61      0.03    445.62      0.03   10.25   10.25      1.00
...
Calculated standard deviations of each routine's wall time across the runs (see the sketch after the table). Sorted by standard deviation:

 Routine          SD (s)   Total time (s)
 TOTAL             173.8     4452
 SFEXCH             94.6      182
 Atmos_Physics2     46.3      146
 PUTO2A_COMM        15.1       31
 ATM_STEP           12.9      829
 U_MODEL            11.4       18
 GETO2A_COMM         7.4       55
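A minimal sketch of this calculation, assuming each run's "Non-Inclusive Timer Summary for PE 0" has already been parsed into a dict mapping routine name to total wall time (the parsing itself is not shown and the variable names are placeholders):

```python
# Sketch: per-routine wallclock standard deviation across a set of runs.
from statistics import mean, stdev

def routine_variability(runs):
    """Return (routine, SD, mean total wall time) sorted by SD, largest first."""
    routines = set().union(*runs)
    rows = []
    for name in routines:
        times = [run.get(name, 0.0) for run in runs]
        rows.append((name, stdev(times), mean(times)))
    return sorted(rows, key=lambda row: row[1], reverse=True)

# rows = routine_variability(runs)   # runs: one {routine: wall time} dict per run
# for name, sd, avg in rows[:7]:
#     print(f"{name:<16s} SD {sd:7.1f} s   total {avg:7.0f} s")
```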

It is notable that most of the standard deviation comes from just two relatively inexpensive routines. Excluding these removes most of the variation:

[Image: esm_time2]

SFEXCH includes boundary layer calculations without any communication, so it's a mystery why it should be so variable.
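For reference, the "excluding these" curve above can be reproduced with a small sketch, reusing the per-run dicts from the previous sketch and assuming each dict includes a TOTAL entry as in the table:

```python
# Sketch: total walltime per run with the two most variable routines removed,
# to show how much of the spread they explain.
from statistics import stdev

def adjusted_totals(runs, exclude=("SFEXCH", "Atmos_Physics2")):
    """Per-run total wall time minus the excluded routines' contributions."""
    return [run["TOTAL"] - sum(run.get(name, 0.0) for name in exclude) for run in runs]

# stdev(adjusted_totals(runs)) should be much smaller than
# stdev([run["TOTAL"] for run in runs]).
```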

blimlim commented 2 weeks ago

The routines U_MODEL and PUTO2A_COMM also contribute to the variation, though less so:

[Image: Slow routines hist]

In the pre-industrial simulation, however, U_MODEL had a larger contribution to the total walltime variation:

[Image: Slow routines PI]

Other typically expensive routines show much less variation:

[Image: Stable routines hist]

Slowdowns of the high-variability routines are fairly consistent across all the processors, according to the "MPP : None Inclusive timer summary" WALLCLOCK TIMES table, where there's typically a ~4 s difference between the min and max time for SFEXCH:

[Image: Historical run SFEXCH]

U_MODEL is less consistent (a ~30 s difference between max and min), though the max and min times follow similar variations:

[Image: Historical run U_MODEL]
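A minimal sketch of the per-PE spread check, assuming the per-PE wallclock times for one routine have already been pulled out of the WALLCLOCK TIMES table into a list (the parsing itself and the variable name are placeholders):

```python
# Sketch: min-to-max spread of a routine's wall time across PEs within one run.
def pe_spread(per_pe_times):
    """Return (min, max, max - min) of a routine's wall time across PEs."""
    return min(per_pe_times), max(per_pe_times), max(per_pe_times) - min(per_pe_times)

# e.g. pe_spread(sfexch_times_per_pe)  ->  typically a ~4 s spread for SFEXCH
```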

blimlim commented 2 weeks ago

Repeating a simulation of a single year, with the runs broken down into month-long segments, also shows large variation across the different simulations and months. Some individual months take almost double the time of others:

[Image: total walltime]

The same four routines contribute significantly to the variations, especially U_MODEL. At this smaller scale, the four routines' times don't correlate with each other as clearly:

[Image: routines month]

and in this case there's still significant variation after subtracting these routines' times from the total walltime:

[Image: total walltime minus routines]

blimlim commented 2 weeks ago

Occasional extreme cases can occur. The following results are from a month-long simulation configured to write separate UM output each day, run on 2024-10-04. The exact same simulation was then repeated on 2024-10-08:

2024-10-04:

Walltime Used: 00:21:16    

2024-10-08:

Walltime Used: 00:07:14  
The differences in mean routine times (averaged across the PEs) between the two runs are:

 Routine          Time difference (s)
 SFEXCH                 251.73
 U_MODEL                198.00
 Atmos_Physics2         160.68
 ATM_STEP                83.77
 PUTO2A_COMM             69.58
 Q_Pos_Ctl               39.90
 GETO2A_COMM             13.09
 SW Rad                   7.97
 INCRTIME                 3.01
 Diags                    2.27
 INITIAL                  2.27
 PPCTL_REINIT             1.06
 MICROPHYS_CTL            0.16
 DUMPCTL                  0.13
 SF_IMPL                  0.13
 INITDUMP                 0.07
 PHY_DIAG                 0.05
 PE_Helmholtz             0.03
 UP_ANCIL                 0.02
 KMKHZ                    0.02
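A rough sketch of how such a comparison could be computed, assuming each run's timer output has been parsed into a dict mapping routine name to a list of per-PE wall times (the run variable names below are placeholders):

```python
# Sketch: per-routine difference in mean (over PEs) wall time between two runs.
from statistics import mean

def mean_time_differences(slow_run, fast_run):
    """Return (routine, mean slow - mean fast), largest difference first."""
    shared = set(slow_run) & set(fast_run)
    diffs = [(name, mean(slow_run[name]) - mean(fast_run[name])) for name in shared]
    return sorted(diffs, key=lambda row: row[1], reverse=True)

# for name, diff in mean_time_differences(run_2024_10_04, run_2024_10_08):
#     print(f"{name:<16s} {diff:8.2f} s")
```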

Plots of the same differences:

[Image: slow_routines]

[Image: fast_routines]

It's mostly the same routines as before causing the slowdown. However, ATM_STEP and Q_Pos_Ctl are both much slower in the 2024-10-04 run, while they were both fairly stable in the longer historical simulation.

blimlim commented 1 week ago

These variations make it hard to draw strong conclusions from scaling tests. In 5-year simulations with various atmosphere decompositions, providing an extra node to the atmosphere and using the atm.240.X16.Y15 decomposition appears to decrease the run length by ~10% with very little impact on the SU usage:

[Image: 5 Runs atmosphere scaling]

and cuts down the significant ocean coupling wait time:

[Image: coupling wait times]
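As a back-of-envelope check of why the SU usage barely changes, assuming SUs scale with total cores × walltime (queue charge rates and PBS node rounding are ignored here), and using the core counts from the two decompositions discussed below:

```python
# Sketch: rough relative SU usage of the larger atmosphere decomposition.
default_cores = 192 + 180 + 12    # atm.192.X16.Y12-ocn.180-ice.12
candidate_cores = 240 + 180 + 12  # atm.240.X16.Y15-ocn.180-ice.12
relative_walltime = 0.9           # ~10% faster, per the 5-year scaling runs above

relative_su = (candidate_cores * relative_walltime) / default_cores
print(f"Relative SU usage: {relative_su:.2f}")   # ~1.01, i.e. essentially unchanged
```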

However, when extending the atm.240.X16.Y15-ocn.180-ice.12 run to 20 years, there were large variations in the walltimes, with the worst years being slower than the slowest times for the default atm.192.X16.Y12-ocn.180-ice.12 decomposition:

[Image]

The same routines were again implicated in the slowdowns:

[Image: atm_240_routine_times]

It's hard to tell whether adding more processors to the atmosphere somehow made the timing inherently more unstable, or whether that run was unlucky and was impacted by unknown external factors. There are some simultaneous runs of each decomposition currently underway, which will hopefully help us understand this better.

blimlim commented 1 day ago

Running amip simulations with 192 (standard) and 240 processors, the runs appeared much more stable, with only a single jump occurring for the 192 PE simulation.

[Image]

The jump again corresponded to increased time in the Atmos_Physics2 and SFEXCH routines; however, the other routines usually implicated seemed fairly stable.

[Image]

It's unclear whether the amip configuration is inherently more stable than the coupled configuration, or if system conditions at the time happened to make these runs more stable.


To test this, we tried running two amip and two coupled simulations at the same time (192 and 240 atmosphere PEs in each case, with 16 PEs in the x direction and 15 in the y direction for the 240 PE case). Each configuration was run one year at a time, with each year set off at the same time for every configuration. Due to a mistake, the first three years of the 240 PE configurations were out of sync, and one year's PBS logs of the 192 PE coupled run went missing, but apart from that the runs were in sync.

Very few instabilities occurred in any of the runs:

[Image]

[Image]

Only one year of the 240 PE coupled run jumped up; however, this was due to the atmosphere waiting for the ocean, and appears to have a different cause from the instabilities we'd seen so far:

[Image]

Based on these simulations alone, using 240 processors with the x:16, y:15 decomposition looks favourable for the default configuration: a ~10% drop in walltime for practically no difference in SUs. However, it's again hard to make any concrete recommendations, given the instabilities that have appeared in our other tests.