Closed penguian closed 7 years ago
@martin.dix@anu.edu.au changed status from assigned to accepted
@martin.dix@anu.edu.au set owner to mrd599
@martin.dix@anu.edu.au commented
Back at vn7.3, using the UM climate means with the Gregorian calendar required saving a daily dump file. This was an unacceptable overhead, so I modified the code to calculate monthly means outside the normal climate mean code. This worked for CMIP5 but didn't have the full capabilities of the UM code (e.g. more recent UM AMIP jobs calculate monthly means of fields sampled at 0000, 0300, 0600 etc. to get a mean diurnal cycle, which my code couldn't do).
UM vn9.2 introduced the ability to calculate climate means and only save a dump at the end of the month. However the model still writes a partial sum file every day. This isn't as large as a dump file but still requires gathering every field that's being averaged to one processor for writing. Worse, this code is outside the normal STASH system so the IO server can't help.
The overhead of this can be over 10%. Timer statistics from a month of u-aj458-gregorian, a GA7.1 AMIP run on 256 cores:
Maximum Elapsed Wallclock Time: 3961.69

| | ROUTINE | CALLS | TOT CPU | AVERAGE | TOT WALL | AVERAGE |
|---|---|---|---|---|---|---|
| 1 | U_MODEL_4A | 1 | 3942.91 | 3942.91 | 3960.33 | 3960.33 |
| 2 | Atm_Step_4A (AS) | 2160 | 3340.90 | 1.55 | 3350.36 | 1.55 |
| 3 | AS Atmos_Phys1 (AP1) | 2160 | 1413.25 | 0.65 | 1413.39 | 0.65 |
| 4 | UKCA_MAIN1 | 2160 | 585.67 | 0.27 | 593.72 | 0.27 |
| 5 | AP1 Radiation (AP1R) | 2160 | 521.65 | 0.24 | 521.70 | 0.24 |
| 6 | MEANCTL | 30 | 495.03 | 16.50 | 499.85 | 16.66 |

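
As a quick check of the "over 10%" figure, the MEANCTL share of the run can be worked out directly from the timer output above. This is only a small arithmetic sketch; the variable names are mine, and the values are copied from the MEANCTL row and the Maximum Elapsed Wallclock Time line.

```python
# Values copied from the timer output above (units as reported by the UM timer).
total_wall = 3961.69     # Maximum Elapsed Wallclock Time
meanctl_wall = 499.85    # TOT WALL for MEANCTL
meanctl_calls = 30       # one call per model day

overhead = meanctl_wall / total_wall      # fraction of the run spent in MEANCTL
per_call = meanctl_wall / meanctl_calls   # wallclock cost of each daily call

print(f"MEANCTL share of wallclock: {overhead:.1%}")
print(f"MEANCTL wallclock per call: {per_call:.2f}")
```

This gives a share of about 12.6%, consistent with the AVERAGE column value of 16.66 per call.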
These partial sum files are probably unnecessary but getting rid of them completely isn't simple. However with a minimal code change each processor can write its own local version of these and so skip the gather. This removes almost all the overhead.
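
The idea of processor-local partial sums can be illustrated with a toy simulation. This is not the UM code (which is Fortran in meanctl.F90 and runs under MPI); it is a minimal Python sketch in which ranks are simulated with a loop, and the file naming, the `psum` base name, the helper functions and the data are all hypothetical. It shows that if each rank accumulates only its own subdomain in its own file, the mean formed at the end matches the one from a gathered global sum.

```python
import os
import struct
import tempfile

def psum_path(directory, base, rank):
    # Hypothetical naming scheme: one partial-sum file per processor rank.
    return os.path.join(directory, f"{base}_{rank:04d}")

def accumulate_local(directory, base, rank, field):
    """Add this rank's local field to its own partial-sum file."""
    path = psum_path(directory, base, rank)
    if os.path.exists(path):
        with open(path, "rb") as f:
            data = f.read()
        previous = struct.unpack(f"{len(data) // 8}d", data)
        field = [p + x for p, x in zip(previous, field)]
    with open(path, "wb") as f:
        f.write(struct.pack(f"{len(field)}d", *field))

def local_field(day, rank, npoints):
    # Made-up data standing in for one rank's subdomain of a model field.
    return [float(day + 10 * rank + i) for i in range(npoints)]

nranks, npoints, ndays = 4, 3, 5

with tempfile.TemporaryDirectory() as workdir:
    # Daily step: every rank updates only its own file, so nothing is gathered.
    for day in range(1, ndays + 1):
        for rank in range(nranks):
            accumulate_local(workdir, "psum", rank, local_field(day, rank, npoints))

    # End of period: read the per-rank sums back and form the mean.
    local_means = []
    for rank in range(nranks):
        with open(psum_path(workdir, "psum", rank), "rb") as f:
            data = f.read()
        sums = struct.unpack(f"{len(data) // 8}d", data)
        local_means.extend(s / ndays for s in sums)

# Reference: the same mean computed from a single gathered (global) sum.
global_sum = [0.0] * (nranks * npoints)
for day in range(1, ndays + 1):
    gathered = [x for rank in range(nranks) for x in local_field(day, rank, npoints)]
    global_sum = [s + x for s, x in zip(global_sum, gathered)]
reference_means = [s / ndays for s in global_sum]

print(local_means == reference_means)
```

The point of the design is that the daily accumulation is purely local, so the only step that needs the full field is the final mean.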
Maximum Elapsed Wallclock Time: 3539.27

| | ROUTINE | CALLS | TOT CPU | AVERAGE | TOT WALL | AVERAGE | SPEED-UP |
|---|---|---|---|---|---|---|---|
| 1 | U_MODEL_4A | 1 | 3516.50 | 3516.50 | 3536.87 | 3536.87 | 0.99 |
| 2 | Atm_Step_4A (AS) | 2160 | 3351.15 | 1.55 | 3361.98 | 1.56 | 1.00 |
| 3 | AS Atmos_Phys1 (AP1) | 2160 | 1417.10 | 0.66 | 1416.00 | 0.66 | 1.00 |
| 4 | UKCA_MAIN1 | 2160 | 588.69 | 0.27 | 600.75 | 0.28 | 0.98 |
| 5 | AP1 Radiation (AP1R) | 2160 | 521.46 | 0.24 | 521.27 | 0.24 | 1.00 |
| 6 | UKCA AEROSOL MODEL | 720 | 474.48 | 0.66 | 474.25 | 0.66 | 1.00 |
| ... | | | | | | | |
| 32 | MEANCTL | 30 | 6.29 | 0.21 | 8.36 | 0.28 | 0.75 |

The monthly mean file is identical to the standard run.
In one-month coupled model runs (UM using 384 cores) the benefit was even larger, though perhaps some of this is run-to-run timing variability.

| Run | Elapsed Wallclock Time |
|---|---|
| Standard | 3137.99 |
| New | 2513.36 |

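
To put the two experiments on the same footing, the relative saving can be computed from the wallclock times quoted in this comment. This is a small arithmetic sketch only; the dictionary labels are mine, and the numbers are the standard and new elapsed times for the AMIP and coupled runs.

```python
# (standard, new) elapsed wallclock times quoted in this comment.
runs = {
    "AMIP, 256 cores": (3961.69, 3539.27),
    "coupled, UM on 384 cores": (3137.99, 2513.36),
}

savings = {}
for name, (standard, new) in runs.items():
    savings[name] = (standard - new) / standard
    print(f"{name}: {standard - new:.2f} saved ({savings[name]:.1%})")
```

This works out to roughly 10.7% for the AMIP run and 19.9% for the coupled run, with the caveat already noted that some of the coupled difference may be timing variability.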
I also tried writing the files to /jobfs rather than /short but this is actually slightly slower for some reason.
Code branch is branches/dev/martindix/vn10.6_gregorian_climate_means. Trac view of changes: https://code.metoffice.gov.uk/trac/um/changeset?reponame=&new=38760%40main%2Fbranches%2Fdev%2Fmartindix%2Fvn10.6_gregorian_climate_means%2Fsrc&old=29678%40main%2Ftrunk%2Fsrc
meanctl.F90 has
LOGICAL, PARAMETER :: local_acumps = .true., use_jobfs = .false.
These could be made into namelist variables if necessary.
The only suite change strictly required is to add branches/dev/martindix/vn10.6_gregorian_climate_means@38760 to the um_sources list in app/fcm_make_um/rose-app.conf. However, it's probably worth changing the location of the partial sum files, just so History_Data doesn't have several hundred temporary files. In app/um/rose-app.conf set
psum_filename_base='${RUNID}a_s'
so that the files are created in the work directory.
@martin.dix@anu.edu.au changed _comment0 which not transferred by tractive
@martin.dix@anu.edu.au changed status from accepted to closed
@martin.dix@anu.edu.au set resolution to fixed
@martin.dix@anu.edu.au commented
Changed format of processor local files to use 4 digits for the processor number after Roger had a crash using 1064 cores.
Suites should use branches/dev/martindix/vn10.6_gregorian_climate_means@42947
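
Why the digit count matters can be sketched as follows. The comment above does not say what the original field width was or exactly how the crash arose, so this assumes a 3-digit processor field and a hypothetical `psum_NNN` naming pattern. The one firm fact used is that a Fortran fixed-width integer edit descriptor fills the field with asterisks when the value does not fit; the helper below emulates that behaviour in Python.

```python
def fortran_int_field(value, width):
    """Emulate a Fortran Iw.w edit descriptor for a non-negative integer:
    zero-padded to a fixed width, or all asterisks if the value does not fit."""
    text = str(value).zfill(width)
    return text if len(text) <= width else "*" * width

def local_names(base, nprocs, width):
    # Hypothetical per-processor file names, ranks numbered from 0.
    return [f"{base}_{fortran_int_field(rank, width)}" for rank in range(nprocs)]

nprocs = 1064
names3 = local_names("psum", nprocs, 3)   # assumed original width
names4 = local_names("psum", nprocs, 4)   # width after the fix

print(names3[999], names3[1000], names3[1063])
print(len(set(names3)), len(set(names4)))
```

With three digits, every rank from 1000 upwards maps to the same asterisk-filled name, so 1064 processors produce only 1001 distinct file names. With four digits all 1064 names are distinct.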
resolution_fixed
by mrd599@nci.org.au
Issue migrated from trac:310 at 2024-01-31 18:28:17 +1100