micaeljtoliveira opened 8 months ago
In order to tune the processor layout, the first step was to understand how each component scales. I've thus run a series of scaling runs, varying the number of CPU cores from 1 to 288 on the Cascade Lake partition of Gadi (a similar study was also done for Sapphire Rapids, with very similar results). The configuration is taken unchanged, except for the duration of the simulation, which is restricted to one model day.
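To make the sweep easy to reproduce, the runs can be generated by a small driver script. The following is only a sketch under assumptions (payu-style `config.yaml` as shown in the diff further down, 4 GB of memory per Cascade Lake core, hypothetical core counts); the actual launch script also has to adjust the PE layout in `nuopc.runconfig` and submit each run, which is omitted here:

```python
import os
import yaml

# Hypothetical subset of the core counts benchmarked (1 to 288).
core_counts = [1, 2, 4, 8, 16, 24, 48, 96, 144, 192, 240, 288]

with open("config.yaml") as f:
    base_config = yaml.safe_load(f)

for ncpus in core_counts:
    config = dict(base_config)
    config["ncpus"] = ncpus
    # Gadi Cascade Lake nodes provide 4 GB of memory per core (48 cores, 192 GB).
    config["mem"] = f"{4 * ncpus}GB"

    run_dir = f"run_{ncpus:03d}"
    os.makedirs(run_dir, exist_ok=True)
    with open(os.path.join(run_dir, "config.yaml"), "w") as f:
        yaml.safe_dump(config, f)
    # Editing nuopc.runconfig (*_ntasks entries) and submitting each run is omitted.
```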
The profiling is done using the ESMF instrumentation implemented in the NUOPC coupler. Region names are therefore quite generic and not always obvious.
First we look at the full calculation (`[ESMF]`) and the top-level regions: initialisation (`[ensemble] Init 1`), finalisation (`[ensemble] FinalizePhase1`), and time-stepping (`[ensemble] RunPhase1`).
Unfortunately, the parallel efficiency of the time-stepping region, which is the most important one, quickly drops below 60%.
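For reference, parallel efficiency here means the usual ratio of ideal to measured scaling of the wall times. A minimal sketch of the bookkeeping, with placeholder timings rather than the measured values:

```python
def parallel_efficiency(timings, ref_cores=1):
    """Speed-up and efficiency relative to the ref_cores run.

    timings: dict mapping number of cores -> wall time (seconds).
    """
    t_ref = timings[ref_cores]
    results = {}
    for ncores, t in sorted(timings.items()):
        speedup = t_ref / t
        efficiency = speedup * ref_cores / ncores
        results[ncores] = (speedup, efficiency)
    return results

# Placeholder wall times for the time-stepping region, not the measured ones.
print(parallel_efficiency({1: 3600.0, 48: 90.0, 288: 25.0}))
```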
Looking then at the scaling of each component:
From these plots it is clear that the efficiency of CICE and of the data models drops much faster than that of MOM6, but the latter is also not great. Note that for CICE the number of cores used was capped at 76 (see https://github.com/COSIMA/access-om3/issues/91), so all data points at higher core counts should be disregarded.
The above scaling results prompted a more detailed look at MOM6, for which detailed profiling data is available through the FMS instrumentation. Looking at the top-level regions is already enough to identify a problematic region:
From the above plots, it seems that the surface forcing scales much worse than the other regions (here we ignore the initialisation, as it is not part of the time-stepping).
I'm now having a closer look at the code in question to try to understand what is going on.
When using the NUOPC coupler, it is possible to run some of the components concurrently (see here for more details). As is quite clear from the above plots, MOM6 is the component that takes the longest to execute a time-step, so it should be beneficial to run it concurrently with the other components. To get some idea of how this works and the kind of speed-up we might get, I ran the test case from the previous benchmark with a fixed total number of cores (48), varying how those cores are distributed between the ocean component and all the others.
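Concretely, the concurrent layout is expressed in the `PELAYOUT_attributes` block of `nuopc.runconfig`: each component gets its own `*_ntasks`/`*_rootpe` pair, and giving the ocean a non-zero `ocn_rootpe` places it on PEs the other components do not use. A minimal sketch for the 48-core case with 45 ocean cores (the non-ocean entries are illustrative and several components are omitted; a full diff for a 240-core layout is posted further down in this thread):

```
PELAYOUT_attributes::
   atm_ntasks = 3
   atm_rootpe = 0
   ice_ntasks = 3
   ice_rootpe = 0
   ocn_ntasks = 45
   ocn_rootpe = 3
::
```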
The first thing to look at is again the total runtime and the time-stepping region:
This plot shows that the time-stepping region takes less and less time as we increase the number of cores used by MOM6, until it actually starts increasing at around 45 cores. This confirms the previous scaling results that showed no significant speed-up with core count for the ice and atmosphere components.
Comparing the best result for the time-stepping region obtained here (45 ocean cores) with the calculation using the same total number of cores from the previous benchmark, we observe quite a good speed-up: ~35 seconds instead of ~45 seconds.
It is also quite instructive to see the time spent in each region as a function of the number of ocean cores:
Note that here we are plotting the total time spent in a given region instead of the average over cores. This is necessary for an apples-to-apples comparison, as each region is executed by a different number of cores. The plot also includes several of the mediator operations, as these can now take a significant fraction of the total runtime.
The surface forcing issue described above has a very interesting origin.
In order to keep the script that launches all the runs for each benchmark simple, I set the CICE option `max_blocks` of the `domain` namelist to a large number (1000). This leads to a large memory footprint of the CICE data, which seems to mess up the cache access for the MOM6 data. Setting `max_blocks` to a more reasonable value for each core count leads to improved scalability, and the percentage of time spent in the surface forcing routines no longer increases as much.
Relevant plots:
> In order to keep the script that launches all the runs for each benchmark simple, I set the CICE option `max_blocks` of the `domain` namelist to a large number (1000). This leads to a large memory footprint of the CICE data, which seems to mess up the cache access for the MOM6 data. Setting `max_blocks` to a more reasonable value for each core count leads to improved scalability, and the percentage of time spent in the surface forcing routines no longer increases as much.
Yes, @micaeljtoliveira is correct. As is documented in `access-om3/CICE/CICE/cicecore/shared/ice_domain_size.F90`:

```fortran
!*** The model will inform the user of the correct
!*** values for the parameter below. A value higher than
!*** necessary will not cause the code to fail, but will
!*** allocate more memory than is necessary. A value that
!*** is too low will cause the code to exit.
!*** A good initial guess is found using
!*** max_blocks = (nx_global/block_size_x)*(ny_global/block_size_y)/
!***              num_procs
```
However, if `max_blocks < 1`, its value can be automatically calculated, as shown in `access-om3/CICE/CICE/cicecore/cicedyn/infrastructure/ice_domain.F90`:

```fortran
if (my_task == master_task) then
   if (max_blocks < 1) then
      max_blocks=( ((nx_global-1)/block_size_x + 1) * &
                   ((ny_global-1)/block_size_y + 1) - 1) / nprocs + 1
      max_blocks=max(1,max_blocks)
      write(nu_diag,'(/,a52,i6,/)') &
         '(ice_domain): max_block < 1: max_block estimated to ',max_blocks
   endif
endif
```
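To put numbers on this, the same estimate can be reproduced outside the model. A minimal sketch in Python, assuming (purely for illustration, not taken from the actual config) a 360x300 grid decomposed into 36x30 blocks:

```python
def estimate_max_blocks(nx_global, ny_global, block_size_x, block_size_y, nprocs):
    """Mirror of the CICE estimate in ice_domain.F90 (integer arithmetic)."""
    nblocks_x = (nx_global - 1) // block_size_x + 1
    nblocks_y = (ny_global - 1) // block_size_y + 1
    return max(1, (nblocks_x * nblocks_y - 1) // nprocs + 1)

# Hypothetical 1deg grid and block sizes, for a few of the benchmarked core counts.
for nprocs in (1, 24, 48, 76):
    print(nprocs, estimate_max_blocks(360, 300, 36, 30, nprocs))
```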
Is setting `max_blocks` to 0 a better choice?
> Is setting `max_blocks` to 0 a better choice?
This is what I want to test next, along with the automatic MOM6 processor land mask.
Latest update:
I tried setting `max_blocks` to 0 in `ice_in`, but this does not seem to work. I believe that's because the algorithm to determine `max_blocks` does not work with the distribution we are using for the 1deg config.
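In practice this means `max_blocks` still has to be set explicitly per core count in the `domain_nml` group of `ice_in`. A minimal sketch of such an entry, with illustrative values (following the hypothetical estimate above rather than the actual 1deg settings):

```
&domain_nml
    nprocs = 24
    block_size_x = 36
    block_size_y = 30
    max_blocks = 5
/
```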
I also tried setting `AUTO_MASKTABLE = True` in `MOM_input` and that works, but with one caveat: the code will stop with an error if there are no domains to mask. That means that, in those cases, one needs to explicitly set `AUTO_MASKTABLE = False` or remove the keyword from the file.
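For reference, the `MOM_input` side of this is just the single parameter discussed above; a sketch of the two situations:

```
! Layouts that have land-only domains to eliminate: let MOM6 build the mask table at runtime.
AUTO_MASKTABLE = True

! Layouts with nothing to mask: disable it (or drop the keyword from the file altogether).
AUTO_MASKTABLE = False
```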
As expected, using a mask table improves the scalability at higher core counts, but it's not a game-changer for 1deg. Here is the plot of the parallel efficiency of the different components:
> I also tried setting `AUTO_MASKTABLE = True` in `MOM_input` and that works, but with one caveat: the code will stop with an error if there are no domains to mask. That means that, in those cases, one needs to explicitly set `AUTO_MASKTABLE = False` or remove the keyword from the file. As expected, using a mask table improves the scalability at higher core counts,
Did you do this with the latest MOM6 version? https://github.com/COSIMA/MOM6-CICE6/issues/38#issuecomment-2030815803

Do you mean that we still need a mask table as in OM2? As was mentioned here, https://github.com/COSIMA/access-om3/issues/122#issue-2193973697,

> In MOM6, there is no longer an 'ocean_mask' file and the mask is set at runtime by the MINIMUM_DEPTH parameter in MOM and the topography.

If I understand it correctly, the mask is already set by MOM6 at runtime, but we still need to add a mask_table? Will the mask_table used in OM2 be consistent with the mask auto-generated by MOM6?
> Did you do this with the latest MOM6 version? https://github.com/COSIMA/MOM6-CICE6/issues/38#issuecomment-2030815803
Yes.
> Do you mean that we still need a mask table as in OM2? As was mentioned here, https://github.com/COSIMA/access-om3/issues/122#issue-2193973697,
That's a different type of mask. `AUTO_MASKTABLE` deals with the processor mask, not the land mask (although the two are related).
In case it's useful, I've been running my 1deg test runs with:
```diff
diff --git a/config.yaml b/config.yaml
index 0ff9d28..fce8794 100644
--- a/config.yaml
+++ b/config.yaml
@@ -9,12 +9,12 @@
# shortpath: /scratch/v45
queue: normal
-ncpus: 48
-jobfs: 10GB
-mem: 192GB
+ncpus: 240
+mem: 960GB
diff --git a/nuopc.runconfig b/nuopc.runconfig
index 8c630c0..0fdb9da 100644
--- a/nuopc.runconfig
+++ b/nuopc.runconfig
@@ -27,11 +27,11 @@ DRIVER_attributes::
::
PELAYOUT_attributes::
- atm_ntasks = 48
+ atm_ntasks = 24
atm_nthreads = 1
atm_pestride = 1
atm_rootpe = 0
- cpl_ntasks = 48
+ cpl_ntasks = 24
cpl_nthreads = 1
cpl_pestride = 1
cpl_rootpe = 0
@@ -40,31 +40,31 @@ PELAYOUT_attributes::
esp_nthreads = 1
esp_pestride = 1
esp_rootpe = 0
- glc_ntasks = 48
+ glc_ntasks = 1
glc_nthreads = 1
glc_pestride = 1
glc_rootpe = 0
- ice_ntasks = 48
+ ice_ntasks = 24
ice_nthreads = 1
ice_pestride = 1
ice_rootpe = 0
- lnd_ntasks = 48
+ lnd_ntasks = 1
lnd_nthreads = 1
lnd_pestride = 1
lnd_rootpe = 0
ninst = 1
- ocn_ntasks = 48
+ ocn_ntasks = 216
ocn_nthreads = 1
ocn_pestride = 1
- ocn_rootpe = 0
+ ocn_rootpe = 24
pio_asyncio_ntasks = 0
pio_asyncio_rootpe = 1
pio_asyncio_stride = 0
- rof_ntasks = 48
+ rof_ntasks = 24
rof_nthreads = 1
rof_pestride = 1
rof_rootpe = 0
- wav_ntasks = 48
+ wav_ntasks = 1
wav_nthreads = 1
wav_pestride = 1
   wav_rootpe = 0
```
Thanks @dougiesquire. Are you planning to merge this into the 1-deg ryf branch now? I believe it would be beneficial for those (me, now) running the 1-deg configuration.
It would be really great if someone else could make this change in all our 1 deg configs (that's kinda why I dumped it here)
Sure, I can make this change.
This issue is to keep track of the profiling/optimization done to efficiently run the 1deg configuration.