micaeljtoliveira opened 8 months ago
In order to tune the processor layout, the first step was to understand how each component scales. I've thus run a series of scaling runs, varying the number of CPU cores from 1 to 288 on the Cascade Lake partition of Gadi (a similar study was also done for Sapphire Rapids, with very similar results). The configuration is taken unchanged, except for the duration of the simulation, which is restricted to one model day.
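To make the sweep easy to reproduce, the runs can be generated by a small driver script. The following is only a sketch under assumptions (payu-style `config.yaml` as shown in the diff further down, 4 GB of memory per Cascade Lake core, hypothetical core counts); the actual launch script also has to adjust the PE layout in `nuopc.runconfig` and submit each run, which is omitted here:

```python
import os
import yaml

# Hypothetical subset of the core counts benchmarked (1 to 288).
core_counts = [1, 2, 4, 8, 16, 24, 48, 96, 144, 192, 240, 288]

with open("config.yaml") as f:
    base_config = yaml.safe_load(f)

for ncpus in core_counts:
    config = dict(base_config)
    config["ncpus"] = ncpus
    # Gadi Cascade Lake nodes provide 4 GB of memory per core (48 cores, 192 GB).
    config["mem"] = f"{4 * ncpus}GB"

    run_dir = f"run_{ncpus:03d}"
    os.makedirs(run_dir, exist_ok=True)
    with open(os.path.join(run_dir, "config.yaml"), "w") as f:
        yaml.safe_dump(config, f)
    # Editing nuopc.runconfig (*_ntasks entries) and submitting each run is omitted.
```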
The profiling is done using the ESMF instrumentation implemented in the NUOPC coupler. Region names are therefore quite generic and not always obvious.
First we look at the full calculation (`[ESMF]`) and the top-level regions: initialisation (`[ensemble] Init 1`), finalisation (`[ensemble] FinalizePhase1`), and time-stepping (`[ensemble] RunPhase1`).
Unfortunately, the parallel efficiency of the time-stepping region, which is the most important one, quickly drops below 60%.
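For reference, parallel efficiency here means the usual ratio of ideal to measured scaling of the wall times. A minimal sketch of the bookkeeping, with placeholder timings rather than the measured values:

```python
def parallel_efficiency(timings, ref_cores=1):
    """Speed-up and efficiency relative to the ref_cores run.

    timings: dict mapping number of cores -> wall time (seconds).
    """
    t_ref = timings[ref_cores]
    results = {}
    for ncores, t in sorted(timings.items()):
        speedup = t_ref / t
        efficiency = speedup * ref_cores / ncores
        results[ncores] = (speedup, efficiency)
    return results

# Placeholder wall times for the time-stepping region, not the measured ones.
print(parallel_efficiency({1: 3600.0, 48: 90.0, 288: 25.0}))
```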
Looking then at the scaling of each component:
From these plots it is clear that the efficiency of CICE and of the data models drops much faster than that of MOM6, but the latter is also not great. Note that for CICE the number of cores used was capped at 76 (see https://github.com/COSIMA/access-om3/issues/91), so all data points at higher core counts should be disregarded.
The above scaling results prompted a more detailed look at MOM6, for which detailed profiling data is available through the FMS instrumentation. Looking at the top-level regions is already enough to identify a problematic region:
From the above plots, it seems that the surface forcing scales much worse than the other regions (here we ignore the initialisation, as it is not part of the time-stepping).
I'm now having a closer look at the code in question to try to understand what is going on.
When using the NUOPC coupler, it is possible to run some of the components concurrently (see here for more details). As is quite clear from the above plots, MOM6 is the component that takes the longest to execute a time-step, so it should be beneficial to run it concurrently with the other components. To get some idea of how this works and the kind of speed-up we might get, I ran the test case from the previous benchmark with a fixed total number of cores (48), varying how those cores are distributed between the ocean component and all the others.
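Concretely, the concurrent layout is expressed in the `PELAYOUT_attributes` block of `nuopc.runconfig`: each component gets its own `*_ntasks`/`*_rootpe` pair, and giving the ocean a non-zero `ocn_rootpe` places it on PEs the other components do not use. A minimal sketch for the 48-core case with 45 ocean cores (the non-ocean entries are illustrative and several components are omitted; a full diff for a 240-core layout is posted further down in this thread):

```
PELAYOUT_attributes::
   atm_ntasks = 3
   atm_rootpe = 0
   ice_ntasks = 3
   ice_rootpe = 0
   ocn_ntasks = 45
   ocn_rootpe = 3
::
```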
The first thing to look at is again the total runtime and the time-stepping region:
This plot shows that the time-stepping region takes less and less time as we increase the number of cores used by MOM6, until it actually starts increasing at around 45 cores. This confirms the previous scaling results that showed no significant speed-up with core count for the ice and atmosphere components.
Comparing the best result for the time-stepping region obtained here (45 ocean cores) with the calculation using the same total number of cores from the previous benchmark, we observe quite a good speed-up: ~35 seconds instead of ~45 seconds.
It is also quite instructive to see the time spent in each region as a function of the number of ocean cores:
Note that here we are plotting the total time spent in a given region instead of the average over cores. This is necessary for an apples-to-apples comparison, as each region is executed by a different number of cores. The plot also includes several of the mediator operations, as these can now take a significant fraction of the total runtime.
The surface forcing issue described above has a very interesting origin.
In order to keep the script that launches all the runs for each benchmark simple, I set the CICE option `max_blocks` of the `domain` namelist to a large number (1000). This leads to a large memory footprint of the CICE data, which seems to mess up the cache access for the MOM6 data. Setting `max_blocks` to a more reasonable value for each core count leads to improved scalability, and the percentage of time spent in the surface forcing routines no longer increases as much.
Relevant plots:
> In order to keep the script that launches all the runs for each benchmark simple, I set the CICE option `max_blocks` of the `domain` namelist to a large number (1000). This leads to a large memory footprint of the CICE data, which seems to mess up the cache access for the MOM6 data. Setting `max_blocks` to a more reasonable value for each core count leads to improved scalability, and the percentage of time spent in the surface forcing routines no longer increases as much.
Yes, @micaeljtoliveira is correct. As is documented in `access-om3/CICE/CICE/cicecore/shared/ice_domain_size.F90`:

```fortran
!*** The model will inform the user of the correct
!*** values for the parameter below. A value higher than
!*** necessary will not cause the code to fail, but will
!*** allocate more memory than is necessary. A value that
!*** is too low will cause the code to exit.
!*** A good initial guess is found using
!*** max_blocks = (nx_global/block_size_x)*(ny_global/block_size_y)/
!***              num_procs
```
However, if `max_blocks < 1`, its value can be automatically calculated, as shown in `access-om3/CICE/CICE/cicecore/cicedyn/infrastructure/ice_domain.F90`:

```fortran
if (my_task == master_task) then
   if (max_blocks < 1) then
      max_blocks=( ((nx_global-1)/block_size_x + 1) * &
                   ((ny_global-1)/block_size_y + 1) - 1) / nprocs + 1
      max_blocks=max(1,max_blocks)
      write(nu_diag,'(/,a52,i6,/)') &
         '(ice_domain): max_block < 1: max_block estimated to ',max_blocks
   endif
endif
```
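To put numbers on this, the same estimate can be reproduced outside the model. A minimal sketch in Python, assuming (purely for illustration, not taken from the actual config) a 360x300 grid decomposed into 36x30 blocks:

```python
def estimate_max_blocks(nx_global, ny_global, block_size_x, block_size_y, nprocs):
    """Mirror of the CICE estimate in ice_domain.F90 (integer arithmetic)."""
    nblocks_x = (nx_global - 1) // block_size_x + 1
    nblocks_y = (ny_global - 1) // block_size_y + 1
    return max(1, (nblocks_x * nblocks_y - 1) // nprocs + 1)

# Hypothetical 1deg grid and block sizes, for a few of the benchmarked core counts.
for nprocs in (1, 24, 48, 76):
    print(nprocs, estimate_max_blocks(360, 300, 36, 30, nprocs))
```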
Is setting `max_blocks` to 0 a better choice?
> Is setting `max_blocks` to 0 a better choice?
This is what I want to test next, along with the automatic MOM6 processor land mask.
Latest update:
I tried setting `max_blocks` to 0 in `ice_in`, but this does not seem to work. I believe that's because the algorithm to determine `max_blocks` does not work with the distribution we are using for the 1deg config.
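In practice this means `max_blocks` still has to be set explicitly per core count in the `domain_nml` group of `ice_in`. A minimal sketch of such an entry, with illustrative values (following the hypothetical estimate above rather than the actual 1deg settings):

```
&domain_nml
    nprocs = 24
    block_size_x = 36
    block_size_y = 30
    max_blocks = 5
/
```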
I also tried setting `AUTO_MASKTABLE = True` in `MOM_input` and that works, but with one caveat: the code will stop with an error if there are no domains to mask. That means that, in those cases, one needs to explicitly set `AUTO_MASKTABLE = False` or remove the keyword from the file.
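For reference, the `MOM_input` side of this is just the single parameter discussed above; a sketch of the two situations:

```
! Layouts that have land-only domains to eliminate: let MOM6 build the mask table at runtime.
AUTO_MASKTABLE = True

! Layouts with nothing to mask: disable it (or drop the keyword from the file altogether).
AUTO_MASKTABLE = False
```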
As expected, using a mask table improves the scalability at higher core counts, but it's not a game-changer for 1deg. Here is the plot of the parallel efficiency of the different components:
> I also tried setting `AUTO_MASKTABLE = True` in `MOM_input` and that works, but with one caveat: the code will stop with an error if there are no domains to mask. That means that, in those cases, one needs to explicitly set `AUTO_MASKTABLE = False` or remove the keyword from the file. As expected, using a mask table improves the scalability at higher core counts,
Did you do this with the latest MOM6 version? https://github.com/COSIMA/MOM6-CICE6/issues/38#issuecomment-2030815803

Do you mean that we still need a mask table as in OM2? As was mentioned here, https://github.com/COSIMA/access-om3/issues/122#issue-2193973697,

> In MOM6, there is no longer an 'ocean_mask' file and the mask is set at runtime by the MINIMUM_DEPTH parameter in MOM and the topography.

If I understand it correctly, the mask is already set by MOM6 at runtime, but we still need to add a mask_table? Will the mask_table used in OM2 be consistent with the mask auto-generated by MOM6?
> Did you do this with the latest MOM6 version? https://github.com/COSIMA/MOM6-CICE6/issues/38#issuecomment-2030815803
Yes.
> Do you mean that we still need a mask table as in OM2? As was mentioned here, https://github.com/COSIMA/access-om3/issues/122#issue-2193973697,
That's a different type of mask. `AUTO_MASKTABLE` deals with the processor mask, not the land mask (although the two are related).
In case it's useful, I've been running my 1deg test runs with:
```diff
diff --git a/config.yaml b/config.yaml
index 0ff9d28..fce8794 100644
--- a/config.yaml
+++ b/config.yaml
@@ -9,12 +9,12 @@
# shortpath: /scratch/v45
queue: normal
-ncpus: 48
-jobfs: 10GB
-mem: 192GB
+ncpus: 240
+mem: 960GB
diff --git a/nuopc.runconfig b/nuopc.runconfig
index 8c630c0..0fdb9da 100644
--- a/nuopc.runconfig
+++ b/nuopc.runconfig
@@ -27,11 +27,11 @@ DRIVER_attributes::
::
PELAYOUT_attributes::
- atm_ntasks = 48
+ atm_ntasks = 24
atm_nthreads = 1
atm_pestride = 1
atm_rootpe = 0
- cpl_ntasks = 48
+ cpl_ntasks = 24
cpl_nthreads = 1
cpl_pestride = 1
cpl_rootpe = 0
@@ -40,31 +40,31 @@ PELAYOUT_attributes::
esp_nthreads = 1
esp_pestride = 1
esp_rootpe = 0
- glc_ntasks = 48
+ glc_ntasks = 1
glc_nthreads = 1
glc_pestride = 1
glc_rootpe = 0
- ice_ntasks = 48
+ ice_ntasks = 24
ice_nthreads = 1
ice_pestride = 1
ice_rootpe = 0
- lnd_ntasks = 48
+ lnd_ntasks = 1
lnd_nthreads = 1
lnd_pestride = 1
lnd_rootpe = 0
ninst = 1
- ocn_ntasks = 48
+ ocn_ntasks = 216
ocn_nthreads = 1
ocn_pestride = 1
- ocn_rootpe = 0
+ ocn_rootpe = 24
pio_asyncio_ntasks = 0
pio_asyncio_rootpe = 1
pio_asyncio_stride = 0
- rof_ntasks = 48
+ rof_ntasks = 24
rof_nthreads = 1
rof_pestride = 1
rof_rootpe = 0
- wav_ntasks = 48
+ wav_ntasks = 1
wav_nthreads = 1
wav_pestride = 1
   wav_rootpe = 0
```
Thanks @dougiesquire. Are you planning to merge this into the 1-deg ryf branch now? I believe it would be beneficial for those (me, now) running the 1-deg configuration.
It would be really great if someone else could make this change in all our 1 deg configs (that's kinda why I dumped it here)
Sure, I can make this change.
This issue is to keep track of the profiling/optimization done to efficiently run the 1deg configuration.