COSIMA / access-om3

ACCESS-OM3 global ocean-sea ice-wave coupled model

[0.25deg] update PE layout #214

Closed: minghangli-uni closed 2 months ago

minghangli-uni commented 2 months ago

To update the PE layout for the 0.25 deg configuration, nuopc.runconfig, ice_in and config.yaml require corresponding modifications. Note: These changes will be updated when the configuration is revised.

nuopc.runconfig

PELAYOUT_attributes::
  atm_ntasks = 48
  atm_nthreads = 1
  atm_pestride = 1
  atm_rootpe = 0
  cpl_ntasks = 96
  cpl_nthreads = 1
  cpl_pestride = 1
  cpl_rootpe = 0
  esmf_logging = ESMF_LOGKIND_NONE
  esp_ntasks = 1
  esp_nthreads = 1
  esp_pestride = 1
  esp_rootpe = 0
  glc_ntasks = 1
  glc_nthreads = 1
  glc_pestride = 1
  glc_rootpe = 0
  ice_ntasks = 96
  ice_nthreads = 1
  ice_pestride = 1
  ice_rootpe = 0
  lnd_ntasks = 1
  lnd_nthreads = 1
  lnd_pestride = 1
  lnd_rootpe = 0
  ninst = 1
  ocn_ntasks = 1344
  ocn_nthreads = 1
  ocn_pestride = 1
  ocn_rootpe = 96
  pio_asyncio_ntasks = 0
  pio_asyncio_rootpe = 1
  pio_asyncio_stride = 0
  rof_ntasks = 48
  rof_nthreads = 1
  wav_ntasks = 1
  wav_nthreads = 1
  wav_pestride = 1
  wav_rootpe = 0
::
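
As a quick sanity check on the totals (a minimal sketch; the rank accounting is read off the rootpe/ntasks values above): ATM, CPL and ICE all start at rootpe 0 and share ranks 0-95, while the 1344 OCN tasks occupy ranks 96-1439.

# Sketch: the job size is set by where the OCN block of ranks ends.
ocn_rootpe, ocn_ntasks = 96, 1344
ncpus = ocn_rootpe + ocn_ntasks  # 1440, the value requested in config.yaml below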

ice_in

&domain_nml
  block_size_x = 30
  block_size_y = 27
  distribution_type = "roundrobin"
  distribution_wght = "latitude"
  maskhalo_bound = .true.
  maskhalo_dyn = .true.
  maskhalo_remap = .true.
  max_blocks = 20
  ns_boundary_type = "tripole"
  nx_global = 1440
  ny_global = 1080
  processor_shape = "square-ice"
  debug_blocks = .true.
/

config.yaml

queue: normal
ncpus: 1440
jobfs: 10GB
mem: 5760GB

walltime: 24:00:00
jobname: 025deg_jra55do_ryf

model: access-om3

dougiesquire commented 2 months ago

You'll also need to update the config.yaml for the new ncpus and mem

minghangli-uni commented 2 months ago

Right, thanks @dougiesquire

anton-seaice commented 2 months ago

I can't remember where we got those block sizes from, but we should get better performance if we can reduce max_blocks (say to 10?) by setting the block sizes differently.

Sorry, I was wrong last week - we did put in a patch for max_blocks ... you can remove it from the namelist. It's still good to check the logs to get it closer to 10.

The process would be: pick the number of procs, then set block_size_x and block_size_y such that the blocks are close to square and there are around 10 per PE (ideally nx_global is also divisible by block_size_x and ny_global by block_size_y) - see the sketch below.
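
As a rough illustration of that recipe (a minimal sketch in Python; the grid and PE numbers are taken from the 0.25deg configuration above):

import math

# Blocks per PE for candidate block sizes on the 1440x1080 grid with 96 ice PEs.
nx_global, ny_global, nprocs = 1440, 1080, 96

def blocks_per_pe(bx, by):
    nblocks = math.ceil(nx_global / bx) * math.ceil(ny_global / by)
    return nblocks / nprocs

print(blocks_per_pe(30, 27))  # 20.0 -> the current layout, hence max_blocks = 20
print(blocks_per_pe(60, 54))  # 5.0  -> fewer, larger, still near-square blocks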

We can also remove debug_blocks, but it provides useful information while setting the block sizes.

minghangli-uni commented 2 months ago

I came across this issue again (https://github.com/COSIMA/access-om3/issues/156), where I forgot to adjust the pio settings after changing the CICE layout.

     pio_numiotasks = 5
     pio_rearranger = 1
     pio_root = 1
     pio_stride = 48

@anton-seaice I understand the calculations, but could you please clarify why the ICE pio settings are configured this way? Will this improve performance?

The error message isn’t very intuitive, making it difficult for users to realise that they need to modify these parameters when changing the layout.

Can we revert it to the settings used in the 1deg configuration, here https://github.com/ACCESS-NRI/access-om3-configs/blob/2bc6107ef1b195aa62485a5d87c4ba834996d8cc/nuopc.runconfig#L364-L373?

ICE_modelio::
     diro = ./log
     logfile = ice.log
     pio_async_interface = .false.
     pio_netcdf_format = nothing
     pio_numiotasks = 1
     pio_rearranger = 1
     pio_root = 0
     pio_stride = 48
     pio_typename = netcdf4p

minghangli-uni commented 2 months ago

I can't remember where we got those block sizes from

The block sizes were adopted from the OM2 report, which specifies a CICE5 block size of 30x27, with a square-ice processor shape and roundrobin distribution type.

It's still good to check the logs to get it closer to 10.

I can't remember: why should the number of blocks be close to 10?

anton-seaice commented 2 months ago

@anton-seaice I understand the calculations, but could you please clarify why the ICE pio settings are configured this way? Will this improve performance?

In the old COSIMA TWG minutes from OM2 development (on the COSIMA website), the recommendation from NCI was to use one IO task per node. I think Yang 2019 on parallel I/O in MOM5 makes a similar suggestion? I guess there is a hardware benefit to one task per node. There are so many options that it's hard to know what the best combination is without lots of work, e.g. we could also test having a dedicated IO PE, or changing the PIO_rearranger.

I think one IO task per node is a good start. We could try just one IO task; it might not make much difference at this resolution.
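
For concreteness, a minimal sketch of the node arithmetic (assuming 48 cores per node on Gadi's normal queue, which is an assumption here): a pio_stride equal to the node size spaces the IO tasks one per node, and the IO task count then follows from the component's task count.

cores_per_node = 48                        # assumption: Gadi normal-queue node size
ice_ntasks, pio_stride = 96, cores_per_node
pio_numiotasks = ice_ntasks // pio_stride  # 2: one IO task per node spanned by ICE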

The error message isn’t very intuitive, making it difficult for users to realise that they need to modify these parameters when changing the layout.

I agree. Does it make a separate ESMF log file? I think they have names something like PETXX.ESMF...

The block sizes were adopted from the OM2 report, which specifies a CICE5 block size of 30x27, with a square-ice processor shape and roundrobin distribution type.

Ok thanks!

I can't remember: why should the number of blocks be close to 10?

From the CICE docs:

Smaller, more numerous blocks provides an opportunity for better load balance by allocating each processor both ice-covered and ice-free blocks. But smaller, more numerous blocks becomes less efficient due to MPI communication associated with halo updates. In practice, blocks should probably not have fewer than about 8 to 10 grid cells in each direction, and more square blocks tend to optimize the volume-to-surface ratio important for communication cost. Often 3 to 8 blocks per processor provide the decomposition flexibility to create reasonable load balance configurations.

So we should actually aim for 8 or fewer blocks per processor, by the sounds of it :)

minghangli-uni commented 2 months ago

I think one IO task per node is a good start. We could try just one IO task; it might not make much difference at this resolution.

I agree for the current phase. I will run a test on the I/O tasks to verify the optimal configuration.

does it make a separate ESMF log file? I think they have names something like PETXX.ESMF...

This can be enabled by setting create_esmf_pet_files to true in drv_in, but it should be used mostly for debugging purposes, not in production runs. Would it also be helpful to add a comment after the PE setup for ice_ntasks to reference this issue? For example:

ice_ntasks = 96 # NB: Parallel I/O github.com/COSIMA/access-om3/issues/214

This would inform users who hit the issue about the current setup, and we can remove the comment once the I/O is optimised.

So we should actually aim for 8 or fewer blocks per processor, by the sounds of it :)

The updated settings result in a max_blocks of 5, within the range of 3-8 blocks per processor recommended by the CICE docs.

&domain_nml
  block_size_x = 60
  block_size_y = 54
  distribution_type = "roundrobin"
  distribution_wght = "latitude"
  maskhalo_bound = .true.
  maskhalo_dyn = .true.
  maskhalo_remap = .true.
  max_blocks = -1
  ns_boundary_type = "tripole"
  nx_global = 1440
  ny_global = 1080
  processor_shape = "square-ice"
/

minghangli-uni commented 2 months ago

When setting max_blocks = -1 with the roundrobin distribution type, the max_blocks prescribed by CICE does not always match the actual number of ice blocks. E.g., with the above configuration, max_blocks is set to 6, but the log shows a warning:

  block_size_x,_y       =     60    54
  max_blocks            =      6
  Number of ghost cells =      1

 (ice_read_global_nc) min, max, sum =   -1.41413909065909
   1.57079632679490        154674.873407807      ulat
 (ice_read_global_nc) min, max, sum =   0.000000000000000E+000
   1.00000000000000        969809.000000000      kmt
 ice_domain work_unit, max_work_unit =        28035          10
 ice_domain nocn =            0      280343    44787740
 ice_domain work_per_block =            0          11        2204
 ice: total number of blocks is         391
  ********WARNING***********
 (init_domain_distribution)
  WARNING: ice no. blocks too large: decrease max to           5

Despite this warning, I don’t believe it will impact overall performance since MOM typically has a much higher computational load than CICE.
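
The arithmetic behind the suggested value (a minimal sketch using the numbers from the log): a roundrobin deal of the surviving ocean blocks over the ice PEs caps the per-PE count at the ceiling of their ratio.

import math

# 391 blocks remain after land-only blocks are eliminated (see the log above);
# dealt roundrobin over 96 PEs, no PE holds more than ceil(391/96) blocks.
ocean_blocks, nprocs = 391, 96
print(math.ceil(ocean_blocks / nprocs))  # 5, the value the warning suggests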

NB: max_blocks = -1 with the rake distribution type fails.

anton-seaice commented 2 months ago

Why do you think max_blocks shouldn't be 5?

minghangli-uni commented 2 months ago

It can be 5, but we have to manually set it to 5.

anton-seaice commented 2 months ago

Oh sorry, I see now. That's something about the patch we put into access-om3 0.3.x for removing max_blocks, and the max_blocks calculation being approximate. When we update the CICE version it should go away (after https://github.com/CICE-Consortium/CICE/pull/954).

It will allocate ~20% more memory than it uses, but the amount of memory is small enough that there probably isn't a performance impact.

anton-seaice commented 2 months ago

I created https://github.com/payu-org/payu/pull/496 to add checks for the iolayout numbers
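
The constraint being checked is presumably the usual PIO one (a minimal sketch; check_pio_layout is an illustrative name, not payu's actual API): every IO rank implied by root, stride and numiotasks must land on an existing compute task.

def check_pio_layout(ntasks, numiotasks, root, stride):
    # Illustrative only; the real validation lives in payu-org/payu#496.
    last_io_rank = root + (numiotasks - 1) * stride
    if last_io_rank >= ntasks:
        raise ValueError(
            f"IO task on rank {last_io_rank} exceeds the component's {ntasks} tasks"
        )

check_pio_layout(ntasks=96, numiotasks=1, root=0, stride=48)  # fine: 1deg-style settings
check_pio_layout(ntasks=96, numiotasks=5, root=1, stride=48)  # raises: the stale layout from #156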

anton-seaice commented 2 months ago

Closed through https://github.com/ACCESS-NRI/access-om3-configs/pull/114