You'll also need to update the config.yaml for the new ncpus and mem
Right, thanks @dougiesquire
I can't remember where we got those block sizes from; we should get better performance if we can reduce max_blocks (say to 10?) by setting the block sizes differently.
Sorry, I was wrong last week - we did put in a patch for max_blocks ... you can remove it from the namelist. It's still good to check the logs to get it closer to 10.
The process would be - pick number of procs, then set block_size_x & block_size_y such that the blocks are close to square, and there are around 10 per PE (ideally also nx_global is divisible by block_size_x and ny_global is divisible by block_size_y)
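A rough sketch of that procedure in Python (this is not taken from the thread - the 8-cell minimum comes from the CICE docs quoted later, and the grid size and PE count in the example call are the 0.25 deg values discussed further down):
import math

def candidate_block_sizes(nx_global, ny_global, nprocs, target_blocks_per_pe=10):
    """Print a few (block_size_x, block_size_y) options, squarest first."""
    options = []
    # only consider block sizes that divide the global grid evenly and give
    # blocks of at least 8 cells in each direction
    for bx in (d for d in range(8, nx_global + 1) if nx_global % d == 0):
        for by in (d for d in range(8, ny_global + 1) if ny_global % d == 0):
            nblocks = (nx_global // bx) * (ny_global // by)
            blocks_per_pe = math.ceil(nblocks / nprocs)
            if blocks_per_pe <= target_blocks_per_pe:
                squareness = abs(bx - by) / max(bx, by)  # 0 means square blocks
                options.append((squareness, -blocks_per_pe, bx, by, blocks_per_pe))
    for _, _, bx, by, bpp in sorted(options)[:5]:
        print(f"block_size_x={bx:4d}  block_size_y={by:4d}  ~{bpp} blocks per PE")

# 0.25 deg grid with a 96-PE CICE layout (values from later in this thread)
candidate_block_sizes(nx_global=1440, ny_global=1080, nprocs=96)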
We can also remove debug_blocks - but whilst setting the block size it provides useful information
I came across this issue again https://github.com/COSIMA/access-om3/issues/156, where I forgot to adjust the pio settings after changing the CICE layout.
pio_numiotasks = 5
pio_rearranger = 1
pio_root = 1
pio_stride = 48
@anton-seaice I understand the calculations, but could you please clarify why the ICE pio settings are configured this way? Will this improve the performance?
The error message isn’t very intuitive, making it difficult for users to realise that they need to modify these parameters when changing the layout.
Can we revert it to the settings used in the 1deg configuration, here https://github.com/ACCESS-NRI/access-om3-configs/blob/2bc6107ef1b195aa62485a5d87c4ba834996d8cc/nuopc.runconfig#L364-L373?
ICE_modelio::
diro = ./log
logfile = ice.log
pio_async_interface = .false.
pio_netcdf_format = nothing
pio_numiotasks = 1
pio_rearranger = 1
pio_root = 0
pio_stride = 48
pio_typename = netcdf4p
I can't remember where we got those block sizes from
The block sizes were adopted from the OM2 report, which specifies a CICE5 block size of 30x27, with a square-ice processor shape and roundrobin distribution type.
It's still good to check the logs to get it closer to 10.
I can't remember - why should the number of blocks be close to 10?
@anton-seaice I understand the calculations, but could you please clarify why the ICE pio settings are configured this way? Will this improve the performance?
In the old COSIMA TWG minutes from OM2 development (on the COSIMA website), the recommendation from NCI was to use one IO task per node. I think the Yang 2019 work on Parallel I/O in MOM5 makes a similar suggestion? I guess there is a hardware benefit to one task per node. There are so many options that it's hard to know what the best combination is without lots of work - e.g. we could also test having a dedicated IO PE, or changing the PIO_rearranger.
I think one IO task per node is a good start. We could try just one IO task, it might not make much difference at this resolution.
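For concreteness, a small sketch of the "one IO task per node" arithmetic. Two assumptions that are mine rather than from this thread: 48 cores per node (consistent with the pio_stride above), and PIO's stride-based layout placing IO tasks on component ranks pio_root, pio_root+pio_stride, pio_root+2*pio_stride, ...:
def pio_one_task_per_node(ntasks, cores_per_node=48, pio_root=1):
    """Suggest pio_numiotasks/pio_stride for one IO task per node."""
    pio_numiotasks = max(1, ntasks // cores_per_node)
    pio_stride = cores_per_node
    io_ranks = [pio_root + i * pio_stride for i in range(pio_numiotasks)]
    # every IO rank must fall inside the component's communicator, otherwise
    # PIO fails with a non-obvious error (the situation in issue #156 above)
    assert io_ranks[-1] < ntasks, "pio settings exceed the component task count"
    return pio_numiotasks, pio_stride

print(pio_one_task_per_node(240))  # -> (5, 48): the quoted settings would match a 240-task layout
print(pio_one_task_per_node(96))   # -> (2, 48)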
The error message isn’t very intuitive, making it difficult for users to realise that they need to modify these parameters when changing the layout.
I agree. Does it make a separate ESMF log file? I think they have names something like PETXX.ESMF...
The block sizes were adopted from the OM2 report, which specifies a CICE5 block size of 30x27, with a square-ice processor shape and roundrobin distribution type.
Ok thanks!
I can't remember - why should the number of blocks be close to 10?
From the CICE docs:
Smaller, more numerous blocks provides an opportunity for better load balance by allocating each processor both ice-covered and ice-free blocks. But smaller, more numerous blocks becomes less efficient due to MPI communication associated with halo updates. In practice, blocks should probably not have fewer than about 8 to 10 grid cells in each direction, and more square blocks tend to optimize the volume-to-surface ratio important for communication cost. Often 3 to 8 blocks per processor provide the decompositions flexibility to create reasonable load balance configurations.
So we should actually aim for number of blocks of 8 or less by the sounds of it :)
I think one IO task per node is a good start. We could try just one IO task, it might not make much difference at this resolution.
I agree for the current phase. I will do a test on the I/O tasks to verify the optimal configuration.
Does it make a separate ESMF log file? I think they have names something like PETXX.ESMF...
This can be enabled by setting create_esmf_pet_files to true in drv_in, but it should be used mostly for debugging purposes, not in production runs.
And would it be helpful to add a comment after the PE setup for ice_ntasks to reference this issue?
For example: ice_ntasks = 96 #NB: Parallel I/O github.com/COSIMA/access-om3/issues/214
This would inform users meeting the issue about the current setup, and we can remove the comment once the I/O is optimised.
So we should actually aim for number of blocks of 8 or less by the sounds of it :)
The updated settings result in a max_blocks of 5, within the range of 3-8 blocks per processor, which aligns with the CICE docs.
&domain_nml
block_size_x = 60
block_size_y = 54
distribution_type = "roundrobin"
distribution_wght = "latitude"
maskhalo_bound = .true.
maskhalo_dyn = .true.
maskhalo_remap = .true.
max_blocks = -1
ns_boundary_type = "tripole"
nx_global = 1440
ny_global = 1080
processor_shape = "square-ice"
/
When setting max_blocks = -1 with the roundrobin distribution type, the max_blocks prescribed by CICE does not always match the actual number of ice blocks. E.g., with the above configuration, max_blocks is set to 6, but the log shows a warning:
block_size_x,_y = 60 54
max_blocks = 6
Number of ghost cells = 1

(ice_read_global_nc) min, max, sum = -1.41413909065909
1.57079632679490 154674.873407807 ulat
(ice_read_global_nc) min, max, sum = 0.000000000000000E+000
1.00000000000000 969809.000000000 kmt
ice_domain work_unit, max_work_unit = 28035 10
ice_domain nocn = 0 280343 44787740
ice_domain work_per_block = 0 11 2204
ice: total number of blocks is 391
********WARNING***********
(init_domain_distribution)
WARNING: ice no. blocks too large: decrease max to 5
Despite this warning, I don’t believe it will impact overall performance since MOM typically has a much higher computational load than CICE.
NB: max_blocks = -1 with the rake distribution type fails.
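For reference, a quick check of where the 5 comes from, using the namelist above, the 391 ocean blocks reported in the log, and assuming 96 CICE PEs (taken from the ice_ntasks = 96 example earlier in this thread):
import math

nx_global, ny_global = 1440, 1080
block_size_x, block_size_y = 60, 54
nprocs = 96                    # assumed CICE PE count

nblocks_total = (nx_global // block_size_x) * (ny_global // block_size_y)
print(nblocks_total)           # 480 blocks before land-block elimination

nblocks_ocean = 391            # "total number of blocks is 391" in the log
print(math.ceil(nblocks_ocean / nprocs))  # 5, matching the "decrease max to 5" warning
print(math.ceil(nblocks_total / nprocs))  # 5 even without land-block elimination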
Why do you think max_blocks shouldn't be 5?
It can be 5, but we have to manually modify it to be 5
Oh sorry, I see now. That's something about the patch we put into access-om3 0.3.x for removing max_blocks, and the max_blocks calculation being approximate. When we update the CICE version it should go away (after https://github.com/CICE-Consortium/CICE/pull/954).
It will allocate ~20% more memory than it uses (6 blocks allocated vs the 5 actually needed), but it is a small enough amount of memory that there probably isn't a performance impact.
I created https://github.com/payu-org/payu/pull/496 to add checks for the iolayout numbers
Closed through https://github.com/ACCESS-NRI/access-om3-configs/pull/114
To update the PE layout for the 0.25 deg configuration, nuopc.runconfig, ice_in and config.yaml require corresponding modifications. Note: these changes will be updated when the configuration is revised.