COSIMA / mom6-panan

Pan-Antarctic regional configuration of MOM6

MOM6-SIS2 scaling #20

Open adele-morrison opened 1 year ago

adele-morrison commented 1 year ago

Investigate the scaling of MOM6-SIS2 (probably the global-01 config). What are the bottlenecks, how can we use more cores?

For our upcoming compute bonanza, we will potentially be given up to 100k cores to utilise. Can we scale that big?

@AndyHoggANU volunteered @angus-g and @micaeljtoliveira to look into this.

adele-morrison commented 1 year ago

@aekiss's estimate: "If we retain the current tile size in the new 1/40°, that will be ~12k cores. To get to 100k cores, we need tiles that are sqrt(100/12) ≈ 2.9x smaller in each dimension. How many grid cells is that?"
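For reference, the arithmetic behind that estimate (a minimal sketch; the ~12k-core figure for the 1/40° grid is taken from the comment above, the rest is just the tile-area scaling):

```python
import math

# Cores available at the current tile size on the 1/40-degree grid (from the estimate above).
current_cores = 12_000
target_cores = 100_000

# The core count scales with the number of tiles, so the tile area must shrink by
# target/current, i.e. each tile dimension shrinks by the square root of that ratio.
shrink = math.sqrt(current_cores / target_cores)

print(f"each tile edge shrinks to ~{shrink:.2f}x its current length")
print(f"(i.e. tiles ~{1 / shrink:.1f}x smaller in each dimension)")
# -> roughly 0.35x, i.e. about 2.9x smaller in each dimension
```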

adele-morrison commented 1 year ago

Summary from the meeting today:

@micaeljtoliveira it would be great if you could put some of your figures here that you showed in the meeting today.

micaeljtoliveira commented 1 year ago

Here is a summary of scaling tests up to now. The profiling information is obtained directly from the MOM6 output and provides some metrics for several profiling regions.

1/10th deg

The test configuration consists of a 3-month run. Runs start from a restart, so restart IO is included in the timings. The core counts used were 483, 962, 1821 and 3870.
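The speed-up and parallel efficiency in the plots below are derived from the wall-clock times reported for each profiling region, relative to the smallest core count. A minimal sketch of that calculation (the timings here are placeholders, not the measured values):

```python
# Hypothetical wall-clock times (seconds) for one profiling region, keyed by core count.
# These values are placeholders; the real ones come from the clock summary in the MOM6 output.
times = {483: 4000.0, 962: 2100.0, 1821: 1200.0, 3870: 750.0}

ref_cores = min(times)   # baseline: the smallest core count (483)
ref_time = times[ref_cores]

for cores, t in sorted(times.items()):
    speedup = ref_time / t            # measured speed-up relative to the 483-core run
    ideal = cores / ref_cores         # ideal (linear) speed-up
    efficiency = speedup / ideal      # parallel efficiency in [0, 1]
    print(f"{cores:5d} cores: speed-up {speedup:5.2f} (ideal {ideal:5.2f}), efficiency {efficiency:4.2f}")
```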

Plots of average time, parallel speed-up and parallel efficiency for selected regions (figures attached to the original issue):

- Average time (seconds) spent in the top-level regions
- Average time (seconds) spent in the different components
- Parallel speed-up for the top-level regions (dotted line indicates ideal speed-up)
- Parallel speed-up for the different components (dotted line indicates ideal speed-up)
- Parallel efficiency for the top-level regions
- Parallel efficiency for the different components

In this case, a reasonable parallel efficiency is obtained up to 1821 cores. The runs are currently done with 962 cores, but it might be worth using 1821, especially because that layout leaves only 3 idle cores (48x38 - 1821 = 3) instead of 46 (48x21 - 962 = 46).
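The idle-core counts follow from the difference between the processor layout and the number of active (non-masked) ocean PEs; a quick sanity check of the arithmetic, assuming the layouts quoted above:

```python
# Processor layouts quoted above, mapped to the number of active ocean PEs.
# Idle cores are ranks in the layout that get no work (e.g. masked land-only tiles).
layouts = {(48, 21): 962, (48, 38): 1821}

for (nx, ny), active in layouts.items():
    total = nx * ny
    print(f"layout {nx}x{ny}: {total} ranks, {active} active, {total - active} idle")
# -> 48x21: 1008 ranks, 46 idle; 48x38: 1824 ranks, 3 idle
```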

1/20th deg

The test configuration consists of a 9-day run and also starts from a restart. The core counts used were 956, 1850, 3717 and 7430.

Plots of average time, parallel speed-up and parallel efficiency for selected regions (figures attached to the original issue):

- Average time (seconds) spent in the top-level regions
- Average time (seconds) spent in the different components
- Parallel speed-up for the top-level regions (dotted line indicates ideal speed-up)
- Parallel speed-up for the different components (dotted line indicates ideal speed-up)
- Parallel efficiency for the top-level regions
- Parallel efficiency for the different components

In the benchmark for this configuration the main loop accounts for a smaller fraction of the total runtime, so the effect of initialization and termination on the parallel efficiency of a full production run is probably overestimated here. In any case, the parallel efficiency when using 3717 cores is still quite good.
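One way to make that concrete is to compute the efficiency of the main (time-stepping) loop separately from the full run, so that the roughly fixed initialization/termination cost does not dominate the short 9-day benchmark. A sketch with placeholder timings:

```python
# Placeholder timings (seconds): total run time and the portion spent in the main loop.
# Initialization/termination is roughly fixed, so it weighs more heavily in a short
# benchmark than it would in a long production run.
runs = {               # cores -> (total time, main-loop time); illustrative values only
    956:  (1500.0, 1100.0),
    3717: ( 700.0,  300.0),
}

ref = min(runs)
for cores, (total, loop) in sorted(runs.items()):
    scale = cores / ref
    eff_full = (runs[ref][0] / total) / scale   # efficiency measured on the full run
    eff_loop = (runs[ref][1] / loop) / scale    # efficiency of the main loop alone
    print(f"{cores:5d} cores: full-run efficiency {eff_full:.2f}, main-loop efficiency {eff_loop:.2f}")
```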

micaeljtoliveira commented 1 year ago

A possible next step is to check the effect of changing the IO layout (currently set to 1,1).
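For context, the IO layout (the IO_LAYOUT parameter in MOM_input, and its SIS2 counterpart) controls how many distributed files each output stream is split into; with 1,1 a single file is written per stream, while a larger layout writes multiple files that are typically recombined afterwards (e.g. with mppnccombine). A small check of a candidate layout, with illustrative values only:

```python
# Check that a candidate IO layout evenly divides the processor layout in each dimension,
# which FMS generally requires for distributed output. Values below are illustrative.
layout = (48, 38)      # processor layout used for the 1821-core 1/10-deg runs
io_layout = (4, 2)     # candidate IO layout; 1,1 is the current (single-file) setting

divides = all(n % m == 0 for n, m in zip(layout, io_layout))
files_per_stream = io_layout[0] * io_layout[1]
print(f"io_layout {io_layout} divides layout {layout}: {divides}; "
      f"{files_per_stream} distributed files per output stream")
```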

access-hive-bot commented 1 year ago

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/cosima-twg-meeting-minutes-mar-2023/545/1