ESCOMP / CMEPS

NUOPC Community Mediator for Earth Prediction Systems
https://escomp.github.io/CMEPS/

Performance degradation for I compsets #258

Open ekluzek opened 3 years ago

ekluzek commented 3 years ago

We brought in NUOPC as the default driver for CTSM in ctsm5.1.dev062. We are seeing a performance degradation in the test PFS_Ld20.f09_g17.I2000Clm50BgcCrop.cheyenne_intel with the NUOPC driver versus the MCT driver.

The PFS test shows the following:
Model Cost:             265.58   pe-hrs/simulated_year 
Model Throughput:       165.91   simulated_years/day 
The previous baseline for the MCT driver showed this:
Model Cost:             214.41   pe-hrs/simulated_year 
Model Throughput:       205.52   simulated_years/day 
The RUN length in TestStatus for the dev05* and dev06* versions varied from 58 to 81 seconds, so up to roughly a 20% variation.
The high-water memory mark is higher than with MCT (67 vs. 131 GB), and the last usage is lower (323 vs. 165 MB).
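
As a sanity check on those metrics: the cost and throughput figures are consistent with each other given the 1836-PE count used for the cost estimate, so the cost increase is entirely the throughput drop rather than a change in PE layout. A quick sketch (Python, values copied from above):

```python
# Quick consistency check: model cost (pe-hrs/simulated_year) should be
# pes * 24 / throughput (simulated_years/day). Values copied from above.
pes = 1836

for driver, throughput in [("NUOPC", 165.91), ("MCT", 205.52)]:
    cost = pes * 24.0 / throughput
    print(f"{driver}: {cost:6.2f} pe-hrs/simulated_year")

# Prints ~265.58 for NUOPC and ~214.41 for MCT, matching the PFS output.
```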
ekluzek commented 3 years ago

The timing files for a test of ctsm5.1.dev062 are here...

/glade/scratch/erik/tests_ctsm51d62acl/PFS_Ld20.f09_g17.I2000Clm50BgcCrop.cheyenne_intel.GC.ctsm51d62acl_int/timing

Also, to compare exactly, here is the timing directory for ctsm5.1.dev061 run with NUOPC...

/glade/scratch/erik/ctsm5.1.dev061/cime/scripts/PFS_Vnuopc_Ld20.f09_g17.I2000Clm50BgcCrop.cheyenne_intel.20211118_171613_6cn3lu/timing

The timing for the MCT run of ctsm5.1.dev061 is here:

/glade/scratch/sacks/tests_1018-161929ch/PFS_Ld20.f09_g17.I2000Clm50BgcCrop.cheyenne_intel.GC.1018-161929ch_int/timing
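
In case it's useful for comparing those directories, here's a minimal sketch (Python) that pulls the per-component run-time lines out of a timing file; the regular expression just matches the "XXX Run Time:" lines quoted below, so it may need tweaking for other entries in the files.

```python
import re
import sys

# Minimal sketch: extract "XXX Run Time: ... seconds" lines from a CESM
# timing file so two runs can be compared side by side.
RUN_TIME = re.compile(r"^\s*(\w+) Run Time:\s+([\d.]+) seconds")

def component_times(path):
    """Return {component: run time in seconds} for one timing file."""
    times = {}
    with open(path) as f:
        for line in f:
            m = RUN_TIME.match(line)
            if m:
                times[m.group(1)] = float(m.group(2))
    return times

if __name__ == "__main__":
    # Usage: python compare_timing.py <timing_file> [<timing_file> ...]
    for path in sys.argv[1:]:
        print(path)
        for comp, secs in sorted(component_times(path).items()):
            print(f"  {comp:>4}: {secs:8.3f} s")
```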

I'll upload some of the timing files in a bit.

ekluzek commented 3 years ago

Here are the NUOPC overall rates:

total pes active           : 1836
mpi tasks per node         : 36
pe count for cost estimate : 1836

Overall Metrics:
Model Cost:             269.96   pe-hrs/simulated_year
Model Throughput:       163.22   simulated_years/day

Init Time   :      90.109 seconds
Run Time    :      29.005 seconds        1.450 seconds/day
Final Time  :       0.299 seconds

Runs Time in total seconds, seconds/model-day, and model-years/wall-day
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components

TOT Run Time:      29.005 seconds        1.450 seconds/mday       163.22 myears/wday
CPL Run Time:       7.551 seconds        0.378 seconds/mday       626.95 myears/wday
ATM Run Time:       7.881 seconds        0.394 seconds/mday       600.75 myears/wday
LND Run Time:      23.766 seconds        1.188 seconds/mday       199.20 myears/wday
ICE Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
OCN Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
ROF Run Time:       0.746 seconds        0.037 seconds/mday      6347.88 myears/wday
GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
ESP Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
CPL COMM Time:      4.540 seconds        0.227 seconds/mday      1042.81 myears/wday

And the MCT overall rates:

total pes active           : 1836
mpi tasks per node         : 36
pe count for cost estimate : 1836

Overall Metrics:
Model Cost:             214.41   pe-hrs/simulated_year
Model Throughput:       205.52   simulated_years/day

Init Time   :      40.784 seconds
Run Time    :      23.036 seconds        1.152 seconds/day
Final Time  :       0.012 seconds

Actual Ocn Init Wait Time     :       0.000 seconds
Estimated Ocn Init Run Time   :       0.000 seconds
Estimated Run Time Correction :       0.000 seconds
  (This correction has been applied to the ocean and total run times)

Runs Time in total seconds, seconds/model-day, and model-years/wall-day
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components

TOT Run Time:      23.036 seconds        1.152 seconds/mday       205.52 myears/wday
CPL Run Time:      16.449 seconds        0.822 seconds/mday       287.81 myears/wday
ATM Run Time:       6.163 seconds        0.308 seconds/mday       768.17 myears/wday
LND Run Time:      20.886 seconds        1.044 seconds/mday       226.67 myears/wday
ICE Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
OCN Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
ROF Run Time:       2.470 seconds        0.124 seconds/mday      1916.70 myears/wday
GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
IAC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
ESP Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
CPL COMM Time:     11.861 seconds        0.593 seconds/mday       399.14 myears/wday

So, interestingly, the NUOPC rates are faster for CPL run, CPL COMM, and ROF, while the main thing that's slower for NUOPC is LND itself, as well as the ratio between the LND run time and the TOT run time, which is odd given that the CPL is faster. DATM is also slower, but since it runs concurrently with LND, that doesn't matter.
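
To make that per-component comparison concrete, here's a small sketch (Python) that takes the run times quoted above and prints the NUOPC/MCT ratios:

```python
# Run times in seconds, copied from the two summaries above.
nuopc = {"TOT": 29.005, "CPL": 7.551, "ATM": 7.881, "LND": 23.766,
         "ROF": 0.746, "CPL COMM": 4.540}
mct   = {"TOT": 23.036, "CPL": 16.449, "ATM": 6.163, "LND": 20.886,
         "ROF": 2.470, "CPL COMM": 11.861}

for comp in nuopc:
    print(f"{comp:>8}: NUOPC/MCT = {nuopc[comp] / mct[comp]:.2f}")

# LND comes out ~1.14x slower and ATM ~1.28x slower under NUOPC, while CPL,
# CPL COMM, and ROF are all faster -- the pattern described above.
```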

ekluzek commented 3 years ago

Note, the NUOPC case gives the following warning in the timing file:

IMPORTANT: Large deviations between Connector times on different PETs are typically indicators of load imbalance in the system. The following Connectors in this profile may indicate a load imbalance:

And the lnd-run time for NUOPC has a min of 9.9 and a max of 23.8 seconds, while MCT shows a min of 8.3 and a max of 20.9.
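
For what it's worth, the max/min spread (a rough proxy for load imbalance across the LND PEs) is similar for the two drivers; it's the max that has grown. A quick check using the numbers above:

```python
# lnd-run min/max times (seconds) quoted above; max/min is a rough
# load-imbalance proxy.
lnd_run = {"NUOPC": (9.9, 23.8), "MCT": (8.3, 20.9)}

for driver, (tmin, tmax) in lnd_run.items():
    print(f"{driver}: max/min = {tmax / tmin:.2f}")

# ~2.40 for NUOPC vs. ~2.52 for MCT, so the relative spread is comparable.
```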

I also wondered about the difference in streams, which should be encapsulated in the bgc_interp time. For NUOPC, the bgc_interp min was 0.36 and max was 0.91; for MCT, the bgc_interp min was 1.05 and max was 1.42, so it doesn't appear to be the difference in streams.

In comparing the timing of different parts of CTSM I don't see anything that sticks out as being a culprit.

ekluzek commented 3 years ago

NUOPC case

ekluzek commented 3 years ago

cesm.ESMF_Profile.summary.nuopc.txt cesm_timing_stats.mct.txt

jkshuman commented 2 years ago

This came up in today's SE meeting regarding slow regional runs with nuopc vs. mct. @jkshuman is running a set of regional tests to look at timing and the impact of setup changes from @ekluzek, Mariana, Sam Levis, and @billsacks.

billsacks commented 2 years ago

It looks like the cases leading to @jkshuman 's comment may not have been an apples-to-apples comparison. We're investigating further (see https://github.com/ESCOMP/CTSM/issues/1907).

wwieder commented 2 months ago

We'd like better performance with nuopc, but don't have an mct option any more. Close this issue?

wwieder commented 2 months ago

More broadly, it seems like nuopc is slower for everything but B cases. Is this the way things are supposed to work?

samsrabin commented 2 months ago

If we're okay with that, then close this issue. Otherwise, keep it open, but maybe it's not CTSM's responsibility as it affects other components too.

mvertens commented 1 month ago

@wwieder @samsrabin - extensive performance tests were carried out with the nuopc framework, and it was determined that there would be a performance penalty on the order of ~5%. However, given the advantages of the new framework (no mapping files, exchange grid, Antarctic-Greenland coupling, creation of CDEPS), it was decided that these advantages far outweighed the performance cost. @jedwards4b can comment more, since he helped with the performance analysis.