ekluzek opened this issue 3 years ago
The timing files for a ctsm5.1.dev062 test are here...
/glade/scratch/erik/tests_ctsm51d62acl/PFS_Ld20.f09_g17.I2000Clm50BgcCrop.cheyenne_intel.GC.ctsm51d62acl_int/timing
Also, for an exact comparison, here is the timing directory for ctsm5.1.dev061 run with NUOPC...
/glade/scratch/erik/ctsm5.1.dev061/cime/scripts/PFS_Vnuopc_Ld20.f09_g17.I2000Clm50BgcCrop.cheyenne_intel.20211118_171613_6cn3lu/timing
The timing for the MCT run of ctsm5.1.dev061 is here:
/glade/scratch/sacks/tests_1018-161929ch/PFS_Ld20.f09_g17.I2000Clm50BgcCrop.cheyenne_intel.GC.1018-161929ch_int/timing
I'll upload some of the timing files in a bit.
Here are the NUOPC overall rates:
```
  total pes active           : 1836
  mpi tasks per node         : 36
  pe count for cost estimate : 1836

  Overall Metrics:
    Model Cost:             269.96   pe-hrs/simulated_year
    Model Throughput:       163.22   simulated_years/day

    Init Time   :      90.109 seconds
    Run Time    :      29.005 seconds        1.450 seconds/day
    Final Time  :       0.299 seconds

  Runs Time in total seconds, seconds/model-day, and model-years/wall-day
  CPL Run Time represents time in CPL pes alone, not including time associated
  with data exchange with other components

    TOT Run Time:      29.005 seconds        1.450 seconds/mday       163.22 myears/wday
    CPL Run Time:       7.551 seconds        0.378 seconds/mday       626.95 myears/wday
    ATM Run Time:       7.881 seconds        0.394 seconds/mday       600.75 myears/wday
    LND Run Time:      23.766 seconds        1.188 seconds/mday       199.20 myears/wday
    ICE Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    OCN Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    ROF Run Time:       0.746 seconds        0.037 seconds/mday      6347.88 myears/wday
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    ESP Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    CPL COMM Time:      4.540 seconds        0.227 seconds/mday      1042.81 myears/wday
```
And the MCT overall rates:
```
  total pes active           : 1836
  mpi tasks per node         : 36
  pe count for cost estimate : 1836

  Overall Metrics:
    Model Cost:             214.41   pe-hrs/simulated_year
    Model Throughput:       205.52   simulated_years/day

    Init Time   :      40.784 seconds
    Run Time    :      23.036 seconds        1.152 seconds/day
    Final Time  :       0.012 seconds

    Actual Ocn Init Wait Time     :       0.000 seconds
    Estimated Ocn Init Run Time   :       0.000 seconds
    Estimated Run Time Correction :       0.000 seconds
      (This correction has been applied to the ocean and total run times)

  Runs Time in total seconds, seconds/model-day, and model-years/wall-day
  CPL Run Time represents time in CPL pes alone, not including time associated
  with data exchange with other components

    TOT Run Time:      23.036 seconds        1.152 seconds/mday       205.52 myears/wday
    CPL Run Time:      16.449 seconds        0.822 seconds/mday       287.81 myears/wday
    ATM Run Time:       6.163 seconds        0.308 seconds/mday       768.17 myears/wday
    LND Run Time:      20.886 seconds        1.044 seconds/mday       226.67 myears/wday
    ICE Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    OCN Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    ROF Run Time:       2.470 seconds        0.124 seconds/mday      1916.70 myears/wday
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    IAC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    ESP Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    CPL COMM Time:     11.861 seconds        0.593 seconds/mday       399.14 myears/wday
```
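As a sanity check on the two summaries, the overall metrics can be re-derived from the PE count, the run length, and the measured run time. Here is a minimal sketch (assuming the Ld20 test corresponds to 20 simulated days and a 365-day no-leap calendar; the function name is just for illustration):

```python
# Rough re-derivation of "Model Cost" and "Model Throughput" from the reported
# run times, assuming 1836 PEs, an Ld20 run (20 simulated days), and a
# 365-day (no-leap) calendar.
SECONDS_PER_HOUR = 3600.0
SECONDS_PER_DAY = 86400.0
DAYS_PER_YEAR = 365.0


def overall_metrics(run_time_s, pes=1836, sim_days=20):
    sim_years = sim_days / DAYS_PER_YEAR
    cost = pes * run_time_s / SECONDS_PER_HOUR / sim_years   # pe-hrs/simulated_year
    throughput = sim_years / (run_time_s / SECONDS_PER_DAY)  # simulated_years/day
    return cost, throughput


print("NUOPC:", overall_metrics(29.005))  # ~ (269.96, 163.22)
print("MCT:  ", overall_metrics(23.036))  # ~ (214.41, 205.52)
```

If this reconstruction is right, the cost/throughput gap comes entirely from the ~6 s difference in run time; the larger init time for NUOPC (90 s vs. 41 s) does not enter these metrics.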
So, interestingly, the NUOPC rates are faster for CPL run, CPL COMM, and ROF, while the main things that are slower for NUOPC are LND itself and the relationship between the LND run time and the TOT run time: the time outside of LND (TOT minus LND) is about 5.2 s for NUOPC vs. 2.2 s for MCT, which is odd when the CPL is faster. DATM is also slower, but since it runs concurrently with LND that doesn't matter.
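For reference, here is a quick tabulation of the per-component differences; this is just a sketch over numbers copied by hand from the two summaries above, not parsed from the timing files:

```python
# NUOPC vs. MCT component run times (seconds), copied from the summaries above.
nuopc = {"TOT": 29.005, "CPL": 7.551, "ATM": 7.881, "LND": 23.766,
         "ROF": 0.746, "CPL COMM": 4.540}
mct = {"TOT": 23.036, "CPL": 16.449, "ATM": 6.163, "LND": 20.886,
       "ROF": 2.470, "CPL COMM": 11.861}

# Print the absolute and relative difference for each component.
for comp in nuopc:
    diff = nuopc[comp] - mct[comp]
    print(f"{comp:8s} NUOPC-MCT = {diff:+7.3f} s ({diff / mct[comp]:+.0%} vs MCT)")
```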
Note that the NUOPC case gives the following warning in the timing file:

```
IMPORTANT: Large deviations between Connector times on different PETs are
typically indicators of load imbalance in the system. The following Connectors
in this profile may indicate a load imbalance:
```
And the LND run time for NUOPC has a min of 9.9 s and a max of 23.8 s, while MCT shows a min of 8.3 s and a max of 20.9 s.
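As a rough way to quantify the spread the warning refers to, here is the max/min ratio of the per-task LND run times, using only the min/max values quoted above:

```python
# Max/min spread of the per-task LND run times quoted above (seconds).
for driver, tmin, tmax in [("NUOPC", 9.9, 23.8), ("MCT", 8.3, 20.9)]:
    print(f"{driver}: LND run max/min = {tmax / tmin:.2f}")
```

By this measure the spread is similar for the two drivers (roughly 2.4x for NUOPC and 2.5x for MCT).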
I also wondered about the difference in streams, which should be encapsulated in the bgc_interp time. For NUOPC, bgc_interp min was 0.36 and max was 0.91; for MCT, bgc_interp min was 1.05 and max was 1.42. So it doesn't appear to be the difference in streams.
In comparing the timing of different parts of CTSM I don't see anything that sticks out as being a culprit.
This came up in today's SE meeting regarding slow regional runs with NUOPC vs. MCT. @jkshuman is running a set of regional tests to look at timing and the impact of setup changes from @ekluzek, Mariana, Sam Levis, and @billsacks.
It looks like the cases leading to @jkshuman's comment may not have been an apples-to-apples comparison. We're investigating further (see https://github.com/ESCOMP/CTSM/issues/1907).
We'd like better performance with nuopc, but don't have an mct option any more. Close this issue?
More broadly, it seems like NUOPC is slower for everything but B cases. Is this the way things are supposed to work?
If we're okay with that, then close this issue. Otherwise, keep it open, but maybe it's not CTSM's responsibility as it affects other components too.
@wwieder @samsrabin - extensive performance tests were carried out with the NUOPC framework, and it was determined that there would be a performance penalty on the order of 5%. However, given the advantages of the new framework (no mapping files, the exchange grid, Antarctic/Greenland coupling, the creation of CDEPS), it was decided that the advantages far outweighed the performance cost. @jedwards4b can comment more, since he helped with the performance analysis.
We brought in NUOPC as the default driver for CTSM in ctsm5.1.dev062. We are seeing a performance degradation in the PFS_Ld20.f09_g17.I2000Clm50BgcCrop.cheyenne_intel test with the NUOPC driver versus the MCT driver.