E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

Unexplained performance degradation on first job of case compared to subsequent jobs with pm-cpu/muller-cpu (using v3 HR F case) #6733

Open · ndkeen opened this issue 1 week ago

ndkeen commented 1 week ago

When running some longer cases, I noticed that the first job (say 3 months) had a much lower SYPD than all of the subsequent jobs in the same case, where the only difference is that the subsequent jobs read from local restarts. The timing differences show up in every daily time stamp, so they are not confined to init (or writing restarts, etc.). Further, it looks like the additional time is in CAM_run3, which is fairly odd.

This is with an E3SM v3 HR F case on CPU. In this case it's muller-cpu, but other than being slower in I/O due to smaller scratch resources, it behaves the same as pm-cpu. We might expect some differences in init timing: maybe the first job initializes differently than the subsequent jobs, which all read from local restarts -- but that difference would show up in the init timers. Here, I'm seeing the timing difference in every daily time stamp.
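
For reference, a quick way to eyeball the per-day behavior is to pull the daily time stamps from the coupler logs of the two jobs. This is only a sketch: it assumes the coupler log still prints one tStamp line per simulated day, and the log names below are illustrative rather than the actual files from these runs.

# Compare per-day wall-clock progress between the first job and a restart job.
# Assumes CIME's coupler log writes a "tStamp" line per simulated day; the
# file names are placeholders for the two jobs' archived cpl logs.
zgrep tStamp run/cpl.log.868178.*.gz | tail
zgrep tStamp run/cpl.log.868249.*.gz | tail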

Below are two examples, straight copies of cases I ran on muller-cpu.

/global/cfs/cdirs/e3sm/ndk/tmpdata/20241009.v3.F2010-TMSOROC05-Z0015.ne120pg2_r025_icos30.test3.n256t120x1.coilr4.i2.dvsro

muller-login01% zgrep CAM_run3 case_scripts/timing/e3sm_timing_stats.868178.241029-164244.gz case_scripts/timing/e3sm_timing_stats.868249.241029-175657.gz 
case_scripts/timing/e3sm_timing_stats.868178.241029-164244.gz:"a:CAM_run3"                                     -      30720    30720 2.654208e+08   5.398307e+07  1831.666 (  6239      0)  1717.178 ( 22048      0)
case_scripts/timing/e3sm_timing_stats.868249.241029-175657.gz:"a:CAM_run3"                                        -      30720    30720 2.683699e+08   4.721440e+07  1619.878 (  6239      0)  1487.644 (  7904      0)

And I have another longer-running case that shows similar behavior.

muller-login01% pwd
/global/cfs/cdirs/e3sm/ndk/tmpdata/20241025.v3.F2010-TMSOROC05-Z0015.ne120pg2_r025_icos30.NoZMenh.n256t120x1.coilr4.i2.dvsro.hy
muller-login01% zgrep CAM_run3 case_scripts/timing/e3sm_timing_stats.867570.241027-121511.gz case_scripts/timing/e3sm_timing_stats.867574.241027-133107.gz
case_scripts/timing/e3sm_timing_stats.867570.241027-121511.gz:"a:CAM_run3"                                     -      30720    30720 2.654208e+08   5.840219e+07  2020.120 (  1698      0)  1782.316 ( 13686      0)
case_scripts/timing/e3sm_timing_stats.867574.241027-133107.gz:"a:CAM_run3"                                        -      30720    30720 2.683699e+08   4.781989e+07  1633.105 (  6239      0)  1508.410 (  7904      0)

muller-login01% zgrep a:SLMM_csl case_scripts/timing/e3sm_timing_stats.867570.241027-121511.gz case_scripts/timing/e3sm_timing_stats.867574.241027-133107.gz
case_scripts/timing/e3sm_timing_stats.867570.241027-121511.gz:"a:SLMM_csl"                                     -      30720    30720 5.308416e+08   1.004908e+07   698.403 (     8      0)   116.957 ( 29517      0)
case_scripts/timing/e3sm_timing_stats.867574.241027-133107.gz:"a:SLMM_csl"                                        -      30720    30720 5.367398e+08   1.015562e+07   494.818 (    17      0)   117.130 ( 27117      0)

I can also reproduce this with even 5-day cases on only 90 nodes (the above is on 256 nodes), but still with the HR F case.

Sure, the first job covers the first 3 months and the second job a different 3 months -- but I can repeat this behavior over the same set of days. And, of course, we would not expect one time span to show such consistently significant performance differences.

Note that this launch script sets up a branch run with a REF_CASE:

# Run options
readonly MODEL_START_TYPE="branch"  # 'initial', 'continue', 'branch', 'hybrid'
readonly START_DATE="0001-01-01"

# Additional options for 'branch' and 'hybrid'
readonly GET_REFCASE=TRUE
#readonly RUN_REFDIR="/pscratch/sd/t/terai/E3SMv3_dev/20240823.v3.F2010-TMSOROC05-Z0015.ne120pg2_r025_icos30.Nomassfluxadj.pm-cpu/archive/rest/0038-01-01-00000"
readonly RUN_REFDIR="/dvs_ro/cfs/cdirs/e3sm/terai/E3SM_restarts/20240823.v3.F2010-TMSOROC05-Z0015.ne120pg2_r025_icos30.Nomassfluxadj.pm-cpu/0038-01-01-00000"
readonly RUN_REFCASE="20240823.v3.F2010-TMSOROC05-Z0015.ne120pg2_r025_icos30.Nomassfluxadj.pm-cpu"
readonly RUN_REFDATE="0038-01-01"
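
For anyone reproducing this outside the launch script, those settings map roughly onto the standard CIME XML variables. This is a sketch of the equivalent xmlchange calls, not the exact commands the script issues through its own wrappers.

# Approximate CIME equivalents of the branch-run options above (illustrative).
./xmlchange RUN_TYPE=branch
./xmlchange GET_REFCASE=TRUE
./xmlchange RUN_REFDIR="/dvs_ro/cfs/cdirs/e3sm/terai/E3SM_restarts/20240823.v3.F2010-TMSOROC05-Z0015.ne120pg2_r025_icos30.Nomassfluxadj.pm-cpu/0038-01-01-00000"
./xmlchange RUN_REFCASE="20240823.v3.F2010-TMSOROC05-Z0015.ne120pg2_r025_icos30.Nomassfluxadj.pm-cpu"
./xmlchange RUN_REFDATE="0038-01-01"
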
ambrad commented 1 week ago

How often do you see this? Below are some timers of interest for one pair of your runs. I see two main things, both related to the various boundary exchanges:

  1. The slower run is slower almost entirely because the communication takes longer (CEDR_local is a strange exception, but in any case its timer disparity is only a small part of the gap).
  2. The max vs. min times for the boundary exchanges, regardless of run, seem wildly bad, or else I'm misinterpreting or misremembering something about this setup (see the sketch after the timer listings below).
e3sm_timing_stats.867570.241027-121511
"CPL:RUN_LOOP"              30720 2.654208e+08   1.303592e+08  4244.343 (     0      0)  4243.153 ( 22407      0)
"CPL:ICE_RUN"                7680 6.635520e+07   3.033537e+05    43.880 (  8036      0)    38.151 ( 28920      0)
"CPL:LND_RUN"                7680 6.635520e+07   2.027057e+06   305.028 ( 19200      0)   239.313 ( 18768      0)
"CPL:OCNT_RUN"               7680 6.635520e+07   1.461513e+04     2.794 (     0      0)     1.783 (  1000      0)
"CPL:ATM_RUN"               30720 2.654208e+08   1.074550e+08  3583.485 (  8038      0)  3427.263 ( 23695      0)
"a:CAM_run2"                30720 2.654208e+08   2.038929e+07   799.999 ( 13687      0)   526.398 ( 28858      0)
"a:CAM_run3"                30720 2.654208e+08   5.840219e+07  2020.120 (  1698      0)  1782.316 ( 13686      0)
"a:prim_run_subcycle"       30720 5.308416e+08   5.839981e+07  2020.068 (  1698      0)  1782.253 ( 13686      0)
"a:compute_andor_apply_rhs" 30720 1.592525e+10   9.402567e+06   554.237 ( 27953      0)   191.621 (     4      0)
"a:caar_bexchV"             30720 1.592525e+10   5.719391e+06   477.866 ( 27953      0)    62.222 ( 24014      0)
"a:advance_hypervis"        30720 3.185050e+09   3.514140e+06   332.999 ( 29531      0)    64.734 (   149      0)
"a:vertical_remap"          30720 1.592525e+09   7.493347e+05    34.611 ( 11355      0)    16.241 ( 26824      0)
"a:PAT_remap"               30720 5.308416e+08   3.997493e+07  1398.849 (    15      0)  1136.026 ( 28318      0)
"a:SLMM_reconstruct"        30720 5.308416e+08   1.290741e+05     5.676 ( 21596      0)     2.754 ( 27208      0)
"a:SLMM_v2x"                30720 5.308416e+08   1.199458e+05     8.162 ( 18360      0)     2.668 ( 25962      0)
"a:SLMM_csl"                30720 5.308416e+08   1.004908e+07   698.403 (     8      0)   116.957 ( 29517      0)
"a:advance_hypervis_scalar" 30720 5.308416e+08   3.254983e+06   509.355 (  1438      0)    22.302 ( 19034      0)
"a:ah_scalar_bexchV"        30720 3.185050e+09   1.143516e+06   405.546 (   116      0)     6.085 (  9356      0)
"a:CEDR"                    30720 5.308416e+08   1.876471e+07   804.927 ( 27125      0)   279.160 ( 16476      0)
"a:CEDR_local"              30720 5.308416e+08   2.346945e+06   196.466 (     4      0)    15.948 ( 28016      0)
"a:SLMM_bexchV"             30720 5.308416e+08   2.095142e+06   189.647 ( 27659      0)    14.297 ( 26513      0)
"a:SLMM vertical remap"     30720 5.308416e+08   3.655281e+06   146.044 (   838      0)    76.343 ( 27948      0)
"a:CAM_run4"                30720 2.654208e+08   4.065067e+06   157.024 ( 26513      0)   113.838 (    15      0)
"a:CAM_run1"                30720 2.654208e+08   2.454508e+07   924.043 ( 28679      0)   732.546 ( 21353      0)
e3sm_timing_stats.867574.241027-133107
"CPL:RUN_LOOP"              30720 2.683699e+08   1.209881e+08  3939.252 ( 16384      0)  3938.133 (  3825      0)
"CPL:ICE_RUN"                7680 6.709248e+07   2.961514e+05    43.217 ( 15488      0)    37.141 ( 30360      0)
"CPL:LND_RUN"                7680 6.709248e+07   2.267965e+06   335.182 ( 19200      0)   268.936 ( 18768      0)
"CPL:OCNT_RUN"               7680 6.709248e+07   1.499465e+04     2.905 (     0      0)     1.830 (  7612      0)
"CPL:ATM_RUN"               30720 2.683699e+08   9.698622e+07  3246.981 (  8039      0)  3089.476 ( 15589      0)
"a:CAM_run2"                30720 2.683699e+08   2.045468e+07   723.949 ( 20687      0)   533.660 ( 28318      0)
"a:CAM_run3"                30720 2.683699e+08   4.781989e+07  1633.105 (  6239      0)  1508.410 (  7904      0)
"a:prim_run_subcycle"       30720 5.367398e+08   4.781749e+07  1633.059 (  6239      0)  1508.322 (  7904      0)
"a:compute_andor_apply_rhs" 30720 1.610220e+10   8.445844e+06   507.389 ( 29934      0)   202.657 ( 15675      0)
"a:caar_bexchV"             30720 1.610220e+10   4.772122e+06   429.949 ( 29934      0)    60.410 (  1575      0)
"a:advance_hypervis"        30720 3.220439e+09   3.025493e+06   278.846 ( 25436      0)    69.987 ( 10018      0)
"a:vertical_remap"          30720 1.610220e+09   7.612946e+05    33.287 ( 12838      0)    16.446 ( 29450      0)
"a:PAT_remap"               30720 5.367398e+08   3.081427e+07  1074.296 (    31      0)   846.176 ( 25438      0)
"a:SLMM_reconstruct"        30720 5.367398e+08   1.310209e+05     5.900 ( 12839      0)     2.788 ( 29449      0)
"a:SLMM_v2x"                30720 5.367398e+08   1.227579e+05     9.158 ( 20400      0)     2.698 ( 26865      0)
"a:SLMM_csl"                30720 5.367398e+08   1.015562e+07   494.818 (    17      0)   117.130 ( 27117      0)
"a:advance_hypervis_scalar" 30720 5.367398e+08   2.922591e+06   323.855 ( 28557      0)    17.472 ( 17726      0)
"a:ah_scalar_bexchV"        30720 3.220439e+09   9.652018e+05   240.438 ( 30353      0)     5.553 ( 24823      0)
"a:CEDR"                    30720 5.367398e+08   9.763367e+06   503.563 ( 28964      0)   214.416 (   117      0)
"a:CEDR_local"              30720 5.367398e+08   2.373188e+06   123.821 (    31      0)    16.209 ( 30356      0)
"a:SLMM_bexchV"             30720 5.367398e+08   2.117387e+06   188.598 ( 27957      0)    13.811 ( 25553      0)
"a:SLMM vertical remap"     30720 5.367398e+08   3.722375e+06   152.013 ( 16919      0)    77.841 ( 30412      0)
"a:CAM_run4"                30720 2.683699e+08   3.938087e+06   152.999 ( 28433      0)   117.487 (    31      0)
"a:CAM_run1"                30720 2.683699e+08   2.472444e+07   928.463 ( 28679      0)   741.893 ( 12785      0)
ambrad commented 1 week ago

Separately, have you looked at the variation across multiple initial runs? Is it possible the initial-vs-restart discrepancy is an artifact of undersampling the initial case?

It's of course always possible something really subtle related to restart is responsible, but the above timers really shouldn't be affected by the type of run.

rljacob commented 1 week ago

To clarify, the first run is an initial run and the rest are all restart runs?

ambrad commented 1 week ago

More thoughts:

It looks like these jobs run on 256 nodes with 120 ranks/node (leaving 8 cores free for system reasons, I'm guessing); 120 x 256 = 30720 ranks. That leaves a subset of nodes with 2 elements/rank instead of 3.

For studying performance, I would choose the number of nodes such that every rank gets the same number of elements, if possible. For 120 ranks per node and ne120, that would be something like this sequence of node counts: 60 (12 elem/rank), 90 (8), 180 (4), 240 (3), 360 (2).
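
The arithmetic behind those node counts is just the element count of the ne120 cubed sphere (6 * 120^2 = 86400 elements) divided evenly by the total rank count; a quick sketch:

# Elements per rank for ne120 (6*ne^2 = 86400 spectral elements) at
# 120 MPI ranks per node, for the evenly dividing node counts listed above.
for nodes in 60 90 180 240 360; do
  ranks=$((nodes * 120))
  echo "$nodes nodes -> $ranks ranks -> $((86400 / ranks)) elem/rank"
done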

When studying a possible anomaly, I retreat to fewer nodes. It might be worth looking at this issue again on just 60 nodes.