E3SM-Project / E3SM


Intermittent runtime error in init: `surfrd_veg_all ERROR: sum of wt_cft not 1.0` on pm-cpu. Solved: at least 2 nodes are suspect #6469

Open ndkeen opened 5 months ago

ndkeen commented 5 months ago

With this test SMS_Ld2.ne30pg2_r05_IcoswISC30E3r5.BGCEXP_CNTL_CNPRDCTC_1850.pm-cpu_intel.elm-bgcexp, I see the following error:

 130:  surfrd_veg_all ERROR: sum of wt_cft not 1.0 at nl=       12131  and t=
 130:            1
 130:  sum is:   0.000000000000000E+000
 130:  ENDRUN:
 130:  ERROR in surfrdUtilsMod.F90 at line 75

and I'm pretty sure I've seen this same error (with the same or a similar test) before, which may suggest an intermittent issue.

I see Rob also ran into this: https://github.com/E3SM-Project/E3SM/issues/6192

ndkeen commented 3 months ago

I see this again with ERS_Ld30.f45_f45.IELMFATES.pm-cpu_intel.elm-fates_satphen in testing on next of Aug 27th.

crterai commented 2 months ago

I also ran into this error twice while running a v3.HR F2010 case (out of 6 recent submissions). First case: job id 29807591

 1442:  surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl=       17311  and t=
 1442:            1
 1442:  sum is:    2.00000068306984
 1442:  ENDRUN:
 1442:  ERROR in surfrdUtilsMod.F90 at line 75

Second case: job id 29997275.240831

28750:  surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl=      345011  and t=
28750:            1
28750:  sum is:    1.83000000902540
28750:  ENDRUN:
28750:  ERROR in surfrdUtilsMod.F90 at line 75
/pscratch/sd/t/terai/E3SMv3_dev/20240823.v3.F2010-TMSOROC05-Z0015_plus4K.ne120pg2_r025_icos30.oro_conv_gw_tunings.pm-cpu/run

NDK: For these 2 jobids, the first contains nid004324; the second does not.

wlin7 commented 2 months ago

And here are the ne512 land-only spin-up runs that encountered the same error. The failure occurs randomly (resubmitting, sometimes more than once, can overcome it, and in repeated failures the reported error can occur at a different column with a different sum). All occur during initialization. The first number after the job id is the process element id, as seen in e3sm.log. The runs were using pm-cpu.

28289399.240719-160327 28746: surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl= 1654999
28490072.240724-142113  1154: surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl=   66493
29745228.240827-193708 15562: surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl=  895981
29906097.240829-064757  1282: surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl=   73863
29425140.240816-021759 15818: surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl= 1819771

NDK: It was most useful for Wuyin to include job ids here.

Of those 5 above, 28490072 and 29906097 included the potential bad node, whereas 28289399, 29745228, and 29425140 did not.

ndkeen commented 2 months ago

Minor update: it is starting to look like this issue (as well as the other similar one linked) may not have a testcase dependency (i.e. compset/resolution). I don't think it will be easy to reproduce with the smaller tests, as the frequency seems rare. But as Wuyin reports several of these with his setup, there might be a greater chance of hitting the error there (possibly simply due to the larger number of MPI's?). I could either try to reproduce what Wuyin is doing, or simply run the cases above with more MPI's.

I'm also trying to update the Intel compiler (with other module version changes) in https://github.com/E3SM-Project/E3SM/pull/6596, so I will try a few tests with that (but again, if the frequency is rare, it may not be easy to tell if this has any impact at all).

ndkeen commented 2 months ago

With next of 9/15, the following tests hit this error:

ERS.ELM_USRDAT.IELM.pm-cpu_intel.elm-surface_water_dynamics
ERS.f09_g16.I1850ELMCN.pm-cpu_intel.elm-bgcinterface
ERS_Vmoab.ne4pg2_oQU480.WCYCL1850NS.pm-cpu_intel
ERS.r05_r05.IELM.pm-cpu_intel.elm-lnd_rof_2way
SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.pm-cpu_intel.allactive-wcprodssp
PET_Ln9_PS.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_intel.allactive-mach-pet (case2run)

ndkeen commented 2 months ago

As we had several tests hit this error (normally 0, every now and then 1), I tried to see if I could repeat it with one of the 1-node tests above, ERS.ELM_USRDAT.IELM.pm-cpu_intel.elm-surface_water_dynamics. I launched 32 of these tests with a Sep 6th checkout of next and another 32 with my branch to update the Intel compiler. All of them passed (since it's an ERS test, each one goes through init twice, so that's 128 passes through init without issue).

I also tried several tests with ERS.r05_r05.IELM.pm-cpu_intel.elm-lnd_rof_2way, as well as the same test using more tasks, ERS_P1024.r05_r05.IELM.pm-cpu_intel.elm-lnd_rof_2way. All passing so far: about 24 cases of ERS.r05_r05.IELM.pm-cpu_intel.elm-lnd_rof_2way on next of Sep 6th, and another 24 cases using the newer Intel.

And then about 15 cases of ERS_P1024.r05_r05.IELM.pm-cpu_intel.elm-lnd_rof_2way using Sep 6th next, and about 10 cases of the same test using the newer Intel compiler. All passing.

crterai commented 2 months ago

Just to document here that another submission of an F2010 case at ne120 ran into the error:

/pscratch/sd/t/terai/E3SMv3_dev/20240823.v3.F2010-TMSOROC05-Z0015.ne120pg2_r025_icos30.Nomassfluxadj.pm-cpu/run/e3sm.log.30555710.240918-042651

  122:  surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl=        1475  and t=
  122:            1
  122:  sum is:    1.93000027398218
  122:  ENDRUN:
  122:  ERROR in surfrdUtilsMod.F90 at line 75
  122:
  122:
  122:
  122:
  122:
  122:
  122:  ERROR: Unknown error submitted to shr_abort_abort.

NDK: noting that this job includes the 4324 bad node.

wlin7 commented 2 months ago

Note that this appears to be more than one type of error. The issue was created for incidents of surfrd_veg_all ERROR: sum of wt_cft not 1.0. Recently, with higher resolution, we mainly saw occurrences of sum of wt_nat_patch not 1.0, at high frequency.

The backtrace is the same -- both errors come from the same check_sums routine, called for different fields.
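
For reference, here is a minimal standalone sketch of the kind of weight check behind these aborts (names, shapes, and tolerance are illustrative; the real check is the check_sums routine in surfrdUtilsMod.F90, which aborts via endrun): each land cell's patch/CFT weights read from the surface dataset must sum to 1.

```fortran
program check_weights
  implicit none
  integer, parameter  :: r8  = selected_real_kind(12)
  real(r8), parameter :: eps = 1.e-13_r8   ! illustrative tolerance
  real(r8) :: wt(2,3)                      ! weights: (cell nl, patch/cft)
  real(r8) :: sum_wt
  integer  :: nl

  wt(1,:) = [0.2_r8, 0.3_r8, 0.5_r8]       ! valid: sums to 1.0
  wt(2,:) = [0.0_r8, 0.0_r8, 0.0_r8]       ! like some reports above: sum is 0.0

  do nl = 1, size(wt, 1)
     sum_wt = sum(wt(nl,:))
     if (abs(sum_wt - 1._r8) > eps) then
        ! This is the abort path that produces the messages quoted above.
        write(*,*) 'surfrd_veg_all ERROR: sum of wt not 1.0 at nl=', nl
        write(*,*) 'sum is:', sum_wt
        stop 1
     end if
  end do
end program check_weights
```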

ndkeen commented 2 months ago

Thanks. Wuyin also indicated that he is using a version of the code that includes the Intel compiler update.

Since we updated this (~Sep 19th), I've not seen any more errors of this sort on cdash -- and I've been running quite a few benchmark jobs on pm-cpu (and muller-cpu, almost identical) with the updated compiler version -- no errors like this yet.

I certainly didn't think the compiler version would "fix" it.

lxu16 commented 2 months ago

I ran into similar errors with the compset F20TR and resolution "ne30pg2_r05_IcoswISC30E3r5":

surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl=        3305  and t=1
sum is:   0.967455393474239     
ENDRUN:
ERROR in surfrdUtilsMod.F90 at line 75  

 ERROR: Unknown error submitted to shr_abort_abort.
 surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl=       93107  and t=           1
 sum is:   0.000000000000000E+000
 ENDRUN:
 ERROR in surfrdUtilsMod.F90 at line 75                                         

ERROR: Unknown error submitted to shr_abort_abort.
 surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl=       94312  and t=           1
sum is:   0.000000000000000E+000
ENDRUN:
ERROR in surfrdUtilsMod.F90 at line 75                                         
ERROR: Unknown error submitted to shr_abort_abort.

Any clues on how to solve this issue?

ndkeen commented 2 months ago

I actually had two 270-node F cases that failed. One of each variety:

  962:  surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl=       11555  and t=
  962:            1
  962:  sum is:    1.59000001978980     
  962:  ENDRUN:
  962:  ERROR in surfrdUtilsMod.F90 at line 75                                         

and

 9728:   ERROR: sum of areas on globe does not equal 4*pi

Case: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-sep23/f2010.piCtl.ne120pg2_r025_IcoswISC30E3r5.nofini.r0270.pb, jobid: 31201149, which does include the potential bad node 4324.
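
For context on that second error: the sum of all grid-cell areas over the globe should equal the area of the unit sphere, 4*pi. A hypothetical sketch of that kind of check (not the actual code; in the model the sum would come from an MPI reduction over all ranks):

```fortran
program global_area_check
  implicit none
  integer, parameter  :: r8  = selected_real_kind(12)
  real(r8), parameter :: pi  = 3.14159265358979323846_r8
  real(r8), parameter :: tol = 1.e-10_r8   ! illustrative tolerance
  real(r8) :: area_sum

  ! Stand-in for the globally summed cell areas, with a bad-node-sized
  ! residual (~1e-9, the same order seen in the gfr area output further below).
  area_sum = 4._r8*pi + 3.9e-9_r8

  if (abs(area_sum - 4._r8*pi) > tol) then
     write(*,*) 'ERROR: sum of areas on globe does not equal 4*pi'
     stop 1
  end if
end program global_area_check
```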

Then, while testing a potential fix I found for a different issue in init that I've been struggling with, I saw two passes with this same 270-node setup. Certainly not conclusive, but this is an easy/safe thing to try.

OK case: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-sep23/f2010.piCtl.ne120pg2_r025_IcoswISC30E3r5.nofini.r0270.pb.base.barr

The potential fix/hack is adding an MPI_Barrier before a certain MPI_Allreduce, as described here: https://github.com/E3SM-Project/E3SM/issues/6655
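
A minimal sketch of that hack, assuming one specific reduction is the trouble spot (illustrative names; not the actual call site from #6655):

```fortran
program barrier_before_allreduce
  use mpi
  implicit none
  integer :: ierr, rank
  real(8) :: local_val, global_sum

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  local_val = 1.0d0

  ! The hack: synchronize all ranks immediately before the reduction so
  ! they enter the MPI_Allreduce together.
  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  call MPI_Allreduce(local_val, global_sum, 1, MPI_DOUBLE_PRECISION, &
                     MPI_SUM, MPI_COMM_WORLD, ierr)

  if (rank == 0) write(*,*) 'global sum =', global_sum
  call MPI_Finalize(ierr)
end program barrier_before_allreduce
```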

With testing on muller-cpu, I've actually been unable to reproduce these errors (of sums not equaling 1.0) -- the only issues I've had so far are stalls/hangs. I've run 300-400 cases at different resolutions/node counts.

ndkeen commented 2 months ago

Wuyin gave me a land-only launch script that he had been using recently on pm-cpu, where he encountered the error noted above more frequently. I tried it on muller-cpu.

readonly COMPSET="2010_DATM%ERA56HR_ELM%CNPRDCTCBCTOP_SICE_SOCN_MOSART_SGLC_SWAV"
readonly RESOLUTION="ne120pg2_r025_RRSwISC6to18E3r5"

I ran it with 8, 16, 32, 64, 128, and 256 nodes and have yet to see a fail of the sort we see on pm-cpu.

I'm not yet sure what this means -- it seems a low-percentage bet that the slingshot changes on muller-cpu could be impacting this.

However, I do see hangs in init at 256 nodes. The Barrier hack noted above does not seem to fix them, but the libfabric setting does allow the run to complete every time (so far). This might be enough evidence to say the hanging-in-init issue is simply different from the sum-of-values-not-always-1 issue.

ndkeen commented 1 month ago

With my testing on muller-cpu (which again is using the newer slingshot SW coming soon to pm-cpu), I'm finding that:

a) everything works as before (for e3sm/scream) at various resolutions; no obvious measurable perf diff either
b) there are some cases (F/I cases at ne120) that are hanging at higher node counts
c) for all cases that hang, using FI_MR_CACHE_MONITOR=kdreg2 seems to resolve it
d) there may be other work-arounds we can use to avoid needing this env var, but they have not worked in all situations

Now this might not even be related to this current issue above. It's just something we should consider trying even now on pm-cpu. Can add this to config_machines.xml to affect new cases:

<env name="FI_MR_CACHE_MONITOR">kdreg2</env>

There might be a small perf impact with kdreg2, or maybe nothing -- very similar timings.

I can try to better describe what this is doing, but my understanding is that it's something newer that HPE is working on.

Just adding more info on that env var here for completeness: The default is FI_MR_CACHE_MONITOR=userfaultfd

# Define a default memory registration monitor. The monitor checks for virtual to physical memory address changes. Options are: kdreg2, memhooks, userfaultfd and disabled. Kdreg2 is supplied as a loadable Linux kernel module. Memhooks operates by intercepting memory allocation and free calls. Userfaultfd is a Linux kernel feature. 'memhooks' is the default if available on the system. The 'disabled' option disables memory caching.

ndkeen commented 1 month ago

Danqing had the same error with an F-case. jobid: 30483554 https://pace.ornl.gov/exp-details/191658

ndkeen commented 1 month ago

As with the other issue, it looks like there is at least one "bad node" on pm-cpu. If I specifically ask for nid004324, I see these errors for these test cases:

ERS.ELM_USRDAT.IELM.pm-cpu_intel.elm-surface_water_dynamics.sus/run/e3sm.log.31574808.241008-030200:  2:  surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl=        1884  and t=
ERS.f09_g16.I1850ELMCN.pm-cpu_intel.elm-bgcinterface.sus/run/e3sm.log.31574809.241008-030122:  2:  surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl=         493  and t=
ERS.r05_r05.IELM.pm-cpu_intel.elm-lnd_rof_2way.sus/run/e3sm.log.31574861.241008-030947:  2:  surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl=        2229  and t=
SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_intel.allactive-

jobids:

31574808, 31574809, 31574861

Note this compute node was not used in some of the other failing jobs above.

ndkeen commented 1 month ago

I think I have found the other bad node.
I submit we will always see this error if either of these 2 nodes is used, and will not see a crash if they are avoided:

nid006855
nid004324

To submit a job that will avoid these 2: case.submit -a="-x nid004324,nid006855"

Working with NERSC now; they have removed 4324 from the pool, but are letting me test on it.

ndkeen commented 1 month ago

Testing on the 4324 node, I have learned a few things:

1) Intel optimized cases that are affected will always fail in the same way -- still trying to learn what types of cases are affected (for example, this case does not fail: SMS_Ly2_P1x1.1x1_smallvilleIA.IELMCNCROP)
2) GNU optimized cases also fail -- but with a different error message (below)
3) with both Intel and GNU, the DEBUG cases do not fail

  2: SNICAR ERROR: negative absoption : -0.641576E-01 at timestep:      9 at column:   4053
  2:  SNICAR_AD STATS: snw_rds(0)=           55
  2:  SNICAR_AD STATS: L_snw(0)=    3.3222098821765529E-002
  2:  SNICAR_AD STATS: h2osno=    3.3222098821765529E-002  snl=           -1
  2:  SNICAR_AD STATS: soot1(0)=    0.0000000000000000     
  2:  SNICAR_AD STATS: soot2(0)=    0.0000000000000000     
  2:  SNICAR_AD STATS: dust1(0)=    0.0000000000000000     
  2:  SNICAR_AD STATS: dust2(0)=    0.0000000000000000     
  2:  SNICAR_AD STATS: dust3(0)=    0.0000000000000000     
  2:  SNICAR_AD STATS: dust4(0)=    0.0000000000000000     
  2:  calling getglobalwrite with decomp_index=         4053  and elmlevel= column
  2:  local  column   index =         4053
  2:  global column   index =       186402
  2:  global landunit index =        58074
  2:  global gridcell index =        16804
  2:  gridcell longitude    =    152.50000000000000     
  2:  gridcell latitude     =    58.900523560209386     
  2:  column   type         =            1
  2:  landunit type         =            1
  2:  ENDRUN:ERROR in /global/cfs/cdirs/e3sm/ndk/repos/nexty-sep23/components/elm/src/biogeophys/SnowSnicarMod.F90 at line 2934

  2:  ERROR: Unknown error submitted to shr_abort_abort.
  2: #0  0xd33baa in __shr_abort_mod_MOD_shr_abort_backtrace
  2:    at /global/cfs/cdirs/e3sm/ndk/repos/nexty-sep23/share/util/shr_abort_mod.F90:104
  2: #1  0xd33d80 in __shr_abort_mod_MOD_shr_abort_abort
  2:    at /global/cfs/cdirs/e3sm/ndk/repos/nexty-sep23/share/util/shr_abort_mod.F90:61
  2: #2  0x7f39ed in __snowsnicarmod_MOD_snicar_ad_rt
  2:    at /global/cfs/cdirs/e3sm/ndk/repos/nexty-sep23/components/elm/src/biogeophys/SnowSnicarMod.F90:2934
  2: #3  0x85067e in __surfacealbedomod_MOD_surfacealbedo
  2:    at /global/cfs/cdirs/e3sm/ndk/repos/nexty-sep23/components/elm/src/biogeophys/SurfaceAlbedoMod.F90:637
  2: #4  0x52e9f7 in __elm_driver_MOD_elm_drv
  2:    at /global/cfs/cdirs/e3sm/ndk/repos/nexty-sep23/components/elm/src/main/elm_driver.F90:1376
  2: #5  0x516ac9 in __lnd_comp_mct_MOD_lnd_run_mct
  2:    at /global/cfs/cdirs/e3sm/ndk/repos/nexty-sep23/components/elm/src/cpl/lnd_comp_mct.F90:617
  2: #6  0x48118a in __component_mod_MOD_component_run
  2:    at /global/cfs/cdirs/e3sm/ndk/repos/nexty-sep23/driver-mct/main/component_mod.F90:757
  2: #7  0x46fb07 in __cime_comp_mod_MOD_cime_run

ndkeen commented 1 month ago

I ran e3sm_developer only on nid004324 (where only 1-node jobs were allowed) with both intel and gnu. The idea is to verify we always get these fails on this node -- but also to see what other types of fails this node might have been causing. And is it possible that a case still continues despite a problem? How would we know?

ERIO.ne30_g16_rx1.A.pm-cpu_gnu.wnid004324ed                                                         pass                   nodes=   1 mins= 14.3 state= COMPLETED  notes=
ERIO.ne30_g16_rx1.A.pm-cpu_intel.wnid004324ed                                                       fail COMPARE_netcdf4c_ nodes=   1 mins= 11.6 state= COMPLETED  notes=
ERP_Ld3.ne4pg2_oQU480.F2010.pm-cpu_gnu.wnid004324ed                                                 fail               RUN nodes=   1 mins=  2.7 state=    FAILED  notes=
ERP_Ld3.ne4pg2_oQU480.F2010.pm-cpu_intel.wnid004324ed                                               fail               RUN nodes=   1 mins=  0.8 state=    FAILED  notes=ERROR: sum of areas on globe does not equal 4*pi
ERS.ELM_USRDAT.I1850CNPRDCTCBC.pm-cpu_gnu.elm-usrpft_codetest_I1850CNPRDCTCBC.wnid004324ed          pass                   nodes=   1 mins=  1.1 state= COMPLETED  notes=
ERS.ELM_USRDAT.I1850CNPRDCTCBC.pm-cpu_gnu.elm-usrpft_default_I1850CNPRDCTCBC.wnid004324ed           pass                   nodes=   1 mins=  0.9 state= COMPLETED  notes=
ERS.ELM_USRDAT.I1850CNPRDCTCBC.pm-cpu_intel.elm-usrpft_codetest_I1850CNPRDCTCBC.wnid004324ed        pass                   nodes=   1 mins=  1.7 state= COMPLETED  notes=
ERS.ELM_USRDAT.I1850CNPRDCTCBC.pm-cpu_intel.elm-usrpft_default_I1850CNPRDCTCBC.wnid004324ed         pass                   nodes=   1 mins=  1.3 state= COMPLETED  notes=
ERS.ELM_USRDAT.I1850ELM.pm-cpu_gnu.elm-usrdat.wnid004324ed                                          pass                   nodes=   1 mins=  1.1 state= COMPLETED  notes=
ERS.ELM_USRDAT.I1850ELM.pm-cpu_intel.elm-usrdat.wnid004324ed                                        pass                   nodes=   1 mins=  2.2 state= COMPLETED  notes=
ERS.ELM_USRDAT.IELM.pm-cpu_gnu.elm-surface_water_dynamics.wnid004324ed                              fail               RUN nodes=   1 mins=  0.8 state=    FAILED  notes=SNICAR ERROR: negative absoption
ERS.ELM_USRDAT.IELM.pm-cpu_intel.elm-surface_water_dynamics.wnid004324ed                            fail               RUN nodes=   1 mins=  0.6 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.MOS_USRDAT.RMOSGPCC.pm-cpu_gnu.mosart-mos_usrdat.wnid004324ed                                   pass                   nodes=   1 mins=  1.4 state= COMPLETED  notes=
ERS.MOS_USRDAT.RMOSGPCC.pm-cpu_intel.mosart-mos_usrdat.wnid004324ed                                 fail COMPARE_base_rest nodes=   1 mins=  1.4 state= COMPLETED  notes=
ERS.MOS_USRDAT.RMOSNLDAS.pm-cpu_gnu.mosart-sediment.wnid004324ed                                    pass                   nodes=   1 mins=  2.6 state= COMPLETED  notes=
ERS.MOS_USRDAT.RMOSNLDAS.pm-cpu_intel.mosart-sediment.wnid004324ed                                  fail COMPARE_base_rest nodes=   1 mins=  3.1 state= COMPLETED  notes=
ERS.f09_g16.I1850ELMCN.pm-cpu_gnu.elm-bgcinterface.wnid004324ed                                     fail               RUN nodes=   1 mins=  1.9 state=    FAILED  notes=SNICAR ERROR: negative absoption
ERS.f09_g16.I1850ELMCN.pm-cpu_intel.elm-bgcinterface.wnid004324ed                                   fail               RUN nodes=   1 mins=  0.9 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f09_g16.I1850GSWCNPRDCTCBC.pm-cpu_gnu.elm-vstrd.wnid004324ed                                    fail               RUN nodes=   1 mins=  1.9 state=    FAILED  notes=SNICAR ERROR: negative absoption
ERS.f09_g16.I1850GSWCNPRDCTCBC.pm-cpu_intel.elm-vstrd.wnid004324ed                                  fail               RUN nodes=   1 mins=  1.3 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f09_g16.IELMBC.pm-cpu_gnu.elm-simple_decomp.wnid004324ed                                        fail COMPARE_base_rest nodes=   1 mins=  3.6 state= COMPLETED  notes=
ERS.f09_g16.IELMBC.pm-cpu_gnu.wnid004324ed                                                          fail               RUN nodes=   1 mins=  0.8 state=    FAILED  notes=SNICAR ERROR: negative absoption
ERS.f09_g16.IELMBC.pm-cpu_intel.elm-simple_decomp.wnid004324ed                                      fail               RUN nodes=   1 mins=  1.2 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f09_g16.IELMBC.pm-cpu_intel.wnid004324ed                                                        fail               RUN nodes=   1 mins=  0.9 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f09_g16_g.MALISIA.pm-cpu_gnu.wnid004324ed                                                       fail               RUN nodes=   1 mins=  3.4 state=    FAILED  notes=
ERS.f09_g16_g.MALISIA.pm-cpu_intel.wnid004324ed                                                     fail COMPARE_base_rest nodes=   1 mins=  0.8 state= COMPLETED  notes=
ERS.f19_f19.I1850ELMCN.pm-cpu_gnu.wnid004324ed                                                      fail COMPARE_base_rest nodes=   1 mins=  1.8 state= COMPLETED  notes=
ERS.f19_f19.I1850ELMCN.pm-cpu_intel.wnid004324ed                                                    fail               RUN nodes=   1 mins=  0.6 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f19_f19.I20TRELMCN.pm-cpu_gnu.wnid004324ed                                                      fail COMPARE_base_rest nodes=   1 mins=  3.0 state= COMPLETED  notes=
ERS.f19_f19.I20TRELMCN.pm-cpu_intel.wnid004324ed                                                    fail               RUN nodes=   1 mins=  0.7 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f19_g16.I1850CNECACNTBC.pm-cpu_gnu.elm-eca.wnid004324ed                                         fail COMPARE_base_rest nodes=   1 mins=  1.9 state= COMPLETED  notes=
ERS.f19_g16.I1850CNECACNTBC.pm-cpu_intel.elm-eca.wnid004324ed                                       fail               RUN nodes=   1 mins=  0.8 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f19_g16.I1850CNECACTCBC.pm-cpu_gnu.elm-eca.wnid004324ed                                         fail COMPARE_base_rest nodes=   1 mins=  2.3 state= COMPLETED  notes=
ERS.f19_g16.I1850CNECACTCBC.pm-cpu_intel.elm-eca.wnid004324ed                                       fail               RUN nodes=   1 mins=  0.8 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f19_g16.I1850CNRDCTCBC.pm-cpu_gnu.elm-rd.wnid004324ed                                           fail COMPARE_base_rest nodes=   1 mins=  1.8 state= COMPLETED  notes=
ERS.f19_g16.I1850CNRDCTCBC.pm-cpu_intel.elm-rd.wnid004324ed                                         fail               RUN nodes=   1 mins=  0.7 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f19_g16.I1850ELM.pm-cpu_gnu.elm-betr.wnid004324ed                                               fail COMPARE_base_rest nodes=   1 mins= 13.6 state= COMPLETED  notes=
ERS.f19_g16.I1850ELM.pm-cpu_gnu.elm-vst.wnid004324ed                                                fail COMPARE_base_rest nodes=   1 mins=  2.5 state= COMPLETED  notes=
ERS.f19_g16.I1850ELM.pm-cpu_intel.elm-betr.wnid004324ed                                             fail               RUN nodes=   1 mins=  0.6 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f19_g16.I1850ELM.pm-cpu_intel.elm-vst.wnid004324ed                                              fail               RUN nodes=   1 mins=  1.2 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f19_g16.I1850GSWCNPECACNTBC.pm-cpu_gnu.elm-eca_f19_g16_I1850GSWCNPECACNTBC.wnid004324ed         fail COMPARE_base_rest nodes=   1 mins=  2.3 state= COMPLETED  notes=
ERS.f19_g16.I1850GSWCNPECACNTBC.pm-cpu_intel.elm-eca_f19_g16_I1850GSWCNPECACNTBC.wnid004324ed       fail               RUN nodes=   1 mins=  1.1 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f19_g16.I20TRGSWCNPECACNTBC.pm-cpu_gnu.elm-eca_f19_g16_I20TRGSWCNPECACNTBC.wnid004324ed         fail COMPARE_base_rest nodes=   1 mins=  2.2 state= COMPLETED  notes=
ERS.f19_g16.I20TRGSWCNPECACNTBC.pm-cpu_intel.elm-eca_f19_g16_I20TRGSWCNPECACNTBC.wnid004324ed       fail               RUN nodes=   1 mins=  0.8 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f19_g16.I20TRGSWCNPRDCTCBC.pm-cpu_gnu.elm-ctc_f19_g16_I20TRGSWCNPRDCTCBC.wnid004324ed           fail               RUN nodes=   1 mins=  1.0 state=    FAILED  notes=SNICAR ERROR: negative absoption
ERS.f19_g16.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-ctc_f19_g16_I20TRGSWCNPRDCTCBC.wnid004324ed         fail               RUN nodes=   1 mins=  0.7 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f19_g16.IERA56HRELM.pm-cpu_gnu.wnid004324ed                                                     fail COMPARE_base_rest nodes=   1 mins=  2.5 state= COMPLETED  notes=
ERS.f19_g16.IERA56HRELM.pm-cpu_intel.wnid004324ed                                                   fail               RUN nodes=   1 mins=  0.9 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f19_g16.IERA5ELM.pm-cpu_gnu.wnid004324ed                                                        fail COMPARE_base_rest nodes=   1 mins=  3.2 state= COMPLETED  notes=
ERS.f19_g16.IERA5ELM.pm-cpu_intel.wnid004324ed                                                      fail               RUN nodes=   1 mins=  1.1 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f19_g16_rx1.A.pm-cpu_gnu.wnid004324ed                                                           pass                   nodes=   1 mins=  1.1 state= COMPLETED  notes=
ERS.f19_g16_rx1.A.pm-cpu_intel.wnid004324ed                                                         fail COMPARE_base_rest nodes=   1 mins=  2.3 state= COMPLETED  notes=
ERS.ne30_g16_rx1.A.pm-cpu_gnu.wnid004324ed                                                          pass                   nodes=   1 mins=  1.2 state= COMPLETED  notes=
ERS.ne30_g16_rx1.A.pm-cpu_intel.wnid004324ed                                                        fail COMPARE_base_rest nodes=   1 mins=  2.2 state= COMPLETED  notes=
ERS.r05_r05.ICNPRDCTCBC.pm-cpu_gnu.elm-cbudget.wnid004324ed                                         fail COMPARE_base_rest nodes=   1 mins= 13.2 state= COMPLETED  notes=
ERS.r05_r05.ICNPRDCTCBC.pm-cpu_intel.elm-cbudget.wnid004324ed                                       fail               RUN nodes=   1 mins=  1.1 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_cft not 1.0
ERS.r05_r05.IELM.pm-cpu_gnu.elm-V2_ELM_MOSART_features.wnid004324ed                                 fail               RUN nodes=   1 mins=  1.9 state=    FAILED  notes=SNICAR ERROR: negative absoption
ERS.r05_r05.IELM.pm-cpu_gnu.elm-lnd_rof_2way.wnid004324ed                                           fail               RUN nodes=   1 mins=  2.4 state=    FAILED  notes=SNICAR ERROR: negative absoption
ERS.r05_r05.IELM.pm-cpu_intel.elm-V2_ELM_MOSART_features.wnid004324ed                               fail               RUN nodes=   1 mins=  1.0 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_cft not 1.0
ERS.r05_r05.IELM.pm-cpu_intel.elm-lnd_rof_2way.wnid004324ed                                         fail               RUN nodes=   1 mins=  1.6 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.r05_r05.RMOSGPCC.pm-cpu_gnu.mosart-gpcc_1972.wnid004324ed                                       pass                   nodes=   1 mins=  1.9 state= COMPLETED  notes=
ERS.r05_r05.RMOSGPCC.pm-cpu_gnu.mosart-heat.wnid004324ed                                            pass                   nodes=   1 mins=  2.0 state= COMPLETED  notes=
ERS.r05_r05.RMOSGPCC.pm-cpu_intel.mosart-gpcc_1972.wnid004324ed                                     fail COMPARE_base_rest nodes=   1 mins=  2.5 state= COMPLETED  notes=
ERS.r05_r05.RMOSGPCC.pm-cpu_intel.mosart-heat.wnid004324ed                                          fail COMPARE_base_rest nodes=   1 mins=  1.8 state= COMPLETED  notes=
ERS_D.f09_f09.IELM.pm-cpu_gnu.elm-koch_snowflake.wnid004324ed                                       pass                   nodes=   1 mins=  2.9 state= COMPLETED  notes=
ERS_D.f09_f09.IELM.pm-cpu_gnu.elm-solar_rad.wnid004324ed                                            pass                   nodes=   1 mins=  4.0 state= COMPLETED  notes=
ERS_D.f09_f09.IELM.pm-cpu_intel.elm-koch_snowflake.wnid004324ed                                     pass                   nodes=   1 mins=  4.2 state= COMPLETED  notes=
ERS_D.f09_f09.IELM.pm-cpu_intel.elm-solar_rad.wnid004324ed                                          pass                   nodes=   1 mins=  4.4 state= COMPLETED  notes=
ERS_D.f09_g16.I1850ELMCN.pm-cpu_gnu.wnid004324ed                                                    pass                   nodes=   1 mins=  4.3 state= COMPLETED  notes=
ERS_D.f09_g16.I1850ELMCN.pm-cpu_intel.wnid004324ed                                                  pass                   nodes=   1 mins=  8.2 state= COMPLETED  notes=
ERS_D.f19_f19.IELM.pm-cpu_gnu.elm-ic_f19_f19_ielm.wnid004324ed                                      pass                   nodes=   1 mins=  1.7 state= COMPLETED  notes=
ERS_D.f19_f19.IELM.pm-cpu_intel.elm-ic_f19_f19_ielm.wnid004324ed                                    pass                   nodes=   1 mins=  2.6 state= COMPLETED  notes=
ERS_D.f19_g16.I1850GSWCNPRDCTCBC.pm-cpu_gnu.elm-ctc_f19_g16_I1850GSWCNPRDCTCBC.wnid004324ed         pass                   nodes=   1 mins=  2.5 state= COMPLETED  notes=
ERS_D.f19_g16.I1850GSWCNPRDCTCBC.pm-cpu_intel.elm-ctc_f19_g16_I1850GSWCNPRDCTCBC.wnid004324ed       pass                   nodes=   1 mins=  3.8 state= COMPLETED  notes=
ERS_D.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-hommexx.wnid004324ed                                       pass                   nodes=   1 mins=  7.8 state= COMPLETED  notes=
ERS_D.ne4pg2_oQU480.F2010.pm-cpu_intel.eam-hommexx.wnid004324ed                                     pass                   nodes=   1 mins= 11.3 state= COMPLETED  notes=
ERS_D.ne4pg2_oQU480.I20TRELM.pm-cpu_gnu.elm-disableDynpftCheck.wnid004324ed                         pass                   nodes=   1 mins=  1.4 state= COMPLETED  notes=
ERS_D.ne4pg2_oQU480.I20TRELM.pm-cpu_intel.elm-disableDynpftCheck.wnid004324ed                       pass                   nodes=   1 mins=  1.7 state= COMPLETED  notes=
ERS_D_Ld15.f45_g37.IELMFATES.pm-cpu_gnu.elm-fates_cold_treedamage.wnid004324ed                      pass                   nodes=   1 mins=  2.7 state= COMPLETED  notes=
ERS_D_Ld15.f45_g37.IELMFATES.pm-cpu_intel.elm-fates_cold_treedamage.wnid004324ed                    pass                   nodes=   1 mins=  3.5 state= COMPLETED  notes=
ERS_Ld20.f45_f45.IELMFATES.pm-cpu_gnu.elm-fates.wnid004324ed                                        fail COMPARE_base_rest nodes=   1 mins=  2.4 state= COMPLETED  notes=
ERS_Ld20.f45_f45.IELMFATES.pm-cpu_intel.elm-fates.wnid004324ed                                      fail               RUN nodes=   1 mins=  0.8 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS_Ld3.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-thetahy_sl_pg2.wnid004324ed                              fail COMPARE_base_rest nodes=   1 mins=  2.2 state= COMPLETED  notes=
ERS_Ld3.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-thetahy_sl_pg2_ftype0.wnid004324ed                       fail COMPARE_base_rest nodes=   1 mins=  3.0 state= COMPLETED  notes=
ERS_Ld3.ne4pg2_oQU480.F2010.pm-cpu_intel.eam-thetahy_sl_pg2.wnid004324ed                            fail               RUN nodes=   1 mins=  0.6 state=    FAILED  notes=ERROR: sum of areas on globe does not equal 4*pi
ERS_Ld3.ne4pg2_oQU480.F2010.pm-cpu_intel.eam-thetahy_sl_pg2_ftype0.wnid004324ed                     fail               RUN nodes=   1 mins=  0.4 state=    FAILED  notes=ERROR: sum of areas on globe does not equal 4*pi
ERS_Ld30.f45_f45.IELMFATES.pm-cpu_gnu.elm-fates_satphen.wnid004324ed                                fail               RUN nodes=   1 mins=  0.8 state=    FAILED  notes=SNICAR ERROR: negative absoption
ERS_Ld30.f45_f45.IELMFATES.pm-cpu_intel.elm-fates_satphen.wnid004324ed                              fail               RUN nodes=   1 mins=  0.9 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS_Ld30.f45_g37.IELMFATES.pm-cpu_gnu.elm-fates_cold_sizeagemort.wnid004324ed                       fail COMPARE_base_rest nodes=   1 mins=  3.0 state= COMPLETED  notes=
ERS_Ld30.f45_g37.IELMFATES.pm-cpu_intel.elm-fates_cold_sizeagemort.wnid004324ed                     fail               RUN nodes=   1 mins=  0.9 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS_Ld5.T62_oQU120.CMPASO-NYF.pm-cpu_gnu.wnid004324ed                                               pass                   nodes=   1 mins=  2.7 state= COMPLETED  notes=
ERS_Ld5.T62_oQU120.CMPASO-NYF.pm-cpu_intel.wnid004324ed                                             fail COMPARE_base_rest nodes=   1 mins=  3.4 state= COMPLETED  notes=
ERS_Ld5.T62_oQU240.DTESTM.pm-cpu_gnu.wnid004324ed                                                   pass                   nodes=   1 mins=  1.4 state= COMPLETED  notes=
ERS_Ld5.T62_oQU240.DTESTM.pm-cpu_intel.wnid004324ed                                                 fail               RUN nodes=   1 mins=  0.6 state=    FAILED  notes=
ERS_Ld5.T62_oQU240wLI.GMPAS-DIB-IAF-DISMF.pm-cpu_gnu.wnid004324ed                                   pass                   nodes=   1 mins=  2.5 state= COMPLETED  notes=
ERS_Ld5.T62_oQU240wLI.GMPAS-DIB-IAF-DISMF.pm-cpu_intel.wnid004324ed                                 fail               RUN nodes=   1 mins=  1.9 state=    FAILED  notes=
ERS_Ld5.T62_oQU240wLI.GMPAS-DIB-IAF-PISMF.pm-cpu_gnu.wnid004324ed                                   pass                   nodes=   1 mins=  2.2 state= COMPLETED  notes=
ERS_Ld5.T62_oQU240wLI.GMPAS-DIB-IAF-PISMF.pm-cpu_intel.wnid004324ed                                 fail               RUN nodes=   1 mins=  1.7 state=    FAILED  notes=
ERS_Ld5.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.pm-cpu_gnu.mpaso-ocn_glcshelf.wnid004324ed       fail               RUN nodes=   1 mins=  1.6 state=    FAILED  notes=
ERS_Ld5.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.pm-cpu_intel.mpaso-ocn_glcshelf.wnid004324ed     fail               RUN nodes=   1 mins=  1.5 state=    FAILED  notes=
ERS_Ln9.ne4pg2_ne4pg2.F2010-MMF1.pm-cpu_gnu.eam-mmf_crmout.wnid004324ed                             fail COMPARE_base_rest nodes=   1 mins=  3.8 state= COMPLETED  notes=
ERS_Ln9.ne4pg2_ne4pg2.F2010-MMF1.pm-cpu_intel.eam-mmf_crmout.wnid004324ed                           fail               RUN nodes=   1 mins=  0.4 state=    FAILED  notes=ERROR: sum of areas on globe does not equal 4*pi
NCK.f19_g16_rx1.A.pm-cpu_gnu.wnid004324ed                                                           pass                   nodes=   1 mins=  1.4 state= COMPLETED  notes=
NCK.f19_g16_rx1.A.pm-cpu_intel.wnid004324ed                                                         pass                   nodes=   1 mins=  2.9 state= COMPLETED  notes=
PEM_Ln5.T62_oQU240wLI.DTESTM.pm-cpu_gnu.wnid004324ed                                                fail COMPARE_base_modp nodes=   1 mins=  2.2 state= COMPLETED  notes=
PEM_Ln5.T62_oQU240wLI.DTESTM.pm-cpu_intel.wnid004324ed                                              fail               RUN nodes=   1 mins=  1.4 state=    FAILED  notes=
PEM_Ln5.T62_oQU240wLI.GMPAS-DIB-IAF-DISMF.pm-cpu_gnu.wnid004324ed                                   fail COMPARE_base_modp nodes=   1 mins=  1.9 state= COMPLETED  notes=
PEM_Ln5.T62_oQU240wLI.GMPAS-DIB-IAF-DISMF.pm-cpu_intel.wnid004324ed                                 fail               RUN nodes=   1 mins=  2.1 state=    FAILED  notes=
PEM_Ln5.T62_oQU240wLI.GMPAS-DIB-IAF-PISMF.pm-cpu_gnu.wnid004324ed                                   fail COMPARE_base_modp nodes=   1 mins=  1.7 state= COMPLETED  notes=
PEM_Ln5.T62_oQU240wLI.GMPAS-DIB-IAF-PISMF.pm-cpu_intel.wnid004324ed                                 fail               RUN nodes=   1 mins=  1.9 state=    FAILED  notes=
PET_Ln5.T62_oQU240.DTESTM.pm-cpu_gnu.wnid004324ed                                                   pass                   nodes=   1 mins=  1.3 state= COMPLETED  notes=
PET_Ln5.T62_oQU240.DTESTM.pm-cpu_intel.wnid004324ed                                                 fail COMPARE_base_sing nodes=   1 mins=  0.9 state= COMPLETED  notes=
PET_Ln5.T62_oQU240wLI.GMPAS-DIB-IAF-DISMF.pm-cpu_gnu.wnid004324ed                                   pass                   nodes=   1 mins=  1.4 state= COMPLETED  notes=
PET_Ln5.T62_oQU240wLI.GMPAS-DIB-IAF-DISMF.pm-cpu_intel.wnid004324ed                                 fail COMPARE_base_sing nodes=   1 mins=  0.8 state= COMPLETED  notes=
PET_Ln5.T62_oQU240wLI.GMPAS-DIB-IAF-PISMF.pm-cpu_gnu.wnid004324ed                                   pass                   nodes=   1 mins=  1.2 state= COMPLETED  notes=
PET_Ln5.T62_oQU240wLI.GMPAS-DIB-IAF-PISMF.pm-cpu_intel.wnid004324ed                                 fail COMPARE_base_sing nodes=   1 mins=  1.1 state= COMPLETED  notes=
SEQ.f19_g16.X.pm-cpu_gnu.wnid004324ed                                                               pass                   nodes=   1 mins=  2.5 state= COMPLETED  notes=
SEQ.f19_g16.X.pm-cpu_intel.wnid004324ed                                                             fail  COMPARE_base_seq nodes=   1 mins=  2.6 state= COMPLETED  notes=
SMS.MOS_USRDAT.RMOSGPCC.pm-cpu_gnu.mosart-unstructure.wnid004324ed                                  pass                   nodes=   1 mins=  0.9 state= COMPLETED  notes=
SMS.MOS_USRDAT.RMOSGPCC.pm-cpu_intel.mosart-unstructure.wnid004324ed                                pass                   nodes=   1 mins=  0.9 state= COMPLETED  notes=
SMS.ne30_f19_g16_rx1.A.pm-cpu_gnu.wnid004324ed                                                      pass                   nodes=   1 mins=  0.9 state= COMPLETED  notes=
SMS.ne30_f19_g16_rx1.A.pm-cpu_intel.wnid004324ed                                                    pass                   nodes=   1 mins=  0.7 state= COMPLETED  notes=
SMS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-cosplite.wnid004324ed                                        fail               RUN nodes=   1 mins=  1.0 state=    FAILED  notes=
SMS.ne4pg2_oQU480.F2010.pm-cpu_intel.eam-cosplite.wnid004324ed                                      fail               RUN nodes=   1 mins=  0.8 state=    FAILED  notes=ERROR: sum of areas on globe does not equal 4*pi
SMS.r05_r05.I1850ELMCN.pm-cpu_gnu.elm-qian_1948.wnid004324ed                                        fail               RUN nodes=   1 mins=  1.2 state=    FAILED  notes=SNICAR ERROR: negative absoption
SMS.r05_r05.I1850ELMCN.pm-cpu_intel.elm-qian_1948.wnid004324ed                                      fail               RUN nodes=   1 mins=  1.2 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
SMS.r05_r05.IELM.pm-cpu_gnu.elm-topounit.wnid004324ed                                               fail               RUN nodes=   1 mins=  2.0 state=    FAILED  notes=SNICAR ERROR: negative absoption
SMS.r05_r05.IELM.pm-cpu_intel.elm-topounit.wnid004324ed                                             fail               RUN nodes=   1 mins=  0.8 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
SMS_D_Ld1.TL319_IcoswISC30E3r5.DTESTM-JRA1p5.pm-cpu_gnu.mpassi-jra_1958.wnid004324ed                pass                   nodes=   1 mins=  3.6 state= COMPLETED  notes=
SMS_D_Ld1.TL319_IcoswISC30E3r5.DTESTM-JRA1p5.pm-cpu_intel.mpassi-jra_1958.wnid004324ed              pass                   nodes=   1 mins=  6.9 state= COMPLETED  notes=
SMS_D_Ld1.TL319_IcoswISC30E3r5.GMPAS-JRA1p5-DIB-PISMF.pm-cpu_gnu.mpaso-jra_1958.wnid004324ed        pass                   nodes=   1 mins=  7.8 state= COMPLETED  notes=
SMS_D_Ld1.TL319_IcoswISC30E3r5.GMPAS-JRA1p5-DIB-PISMF.pm-cpu_intel.mpaso-jra_1958.wnid004324ed      pass                   nodes=   1 mins= 17.3 state= COMPLETED  notes=
SMS_D_Ld20.f45_f45.IELMFATES.pm-cpu_gnu.elm-fates_rd.wnid004324ed                                   pass                   nodes=   1 mins=  2.3 state= COMPLETED  notes=
SMS_D_Ld20.f45_f45.IELMFATES.pm-cpu_intel.elm-fates_rd.wnid004324ed                                 pass                   nodes=   1 mins=  3.7 state= COMPLETED  notes=
SMS_D_Ln5.ne4pg2_oQU480.F2010.pm-cpu_gnu.wnid004324ed                                               pass                   nodes=   1 mins=  1.2 state= COMPLETED  notes=
SMS_D_Ln5.ne4pg2_oQU480.F2010.pm-cpu_intel.wnid004324ed                                             pass                   nodes=   1 mins=  1.2 state= COMPLETED  notes=
SMS_Ld20.f45_f45.IELMFATES.pm-cpu_gnu.elm-fates_eca.wnid004324ed                                    pass                   nodes=   1 mins=  1.6 state= COMPLETED  notes=
SMS_Ld20.f45_f45.IELMFATES.pm-cpu_intel.elm-fates_eca.wnid004324ed                                  fail               RUN nodes=   1 mins=  0.9 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
SMS_Ld5_PS.f19_g16.IELMFATES.pm-cpu_gnu.elm-fates_cold.wnid004324ed                                 fail               RUN nodes=   1 mins=  1.2 state=    FAILED  notes=SNICAR ERROR: negative absoption
SMS_Ld5_PS.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.wnid004324ed                               fail               RUN nodes=   1 mins=  1.1 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
SMS_Ln5.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-thetahy_pg2.wnid004324ed                                 fail               RUN nodes=   1 mins=  1.9 state=    FAILED  notes=bad state in EOS
SMS_Ln5.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-thetahy_sl_pg2.wnid004324ed                              pass                   nodes=   1 mins=  1.6 state= COMPLETED  notes=
SMS_Ln5.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-thetahy_sl_pg2_ftype0.wnid004324ed                       pass                   nodes=   1 mins=  1.0 state= COMPLETED  notes=
SMS_Ln5.ne4pg2_oQU480.F2010.pm-cpu_gnu.wnid004324ed                                                 pass                   nodes=   1 mins=  1.7 state= COMPLETED  notes=
SMS_Ln5.ne4pg2_oQU480.F2010.pm-cpu_intel.eam-thetahy_pg2.wnid004324ed                               fail               RUN nodes=   1 mins=  0.7 state=    FAILED  notes=ERROR: sum of areas on globe does not equal 4*pi
SMS_Ln5.ne4pg2_oQU480.F2010.pm-cpu_intel.eam-thetahy_sl_pg2.wnid004324ed                            fail               RUN nodes=   1 mins=  1.1 state=    FAILED  notes=ERROR: sum of areas on globe does not equal 4*pi
SMS_Ln5.ne4pg2_oQU480.F2010.pm-cpu_intel.eam-thetahy_sl_pg2_ftype0.wnid004324ed                     fail               RUN nodes=   1 mins=  0.5 state=    FAILED  notes=ERROR: sum of areas on globe does not equal 4*pi
SMS_Ln5.ne4pg2_oQU480.F2010.pm-cpu_intel.wnid004324ed                                               fail               RUN nodes=   1 mins=  0.4 state=    FAILED  notes=ERROR: sum of areas on globe does not equal 4*pi
SMS_Ln9.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-outfrq9s.wnid004324ed                                    pass                   nodes=   1 mins=  1.1 state= COMPLETED  notes=
SMS_Ln9.ne4pg2_oQU480.F2010.pm-cpu_intel.eam-outfrq9s.wnid004324ed                                  fail               RUN nodes=   1 mins=  0.7 state=    FAILED  notes=ERROR: sum of areas on globe does not equal 4*pi
SMS_Ln9_P24x1.ne4_ne4.FDPSCREAM-ARM97.pm-cpu_gnu.wnid004324ed                                       pass                   nodes=   1 mins=  1.1 state= COMPLETED  notes=
SMS_Ln9_P24x1.ne4_ne4.FDPSCREAM-ARM97.pm-cpu_intel.wnid004324ed                                     fail               RUN nodes=   1 mins=  0.8 state=    FAILED  notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
SMS_Ly2_P1x1.1x1_smallvilleIA.IELMCNCROP.pm-cpu_gnu.elm-fan.wnid004324ed                            pass                   nodes=   1 mins=  7.2 state= COMPLETED  notes=
SMS_Ly2_P1x1.1x1_smallvilleIA.IELMCNCROP.pm-cpu_gnu.elm-force_netcdf_pio.wnid004324ed               pass                   nodes=   1 mins=  7.1 state= COMPLETED  notes=
SMS_Ly2_P1x1.1x1_smallvilleIA.IELMCNCROP.pm-cpu_gnu.elm-per_crop.wnid004324ed                       pass                   nodes=   1 mins=  7.2 state= COMPLETED  notes=
SMS_Ly2_P1x1.1x1_smallvilleIA.IELMCNCROP.pm-cpu_intel.elm-fan.wnid004324ed                          pass                   nodes=   1 mins=  6.2 state= COMPLETED  notes=
SMS_Ly2_P1x1.1x1_smallvilleIA.IELMCNCROP.pm-cpu_intel.elm-force_netcdf_pio.wnid004324ed             pass                   nodes=   1 mins=  6.5 state= COMPLETED  notes=
SMS_Ly2_P1x1.1x1_smallvilleIA.IELMCNCROP.pm-cpu_intel.elm-per_crop.wnid004324ed                     pass                   nodes=   1 mins=  6.1 state= COMPLETED  notes=
SMS_Ly2_P1x1_D.1x1_smallvilleIA.IELMCNCROP.pm-cpu_gnu.elm-lulcc_sville.wnid004324ed                 pass                   nodes=   1 mins=  9.7 state= COMPLETED  notes=
SMS_Ly2_P1x1_D.1x1_smallvilleIA.IELMCNCROP.pm-cpu_intel.elm-lulcc_sville.wnid004324ed               pass                   nodes=   1 mins= 11.6 state= COMPLETED  notes=
SMS_P12x2.ne4pg2_oQU480.WCYCL1850NS.pm-cpu_gnu.allactive-mach_mods.wnid004324ed                     fail               RUN nodes=   1 mins=  0.6 state=    FAILED  notes=ERROR: sum of areas on globe does not equal 4*pi
SMS_P12x2.ne4pg2_oQU480.WCYCL1850NS.pm-cpu_intel.allactive-mach_mods.wnid004324ed                   fail               RUN nodes=   1 mins=  0.5 state=    FAILED  notes=ERROR: sum of areas on globe does not equal 4*pi
SMS_R_Ld5.ne4_ne4.FSCM-ARM97.pm-cpu_gnu.eam-scm.wnid004324ed                                        pass                   nodes=   1 mins=  0.8 state= COMPLETED  notes=
SMS_R_Ld5.ne4_ne4.FSCM-ARM97.pm-cpu_intel.eam-scm.wnid004324ed                                      pass                   nodes=   1 mins=  0.8 state= COMPLETED  notes=

lo.txt

For example, with the test ERS_Ld5.T62_oQU240wLI.GMPAS-DIB-IAF-DISMF.pm-cpu_intel, it does fail, but it's not obvious why.

  2: MPICH ERROR [Rank 2] [job id 31613879.0] [Tue Oct  8 23:00:23 2024] [nid004324] - Abort(1) (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
  2: 
  2: aborting job:
  2: application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
srun: error: nid004324: task 2: Exited with exit code 255
srun: Terminating StepId=31613879.0
  0: slurmstepd: error: *** STEP 31613879.0 ON nid004324 CANCELLED AT 2024-10-09T06:00:24 ***
  0: forrtl: error (78): process killed (SIGTERM)
  0: Image              PC                Routine            Line        Source             
  0: libpthread-2.31.s  00001479B7FEF910  Unknown               Unknown  Unknown
  0: libmpi_intel.so.1  00001479B9F6FB46  Unknown               Unknown  Unknown
  0: libmpi_intel.so.1  00001479B8CF9EE9  Unknown               Unknown  Unknown
  0: libmpi_intel.so.1  00001479B989B926  Unknown               Unknown  Unknown
  0: libmpi_intel.so.1  00001479B989FE29  Unknown               Unknown  Unknown
  0: libmpi_intel.so.1  00001479B97E55AA  Unknown               Unknown  Unknown
  0: libmpi_intel.so.1  00001479B831F8FC  Unknown               Unknown  Unknown
  0: libmpi_intel.so.1  00001479B9A1E700  Unknown               Unknown  Unknown
  0: libmpi_intel.so.1  00001479B81BE27C  PMPI_Allreduce        Unknown  Unknown
  0: libmpigf.so.4      00001479BA843856  mpi_allreduce_        Unknown  Unknown
  0: e3sm.exe           000000000152B3DD  mpas_dmpar_mp_mpa         783  mpas_dmpar.f90
  0: e3sm.exe           000000000085C835  seaice_error_mp_s         114  mpas_seaice_error.f90
  0: e3sm.exe           00000000007A7105  seaice_icepack_mp        2073  mpas_seaice_icepack.f90
  0: e3sm.exe           000000000078F1CE  seaice_icepack_mp        1067  mpas_seaice_icepack.f90
  0: e3sm.exe           0000000000637773  seaice_time_integ         151  mpas_seaice_time_integration.f90
  0: e3sm.exe           0000000000559442  ice_comp_mct_mp_i        1163  ice_comp_mct.f90
  0: e3sm.exe           000000000045E8CE  component_mod_mp_         757  component_mod.F90
  0: e3sm.exe           00000000004380D9  cime_comp_mod_mp_        2951  cime_comp_mod.F90
  0: e3sm.exe           000000000045E562  MAIN__                    153  cime_driver.F90

ambrad commented 1 month ago

Re: ERS_Ld5.T62_oQU240wLI.GMPAS-DIB-IAF-DISMF.pm-cpu_intel, does the MPAS seaice log show anything? There might also be MPAS error files that give details. I base this guess on the stack trace you posted.

ndkeen commented 1 month ago

Ah yep, I thought I had checked. Indeed that case has a seaice log error:

ERROR:  (picard_nonconvergence)-------------------------------------
ERROR:  (picard_nonconvergence)picard convergence failed!
ERROR:  (picard_nonconvergence)           0  -21.8443537466552       -22.1019399998646
ERROR:  (picard_nonconvergence)           1  -4.84682916474659       -21.9086932337995       -125428584.244632       -113507475.175262
ERROR:  (picard_nonconvergence)           2  -4.84682916474659       -21.5262394677792       -125161364.142522       -113507475.175262
ERROR:  (picard_nonconvergence)           3  -4.84682916474659       -21.1477349136505       -124896903.351206       -113507475.175262
ERROR:  (picard_nonconvergence)           4  -4.84682916474659       -20.7730899522568       -124635139.253861       -113507475.175262
ERROR:  (picard_nonconvergence)           5  -4.84682916474659       -20.4022189507429       -124376012.018887       -113507475.175262
ERROR:  (picard_nonconvergence)           1  -18.5703089534524       -18.8159095325175       0.366184490084448       0.366184490084448       1.808809718688001E-003  -342054634.279611       -341576832.929071
ERROR:  (picard_nonconvergence)           2  -15.8298794934608       -16.0762390558475        1.48618427448318        1.48618427448318       8.074397048395444E-003  -335024885.309945       -334542275.786110
ERROR:  (picard_nonconvergence)           3  -13.1554654273665       -13.3969516893771        2.36552601519627        2.36552601519627       1.431183578826059E-002  -328047941.626283       -327571627.681340
ERROR:  (picard_nonconvergence)           4  -10.5490550853654       -10.7786000882985        2.87146716085510        2.87146716085510       1.964945227506791E-002  -321374283.764522       -320918878.605515
ERROR:  (picard_nonconvergence)           5  -7.93900204014272       -8.21665519859031        3.49976815271111        3.49976815271111       2.776340862842379E-002  -313952523.194647       -313396811.502956
ERROR:  (picard_nonconvergence)           6  -4.83260701839303       -5.68482374194324        6.14450298799563        6.10270310370290       7.494685939048813E-002  -295025826.915040       -293233360.461115
ERROR:  (picard_nonconvergence)           7  -2.11794460353807       -3.08830351383280        12.5529104685063        12.3348506476065       0.333274257883525       -212237560.738040       -209655569.373201
ERROR:  (picard_nonconvergence)-------------------------------------
ERROR: (picard_solver) picard_solver: Picard solver non-convergence
ERROR:  (icepack_warnings_setabort) T
ERROR:   (icepack_warnings_setabort) T :file /global/cfs/cdirs/e3sm/ndk/repos/nexty-sep23/components/mpas-framework/src/core_seaice/icepack/columnphysics/icepack_therm_mushy.F90
ERROR:    (icepack_warnings_setabort) T :file /global/cfs/cdirs/e3sm/ndk/repos/nexty-sep23/components/mpas-framework/src/core_seaice/icepack/columnphysics/icepack_therm_mushy.F90 :line         1335
ERROR: (icepack_warnings_aborted) ... (picard_solver)
ERROR: (icepack_warnings_aborted) ... (two_stage_solver_snow)
ERROR: (icepack_warnings_aborted) ... (temperature_changes_salinity)
ERROR:  (temperature_changes_salinity)temperature_changes_salinity: Picard solver non-convergence (snow)
ERROR: (icepack_warnings_aborted) ... (thermo_vertical)
ERROR: (icepack_warnings_aborted) ... (icepack_step_therm1)
ERROR:  (icepack_step_therm1) ice: Vertical thermo error, cat            1
CRITICAL ERROR: icepack aborted

All of these tests are in /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-sep23, but mixed in with other stuff I was doing. Can probably just look at the cases I labelled with something like wnid004324.

mahf708 commented 1 month ago

Random idea: all gnu cases fail with SNICAR ERROR: negative absoption (also, there's a typo in that error message)... could we try to run some of these tests with SNICAR AD disabled, to see what happens? The namelist parameter is use_snicar_ad for land (there's SNICAR stuff in mpassi too, but that appears to be governed by a different namelist parameter, config_use_snicar_ad).
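
For anyone trying that, a minimal sketch of the change in a case's user_nl_elm (hypothetical; assuming use_snicar_ad is the right knob, per the above):

```fortran
! user_nl_elm: fall back from the SNICAR-AD scheme to see if the
! failures change character (hypothetical test, per the suggestion above)
use_snicar_ad = .false.
```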

ndkeen commented 1 month ago

Still debugging this. I learned that even F2010-CICE.ne4pg2_oQU480 will hit the issue on the affected node. Looking at log files, the earliest difference I see is with Vth. Printing out areas, they are different on certain cores. I can run the case with 4 or fewer MPI's. I can adjust flags to get the case not to crash, but when I then looked at the values, they are not BFB.

I see a gfr%check flag in gfr_init, which I set manually so that the subroutine check_areas is called.

Normal node:

 gfr> Running with dynamics and physics on separate grids (physgrid).
gfr> init nphys  2 check 1 boost_pg1 F
 gfr> area fv raw   12.5663706143592       1.413579858428230E-016
 gfr> area fv adj  4.240739575284689E-016  0.000000000000000E+000
 gfr> area gll     4.240739575284689E-016

bad node:

 gfr> Running with dynamics and physics on separate grids (physgrid).
gfr> init nphys  2 check 1 boost_pg1 F
 gfr> area fv raw   12.5663706143592       1.413579858428230E-016
 gfr> area fv adj  3.876668690154838E-009  0.000000000000000E+000
 gfr> area gll     3.876668690154838E-009

Drilling down a little more, I added writes in this function:

  function gfr_f_get_area(ie, i, j) result(area)
    ! Get the area of FV point i,j in element ie.

    integer, intent(in) :: ie, i, j
    real(kind=real_kind) :: area

    integer :: k

    k = gfr%nphys*(j-1) + i
    write(*,'(a,i8,es28.15)') "ndk gfr_f_get_area gfr%fv_metdet(k,ie)=", &
         k,  gfr%fv_metdet(k,ie)
    area = gfr%w_ff(k)*gfr%fv_metdet(k,ie)
  end function gfr_f_get_area

A job on the bad node is different than one on a normal node. With 96 MPI's, it's always rank 2 that is different. Nothing is obviously wrong in the code -- what I've been trying to do is create a simple stand-alone reproducer. I have been unable to do so, as those tests always seem fine. I have only seen the issue with the e3sm app.
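
The kind of stand-alone probe I mean is along these lines (a hypothetical sketch, not the actual reproducer): every rank does identical deterministic arithmetic, and a min/max reduction exposes any rank whose result differs.

```fortran
program rank_consistency_probe
  use mpi
  implicit none
  integer :: ierr, rank, i
  real(8) :: x, xmin, xmax

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Every rank performs the same deterministic arithmetic...
  x = 0.0d0
  do i = 1, 1000000
     x = x + sin(dble(i))*1.0d-6
  end do

  ! ...so the min and max over ranks should be bit-identical; any spread
  ! points at a rank (i.e., a core/node) whose arithmetic went wrong.
  call MPI_Allreduce(x, xmin, 1, MPI_DOUBLE_PRECISION, MPI_MIN, MPI_COMM_WORLD, ierr)
  call MPI_Allreduce(x, xmax, 1, MPI_DOUBLE_PRECISION, MPI_MAX, MPI_COMM_WORLD, ierr)
  if (rank == 0 .and. xmin /= xmax) write(*,*) 'rank-dependent result:', xmin, xmax

  call MPI_Finalize(ierr)
end program rank_consistency_probe
```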

ndkeen commented 1 month ago

Did a little more debugging (and tried to make a stand-alone reproducer) before I admitted defeat.
It looks like something happens to the values in an array when a function is called -- but only in an optimized build and, apparently, only on MPI rank 2 (for the 96-way case I was trying).

NERSC has moved the 4324 node to DEBUG state and will either run some more tests or ask if HPE can.