ndkeen opened 5 months ago
I see this again with ERS_Ld30.f45_f45.IELMFATES.pm-cpu_intel.elm-fates_satphen in the Aug 27th testing of next.
I also ran into this error twice while running a v3.HR F2010 case (out of 6 recent submissions).
First case: job id 29807591
1442: surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl= 17311 and t=
1442: 1
1442: sum is: 2.00000068306984
1442: ENDRUN:
1442: ERROR in surfrdUtilsMod.F90 at line 75
Second case: job id 29997275.240831
28750: surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl= 345011 and t=
28750: 1
28750: sum is: 1.83000000902540
28750: ENDRUN:
28750: ERROR in surfrdUtilsMod.F90 at line 75
/pscratch/sd/t/terai/E3SMv3_dev/20240823.v3.F2010-TMSOROC05-Z0015_plus4K.ne120pg2_r025_icos30.oro_conv_gw_tunings.pm-cpu/run
NDK: Of these two job IDs, the first includes nid004324; the second does not.
And here are the ne512 land-only spin-up runs that encountered the same error. The failure occurs randomly: resubmitting (sometimes more than once) gets past it, and in repeated failures the reported error can occur at a different column with a different sum. All failures occur during initialization. The first number after the job ID is the process element ID, as seen in e3sm.log. The runs used pm-cpu.
28289399.240719-160327 28746: surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl= 1654999
28490072.240724-142113 1154: surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl= 66493
29745228.240827-193708 15562: surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl= 895981
29906097.240829-064757 1282: surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl= 73863
29425140.240816-021759 15818: surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl= 1819771
NDK: It was very helpful that Wuyin included the job IDs here.
Of those five jobs, 28490072 and 29906097 included the potential bad node, while 28289399, 29745228, and 29425140 did not.
Minor update: it is starting to look like this issue (as well as the other similar one linked) may not depend on the test case (i.e., compset/resolution). I don't think it will be easy to reproduce with the smaller tests, as the failure seems rare. But since Wuyin reports several occurrences with his setup, there may be a greater chance of hitting the error there (possibly simply due to the larger number of MPI ranks?). I could either try to reproduce what Wuyin is doing, or simply run the cases above with more MPI ranks.
I'm also updating the Intel compiler (with other module version changes) in https://github.com/E3SM-Project/E3SM/pull/6596, so I will try a few tests with that (but again, if the failure is rare, it may not be easy to tell whether this has any impact at all).
With next of 9/15, the following tests hit this error:
ERS.ELM_USRDAT.IELM.pm-cpu_intel.elm-surface_water_dynamics
ERS.f09_g16.I1850ELMCN.pm-cpu_intel.elm-bgcinterface
ERS_Vmoab.ne4pg2_oQU480.WCYCL1850NS.pm-cpu_intel
ERS.r05_r05.IELM.pm-cpu_intel.elm-lnd_rof_2way
SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.pm-cpu_intel.allactive-wcprodssp
PET_Ln9_PS.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_intel.allactive-mach-pet (case2run)
As several tests hit this error (normally zero, every now and then one), I tried to see if I could repeat it with one of the one-node tests above, ERS.ELM_USRDAT.IELM.pm-cpu_intel.elm-surface_water_dynamics. I launched 32 of these tests with a Sep 6th checkout of next and another 32 with my branch that updates the Intel compiler. All of them passed (since this is an ERS test, each one goes through init twice, so that's 128 passes through init without issue).
I also tried several tests with ERS.r05_r05.IELM.pm-cpu_intel.elm-lnd_rof_2way, as well as the same test using more tasks, ERS_P1024.r05_r05.IELM.pm-cpu_intel.elm-lnd_rof_2way. All passing so far:
About 24 cases of ERS.r05_r05.IELM.pm-cpu_intel.elm-lnd_rof_2way on next of Sep 6th.
Another 24 cases of ERS.r05_r05.IELM.pm-cpu_intel.elm-lnd_rof_2way using the newer Intel compiler.
About 15 cases of ERS_P1024.r05_r05.IELM.pm-cpu_intel.elm-lnd_rof_2way on the Sep 6th next.
About 10 cases of the same P1024 test using the newer Intel compiler.
All passing.
Just to document here that another submission of an F2010 case at ne120 ran into the error:
/pscratch/sd/t/terai/E3SMv3_dev/20240823.v3.F2010-TMSOROC05-Z0015.ne120pg2_r025_icos30.Nomassfluxadj.pm-cpu/run/e3sm.log.30555710.240918-042651
122: surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl= 1475 and t=
122: 1
122: sum is: 1.93000027398218
122: ENDRUN:
122: ERROR in surfrdUtilsMod.F90 at line 75
122: ERROR: Unknown error submitted to shr_abort_abort.
NDK: noting this job includes the bad node nid004324.
Note that this appears to be more than one type of error.
The issue was created for incidents of surfrd_veg_all ERROR: sum of wt_cft not 1.0.
Recently, with higher resolution, we mainly see sum of wt_nat_patch not 1.0, at high frequency.
The backtrace is the same -- the same check_sums routine is called for different fields.
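For reference, here is a minimal, self-contained sketch (not the actual E3SM source; routine and variable names are illustrative) of the kind of tolerance check the check_sums routine in surfrdUtilsMod.F90 performs: each set of patch/CFT weights must sum to 1.0 within a small tolerance, otherwise the run aborts.

program check_weights_demo
  implicit none
  real(8) :: wt(3)
  wt = [0.5d0, 0.3d0, 0.2d0]
  call check_sums_equal_1(wt, 17311, 'surfrd_veg_all')
contains
  subroutine check_sums_equal_1(wt, nl, caller)
    ! Abort if the weights do not sum to 1.0 within tolerance.
    real(8), intent(in) :: wt(:)
    integer, intent(in) :: nl          ! landunit/column index, for the message
    character(*), intent(in) :: caller
    real(8), parameter :: tol = 1.0d-12
    real(8) :: s
    s = sum(wt)
    if (abs(s - 1.0d0) > tol) then
      write(*,*) trim(caller)//' ERROR: sum of weights not 1.0 at nl=', nl
      write(*,*) 'sum is: ', s
      stop 1   ! the real code calls endrun()
    end if
  end subroutine check_sums_equal_1
end program check_weights_demo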
Thanks. Wuyin also indicated that he is using a version of the code that includes the Intel compiler update.
Since we updated this (~Sep 19th), I've not seen any more errors of this sort on CDash -- and I've been running quite a few benchmark jobs on pm-cpu (and muller-cpu, which is almost identical) with the updated compiler version -- no errors like this yet. I certainly didn't think the compiler version would "fix" it.
I ran into similar errors with the compset F20TR and resolution "ne30pg2_r05_IcoswISC30E3r5":
surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl= 3305 and t=1
sum is: 0.967455393474239
ENDRUN:
ERROR in surfrdUtilsMod.F90 at line 75
ERROR: Unknown error submitted to shr_abort_abort.
surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl= 93107 and t= 1
sum is: 0.000000000000000E+000
ENDRUN:
ERROR in surfrdUtilsMod.F90 at line 75
ERROR: Unknown error submitted to shr_abort_abort.
surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl= 94312 and t= 1
sum is: 0.000000000000000E+000
ENDRUN:
ERROR in surfrdUtilsMod.F90 at line 75
ERROR: Unknown error submitted to shr_abort_abort.
Any clues on how to solve this issue?
I actually had two 270-node F cases that failed. One of each variety:
962: surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl= 11555 and t=
962: 1
962: sum is: 1.59000001978980
962: ENDRUN:
962: ERROR in surfrdUtilsMod.F90 at line 75
and
9728: ERROR: sum of areas on globe does not equal 4*pi
Case: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-sep23/f2010.piCtl.ne120pg2_r025_IcoswISC30E3r5.nofini.r0270.pb
jobid: 31201149
which does include the potential bad node nid004324
Then, while testing a potential fix I found for a different init issue I've been struggling with, I have seen two passes with this same 270-node setup. Certainly not conclusive, but this is an easy/safe thing to try.
OK case:
/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-sep23/f2010.piCtl.ne120pg2_r025_IcoswISC30E3r5.nofini.r0270.pb.base.barr
The potential fix/hack is adding an MPI_Barrier before a certain MPI_Allreduce, as described here: https://github.com/E3SM-Project/E3SM/issues/6655
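For illustration, a minimal, runnable sketch of that barrier-before-allreduce pattern (this is not the actual E3SM call site -- see the issue linked above for that; the program and variable names are made up). The barrier synchronizes all ranks so that none enters the Allreduce far ahead of the others:

program barrier_allreduce_demo
  use mpi
  implicit none
  integer :: ierr, rank
  real(8) :: local_val, global_val
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  local_val = real(rank, 8)
  ! The workaround: synchronize all ranks before the reduction.
  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  call MPI_Allreduce(local_val, global_val, 1, MPI_REAL8, &
                     MPI_SUM, MPI_COMM_WORLD, ierr)
  if (rank == 0) write(*,*) 'global sum =', global_val
  call MPI_Finalize(ierr)
end program barrier_allreduce_demo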
With testing on muller-cpu, I've actually been unable to reproduce these errors (of sums not equal to 1.0) -- the only issues I've had so far are stalls/hangs. I've run 300-400 cases at different resolutions and node counts.
Wuyin gave me a land-only launch script that he has been using recently on pm-cpu, where he encounters the error noted above more frequently. I tried it on muller-cpu.
readonly COMPSET="2010_DATM%ERA56HR_ELM%CNPRDCTCBCTOP_SICE_SOCN_MOSART_SGLC_SWAV"
readonly RESOLUTION="ne120pg2_r025_RRSwISC6to18E3r5"
I ran it with 8, 16, 32, 64, 128, and 256 nodes and have yet to see a failure of the sort we see on pm-cpu.
I'm not yet sure what this means -- it seems unlikely that the Slingshot changes on muller-cpu could be affecting this.
However, I do see hangs in init at 256 nodes. The barrier hack noted above does not seem to fix them, but the libfabric setting does allow the run to complete every time (so far). This might be enough evidence to say that the hanging-in-init issue is simply different from the sums-not-equal-to-1 issue.
With my testing on muller-cpu (which, again, uses the newer Slingshot software coming soon to pm-cpu), I'm finding that:
a) everything works as before (for E3SM/SCREAM) at various resolutions, with no obvious measurable performance difference
b) some cases (F/I cases at ne120) hang at higher node counts
c) for all cases that hang, using FI_MR_CACHE_MONITOR=kdreg2 seems to resolve the hang
d) there may be other workarounds that would avoid needing this env var, but they have not worked in all situations
Now, this might not even be related to the current issue above; it's just something we should consider trying even now on pm-cpu. We can add this to config_machines.xml to affect new cases:
<env name="FI_MR_CACHE_MONITOR">kdreg2</env>
There might be a small performance impact with kdreg2, or maybe none -- the timings are very similar.
I can try to describe better what this does, but my understanding is that it's something newer that HPE is working on.
Just adding more info on that env var here for completeness:
The default is FI_MR_CACHE_MONITOR=userfaultfd
# Define a default memory registration monitor. The monitor checks for virtual to physical memory address changes. Options are: kdreg2, memhooks, userfaultfd and disabled. Kdreg2 is supplied as a loadable Linux kernel module. Memhooks operates by intercepting memory allocation and free calls. Userfaultfd is a Linux kernel feature. 'memhooks' is the default if available on the system. The 'disabled' option disables memory caching.
Danqing had the same error with an F-case. Job ID: 30483554, https://pace.ornl.gov/exp-details/191658
As with the other issue, it does look like there is at least one "bad node" on pm-cpu. If I specifically ask for nid004324, I see these errors for these test cases:
ERS.ELM_USRDAT.IELM.pm-cpu_intel.elm-surface_water_dynamics.sus/run/e3sm.log.31574808.241008-030200: 2: surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl= 1884 and t=
ERS.f09_g16.I1850ELMCN.pm-cpu_intel.elm-bgcinterface.sus/run/e3sm.log.31574809.241008-030122: 2: surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl= 493 and t=
ERS.r05_r05.IELM.pm-cpu_intel.elm-lnd_rof_2way.sus/run/e3sm.log.31574861.241008-030947: 2: surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl= 2229 and t=
SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_intel.allactive-
Job IDs: 31574808, 31574809, 31574861
Note this compute node was not used in some of the other failing jobs above.
I think I have found the other bad node. I submit that we will always see this error if either of these two nodes is used, and will not see a crash if they are avoided:
nid006855
nid004324
To submit a job that will avoid these 2: case.submit -a="-x nid004324,nid006855"
I'm working with NERSC now; they have removed nid004324 from the pool but are letting me test on it.
Testing on the nid004324 node, I have learned a few things:
1) Affected Intel optimized cases always fail in the same way -- I'm still trying to learn what types of cases are affected (for example, SMS_Ly2_P1x1.1x1_smallvilleIA.IELMCNCROP does not fail)
2) GNU optimized cases also fail -- but with a different error message (below)
3) With both Intel and GNU, the DEBUG cases do not fail
2: SNICAR ERROR: negative absoption : -0.641576E-01 at timestep: 9 at column: 4053
2: SNICAR_AD STATS: snw_rds(0)= 55
2: SNICAR_AD STATS: L_snw(0)= 3.3222098821765529E-002
2: SNICAR_AD STATS: h2osno= 3.3222098821765529E-002 snl= -1
2: SNICAR_AD STATS: soot1(0)= 0.0000000000000000
2: SNICAR_AD STATS: soot2(0)= 0.0000000000000000
2: SNICAR_AD STATS: dust1(0)= 0.0000000000000000
2: SNICAR_AD STATS: dust2(0)= 0.0000000000000000
2: SNICAR_AD STATS: dust3(0)= 0.0000000000000000
2: SNICAR_AD STATS: dust4(0)= 0.0000000000000000
2: calling getglobalwrite with decomp_index= 4053 and elmlevel= column
2: local column index = 4053
2: global column index = 186402
2: global landunit index = 58074
2: global gridcell index = 16804
2: gridcell longitude = 152.50000000000000
2: gridcell latitude = 58.900523560209386
2: column type = 1
2: landunit type = 1
2: ENDRUN:ERROR in /global/cfs/cdirs/e3sm/ndk/repos/nexty-sep23/components/elm/src/biogeophys/SnowSnicarMod.F90 at line 2934 \
2: ERROR: Unknown error submitted to shr_abort_abort.
2: #0 0xd33baa in __shr_abort_mod_MOD_shr_abort_backtrace
2: at /global/cfs/cdirs/e3sm/ndk/repos/nexty-sep23/share/util/shr_abort_mod.F90:104
2: #1 0xd33d80 in __shr_abort_mod_MOD_shr_abort_abort
2: at /global/cfs/cdirs/e3sm/ndk/repos/nexty-sep23/share/util/shr_abort_mod.F90:61
2: #2 0x7f39ed in __snowsnicarmod_MOD_snicar_ad_rt
2: at /global/cfs/cdirs/e3sm/ndk/repos/nexty-sep23/components/elm/src/biogeophys/SnowSnicarMod.F90:2934
2: #3 0x85067e in __surfacealbedomod_MOD_surfacealbedo
2: at /global/cfs/cdirs/e3sm/ndk/repos/nexty-sep23/components/elm/src/biogeophys/SurfaceAlbedoMod.F90:637
2: #4 0x52e9f7 in __elm_driver_MOD_elm_drv
2: at /global/cfs/cdirs/e3sm/ndk/repos/nexty-sep23/components/elm/src/main/elm_driver.F90:1376
2: #5 0x516ac9 in __lnd_comp_mct_MOD_lnd_run_mct
2: at /global/cfs/cdirs/e3sm/ndk/repos/nexty-sep23/components/elm/src/cpl/lnd_comp_mct.F90:617
2: #6 0x48118a in __component_mod_MOD_component_run
2: at /global/cfs/cdirs/e3sm/ndk/repos/nexty-sep23/driver-mct/main/component_mod.F90:757
2: #7 0x46fb07 in __cime_comp_mod_MOD_cime_run
I ran e3sm_developer only on nid004324 (where only one-node jobs were allowed), with both Intel and GNU. The idea was to verify that we always get these failures on this node, but also to see what other types of failures this node might have been causing. And is it possible that a case still continues anyway? How would we know?
ERIO.ne30_g16_rx1.A.pm-cpu_gnu.wnid004324ed pass nodes= 1 mins= 14.3 state= COMPLETED notes=
ERIO.ne30_g16_rx1.A.pm-cpu_intel.wnid004324ed fail COMPARE_netcdf4c_ nodes= 1 mins= 11.6 state= COMPLETED notes=
ERP_Ld3.ne4pg2_oQU480.F2010.pm-cpu_gnu.wnid004324ed fail RUN nodes= 1 mins= 2.7 state= FAILED notes=
ERP_Ld3.ne4pg2_oQU480.F2010.pm-cpu_intel.wnid004324ed fail RUN nodes= 1 mins= 0.8 state= FAILED notes=ERROR: sum of areas on globe does not equal 4*pi
ERS.ELM_USRDAT.I1850CNPRDCTCBC.pm-cpu_gnu.elm-usrpft_codetest_I1850CNPRDCTCBC.wnid004324ed pass nodes= 1 mins= 1.1 state= COMPLETED notes=
ERS.ELM_USRDAT.I1850CNPRDCTCBC.pm-cpu_gnu.elm-usrpft_default_I1850CNPRDCTCBC.wnid004324ed pass nodes= 1 mins= 0.9 state= COMPLETED notes=
ERS.ELM_USRDAT.I1850CNPRDCTCBC.pm-cpu_intel.elm-usrpft_codetest_I1850CNPRDCTCBC.wnid004324ed pass nodes= 1 mins= 1.7 state= COMPLETED notes=
ERS.ELM_USRDAT.I1850CNPRDCTCBC.pm-cpu_intel.elm-usrpft_default_I1850CNPRDCTCBC.wnid004324ed pass nodes= 1 mins= 1.3 state= COMPLETED notes=
ERS.ELM_USRDAT.I1850ELM.pm-cpu_gnu.elm-usrdat.wnid004324ed pass nodes= 1 mins= 1.1 state= COMPLETED notes=
ERS.ELM_USRDAT.I1850ELM.pm-cpu_intel.elm-usrdat.wnid004324ed pass nodes= 1 mins= 2.2 state= COMPLETED notes=
ERS.ELM_USRDAT.IELM.pm-cpu_gnu.elm-surface_water_dynamics.wnid004324ed fail RUN nodes= 1 mins= 0.8 state= FAILED notes=SNICAR ERROR: negative absoption
ERS.ELM_USRDAT.IELM.pm-cpu_intel.elm-surface_water_dynamics.wnid004324ed fail RUN nodes= 1 mins= 0.6 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.MOS_USRDAT.RMOSGPCC.pm-cpu_gnu.mosart-mos_usrdat.wnid004324ed pass nodes= 1 mins= 1.4 state= COMPLETED notes=
ERS.MOS_USRDAT.RMOSGPCC.pm-cpu_intel.mosart-mos_usrdat.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 1.4 state= COMPLETED notes=
ERS.MOS_USRDAT.RMOSNLDAS.pm-cpu_gnu.mosart-sediment.wnid004324ed pass nodes= 1 mins= 2.6 state= COMPLETED notes=
ERS.MOS_USRDAT.RMOSNLDAS.pm-cpu_intel.mosart-sediment.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 3.1 state= COMPLETED notes=
ERS.f09_g16.I1850ELMCN.pm-cpu_gnu.elm-bgcinterface.wnid004324ed fail RUN nodes= 1 mins= 1.9 state= FAILED notes=SNICAR ERROR: negative absoption
ERS.f09_g16.I1850ELMCN.pm-cpu_intel.elm-bgcinterface.wnid004324ed fail RUN nodes= 1 mins= 0.9 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f09_g16.I1850GSWCNPRDCTCBC.pm-cpu_gnu.elm-vstrd.wnid004324ed fail RUN nodes= 1 mins= 1.9 state= FAILED notes=SNICAR ERROR: negative absoption
ERS.f09_g16.I1850GSWCNPRDCTCBC.pm-cpu_intel.elm-vstrd.wnid004324ed fail RUN nodes= 1 mins= 1.3 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f09_g16.IELMBC.pm-cpu_gnu.elm-simple_decomp.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 3.6 state= COMPLETED notes=
ERS.f09_g16.IELMBC.pm-cpu_gnu.wnid004324ed fail RUN nodes= 1 mins= 0.8 state= FAILED notes=SNICAR ERROR: negative absoption
ERS.f09_g16.IELMBC.pm-cpu_intel.elm-simple_decomp.wnid004324ed fail RUN nodes= 1 mins= 1.2 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f09_g16.IELMBC.pm-cpu_intel.wnid004324ed fail RUN nodes= 1 mins= 0.9 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f09_g16_g.MALISIA.pm-cpu_gnu.wnid004324ed fail RUN nodes= 1 mins= 3.4 state= FAILED notes=
ERS.f09_g16_g.MALISIA.pm-cpu_intel.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 0.8 state= COMPLETED notes=
ERS.f19_f19.I1850ELMCN.pm-cpu_gnu.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 1.8 state= COMPLETED notes=
ERS.f19_f19.I1850ELMCN.pm-cpu_intel.wnid004324ed fail RUN nodes= 1 mins= 0.6 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f19_f19.I20TRELMCN.pm-cpu_gnu.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 3.0 state= COMPLETED notes=
ERS.f19_f19.I20TRELMCN.pm-cpu_intel.wnid004324ed fail RUN nodes= 1 mins= 0.7 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f19_g16.I1850CNECACNTBC.pm-cpu_gnu.elm-eca.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 1.9 state= COMPLETED notes=
ERS.f19_g16.I1850CNECACNTBC.pm-cpu_intel.elm-eca.wnid004324ed fail RUN nodes= 1 mins= 0.8 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f19_g16.I1850CNECACTCBC.pm-cpu_gnu.elm-eca.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 2.3 state= COMPLETED notes=
ERS.f19_g16.I1850CNECACTCBC.pm-cpu_intel.elm-eca.wnid004324ed fail RUN nodes= 1 mins= 0.8 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f19_g16.I1850CNRDCTCBC.pm-cpu_gnu.elm-rd.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 1.8 state= COMPLETED notes=
ERS.f19_g16.I1850CNRDCTCBC.pm-cpu_intel.elm-rd.wnid004324ed fail RUN nodes= 1 mins= 0.7 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f19_g16.I1850ELM.pm-cpu_gnu.elm-betr.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 13.6 state= COMPLETED notes=
ERS.f19_g16.I1850ELM.pm-cpu_gnu.elm-vst.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 2.5 state= COMPLETED notes=
ERS.f19_g16.I1850ELM.pm-cpu_intel.elm-betr.wnid004324ed fail RUN nodes= 1 mins= 0.6 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f19_g16.I1850ELM.pm-cpu_intel.elm-vst.wnid004324ed fail RUN nodes= 1 mins= 1.2 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f19_g16.I1850GSWCNPECACNTBC.pm-cpu_gnu.elm-eca_f19_g16_I1850GSWCNPECACNTBC.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 2.3 state= COMPLETED notes=
ERS.f19_g16.I1850GSWCNPECACNTBC.pm-cpu_intel.elm-eca_f19_g16_I1850GSWCNPECACNTBC.wnid004324ed fail RUN nodes= 1 mins= 1.1 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f19_g16.I20TRGSWCNPECACNTBC.pm-cpu_gnu.elm-eca_f19_g16_I20TRGSWCNPECACNTBC.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 2.2 state= COMPLETED notes=
ERS.f19_g16.I20TRGSWCNPECACNTBC.pm-cpu_intel.elm-eca_f19_g16_I20TRGSWCNPECACNTBC.wnid004324ed fail RUN nodes= 1 mins= 0.8 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f19_g16.I20TRGSWCNPRDCTCBC.pm-cpu_gnu.elm-ctc_f19_g16_I20TRGSWCNPRDCTCBC.wnid004324ed fail RUN nodes= 1 mins= 1.0 state= FAILED notes=SNICAR ERROR: negative absoption
ERS.f19_g16.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-ctc_f19_g16_I20TRGSWCNPRDCTCBC.wnid004324ed fail RUN nodes= 1 mins= 0.7 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f19_g16.IERA56HRELM.pm-cpu_gnu.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 2.5 state= COMPLETED notes=
ERS.f19_g16.IERA56HRELM.pm-cpu_intel.wnid004324ed fail RUN nodes= 1 mins= 0.9 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f19_g16.IERA5ELM.pm-cpu_gnu.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 3.2 state= COMPLETED notes=
ERS.f19_g16.IERA5ELM.pm-cpu_intel.wnid004324ed fail RUN nodes= 1 mins= 1.1 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.f19_g16_rx1.A.pm-cpu_gnu.wnid004324ed pass nodes= 1 mins= 1.1 state= COMPLETED notes=
ERS.f19_g16_rx1.A.pm-cpu_intel.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 2.3 state= COMPLETED notes=
ERS.ne30_g16_rx1.A.pm-cpu_gnu.wnid004324ed pass nodes= 1 mins= 1.2 state= COMPLETED notes=
ERS.ne30_g16_rx1.A.pm-cpu_intel.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 2.2 state= COMPLETED notes=
ERS.r05_r05.ICNPRDCTCBC.pm-cpu_gnu.elm-cbudget.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 13.2 state= COMPLETED notes=
ERS.r05_r05.ICNPRDCTCBC.pm-cpu_intel.elm-cbudget.wnid004324ed fail RUN nodes= 1 mins= 1.1 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_cft not 1.0
ERS.r05_r05.IELM.pm-cpu_gnu.elm-V2_ELM_MOSART_features.wnid004324ed fail RUN nodes= 1 mins= 1.9 state= FAILED notes=SNICAR ERROR: negative absoption
ERS.r05_r05.IELM.pm-cpu_gnu.elm-lnd_rof_2way.wnid004324ed fail RUN nodes= 1 mins= 2.4 state= FAILED notes=SNICAR ERROR: negative absoption
ERS.r05_r05.IELM.pm-cpu_intel.elm-V2_ELM_MOSART_features.wnid004324ed fail RUN nodes= 1 mins= 1.0 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_cft not 1.0
ERS.r05_r05.IELM.pm-cpu_intel.elm-lnd_rof_2way.wnid004324ed fail RUN nodes= 1 mins= 1.6 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS.r05_r05.RMOSGPCC.pm-cpu_gnu.mosart-gpcc_1972.wnid004324ed pass nodes= 1 mins= 1.9 state= COMPLETED notes=
ERS.r05_r05.RMOSGPCC.pm-cpu_gnu.mosart-heat.wnid004324ed pass nodes= 1 mins= 2.0 state= COMPLETED notes=
ERS.r05_r05.RMOSGPCC.pm-cpu_intel.mosart-gpcc_1972.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 2.5 state= COMPLETED notes=
ERS.r05_r05.RMOSGPCC.pm-cpu_intel.mosart-heat.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 1.8 state= COMPLETED notes=
ERS_D.f09_f09.IELM.pm-cpu_gnu.elm-koch_snowflake.wnid004324ed pass nodes= 1 mins= 2.9 state= COMPLETED notes=
ERS_D.f09_f09.IELM.pm-cpu_gnu.elm-solar_rad.wnid004324ed pass nodes= 1 mins= 4.0 state= COMPLETED notes=
ERS_D.f09_f09.IELM.pm-cpu_intel.elm-koch_snowflake.wnid004324ed pass nodes= 1 mins= 4.2 state= COMPLETED notes=
ERS_D.f09_f09.IELM.pm-cpu_intel.elm-solar_rad.wnid004324ed pass nodes= 1 mins= 4.4 state= COMPLETED notes=
ERS_D.f09_g16.I1850ELMCN.pm-cpu_gnu.wnid004324ed pass nodes= 1 mins= 4.3 state= COMPLETED notes=
ERS_D.f09_g16.I1850ELMCN.pm-cpu_intel.wnid004324ed pass nodes= 1 mins= 8.2 state= COMPLETED notes=
ERS_D.f19_f19.IELM.pm-cpu_gnu.elm-ic_f19_f19_ielm.wnid004324ed pass nodes= 1 mins= 1.7 state= COMPLETED notes=
ERS_D.f19_f19.IELM.pm-cpu_intel.elm-ic_f19_f19_ielm.wnid004324ed pass nodes= 1 mins= 2.6 state= COMPLETED notes=
ERS_D.f19_g16.I1850GSWCNPRDCTCBC.pm-cpu_gnu.elm-ctc_f19_g16_I1850GSWCNPRDCTCBC.wnid004324ed pass nodes= 1 mins= 2.5 state= COMPLETED notes=
ERS_D.f19_g16.I1850GSWCNPRDCTCBC.pm-cpu_intel.elm-ctc_f19_g16_I1850GSWCNPRDCTCBC.wnid004324ed pass nodes= 1 mins= 3.8 state= COMPLETED notes=
ERS_D.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-hommexx.wnid004324ed pass nodes= 1 mins= 7.8 state= COMPLETED notes=
ERS_D.ne4pg2_oQU480.F2010.pm-cpu_intel.eam-hommexx.wnid004324ed pass nodes= 1 mins= 11.3 state= COMPLETED notes=
ERS_D.ne4pg2_oQU480.I20TRELM.pm-cpu_gnu.elm-disableDynpftCheck.wnid004324ed pass nodes= 1 mins= 1.4 state= COMPLETED notes=
ERS_D.ne4pg2_oQU480.I20TRELM.pm-cpu_intel.elm-disableDynpftCheck.wnid004324ed pass nodes= 1 mins= 1.7 state= COMPLETED notes=
ERS_D_Ld15.f45_g37.IELMFATES.pm-cpu_gnu.elm-fates_cold_treedamage.wnid004324ed pass nodes= 1 mins= 2.7 state= COMPLETED notes=
ERS_D_Ld15.f45_g37.IELMFATES.pm-cpu_intel.elm-fates_cold_treedamage.wnid004324ed pass nodes= 1 mins= 3.5 state= COMPLETED notes=
ERS_Ld20.f45_f45.IELMFATES.pm-cpu_gnu.elm-fates.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 2.4 state= COMPLETED notes=
ERS_Ld20.f45_f45.IELMFATES.pm-cpu_intel.elm-fates.wnid004324ed fail RUN nodes= 1 mins= 0.8 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS_Ld3.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-thetahy_sl_pg2.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 2.2 state= COMPLETED notes=
ERS_Ld3.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-thetahy_sl_pg2_ftype0.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 3.0 state= COMPLETED notes=
ERS_Ld3.ne4pg2_oQU480.F2010.pm-cpu_intel.eam-thetahy_sl_pg2.wnid004324ed fail RUN nodes= 1 mins= 0.6 state= FAILED notes=ERROR: sum of areas on globe does not equal 4*pi
ERS_Ld3.ne4pg2_oQU480.F2010.pm-cpu_intel.eam-thetahy_sl_pg2_ftype0.wnid004324ed fail RUN nodes= 1 mins= 0.4 state= FAILED notes=ERROR: sum of areas on globe does not equal 4*pi
ERS_Ld30.f45_f45.IELMFATES.pm-cpu_gnu.elm-fates_satphen.wnid004324ed fail RUN nodes= 1 mins= 0.8 state= FAILED notes=SNICAR ERROR: negative absoption
ERS_Ld30.f45_f45.IELMFATES.pm-cpu_intel.elm-fates_satphen.wnid004324ed fail RUN nodes= 1 mins= 0.9 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS_Ld30.f45_g37.IELMFATES.pm-cpu_gnu.elm-fates_cold_sizeagemort.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 3.0 state= COMPLETED notes=
ERS_Ld30.f45_g37.IELMFATES.pm-cpu_intel.elm-fates_cold_sizeagemort.wnid004324ed fail RUN nodes= 1 mins= 0.9 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
ERS_Ld5.T62_oQU120.CMPASO-NYF.pm-cpu_gnu.wnid004324ed pass nodes= 1 mins= 2.7 state= COMPLETED notes=
ERS_Ld5.T62_oQU120.CMPASO-NYF.pm-cpu_intel.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 3.4 state= COMPLETED notes=
ERS_Ld5.T62_oQU240.DTESTM.pm-cpu_gnu.wnid004324ed pass nodes= 1 mins= 1.4 state= COMPLETED notes=
ERS_Ld5.T62_oQU240.DTESTM.pm-cpu_intel.wnid004324ed fail RUN nodes= 1 mins= 0.6 state= FAILED notes=
ERS_Ld5.T62_oQU240wLI.GMPAS-DIB-IAF-DISMF.pm-cpu_gnu.wnid004324ed pass nodes= 1 mins= 2.5 state= COMPLETED notes=
ERS_Ld5.T62_oQU240wLI.GMPAS-DIB-IAF-DISMF.pm-cpu_intel.wnid004324ed fail RUN nodes= 1 mins= 1.9 state= FAILED notes=
ERS_Ld5.T62_oQU240wLI.GMPAS-DIB-IAF-PISMF.pm-cpu_gnu.wnid004324ed pass nodes= 1 mins= 2.2 state= COMPLETED notes=
ERS_Ld5.T62_oQU240wLI.GMPAS-DIB-IAF-PISMF.pm-cpu_intel.wnid004324ed fail RUN nodes= 1 mins= 1.7 state= FAILED notes=
ERS_Ld5.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.pm-cpu_gnu.mpaso-ocn_glcshelf.wnid004324ed fail RUN nodes= 1 mins= 1.6 state= FAILED notes=
ERS_Ld5.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.pm-cpu_intel.mpaso-ocn_glcshelf.wnid004324ed fail RUN nodes= 1 mins= 1.5 state= FAILED notes=
ERS_Ln9.ne4pg2_ne4pg2.F2010-MMF1.pm-cpu_gnu.eam-mmf_crmout.wnid004324ed fail COMPARE_base_rest nodes= 1 mins= 3.8 state= COMPLETED notes=
ERS_Ln9.ne4pg2_ne4pg2.F2010-MMF1.pm-cpu_intel.eam-mmf_crmout.wnid004324ed fail RUN nodes= 1 mins= 0.4 state= FAILED notes=ERROR: sum of areas on globe does not equal 4*pi
NCK.f19_g16_rx1.A.pm-cpu_gnu.wnid004324ed pass nodes= 1 mins= 1.4 state= COMPLETED notes=
NCK.f19_g16_rx1.A.pm-cpu_intel.wnid004324ed pass nodes= 1 mins= 2.9 state= COMPLETED notes=
PEM_Ln5.T62_oQU240wLI.DTESTM.pm-cpu_gnu.wnid004324ed fail COMPARE_base_modp nodes= 1 mins= 2.2 state= COMPLETED notes=
PEM_Ln5.T62_oQU240wLI.DTESTM.pm-cpu_intel.wnid004324ed fail RUN nodes= 1 mins= 1.4 state= FAILED notes=
PEM_Ln5.T62_oQU240wLI.GMPAS-DIB-IAF-DISMF.pm-cpu_gnu.wnid004324ed fail COMPARE_base_modp nodes= 1 mins= 1.9 state= COMPLETED notes=
PEM_Ln5.T62_oQU240wLI.GMPAS-DIB-IAF-DISMF.pm-cpu_intel.wnid004324ed fail RUN nodes= 1 mins= 2.1 state= FAILED notes=
PEM_Ln5.T62_oQU240wLI.GMPAS-DIB-IAF-PISMF.pm-cpu_gnu.wnid004324ed fail COMPARE_base_modp nodes= 1 mins= 1.7 state= COMPLETED notes=
PEM_Ln5.T62_oQU240wLI.GMPAS-DIB-IAF-PISMF.pm-cpu_intel.wnid004324ed fail RUN nodes= 1 mins= 1.9 state= FAILED notes=
PET_Ln5.T62_oQU240.DTESTM.pm-cpu_gnu.wnid004324ed pass nodes= 1 mins= 1.3 state= COMPLETED notes=
PET_Ln5.T62_oQU240.DTESTM.pm-cpu_intel.wnid004324ed fail COMPARE_base_sing nodes= 1 mins= 0.9 state= COMPLETED notes=
PET_Ln5.T62_oQU240wLI.GMPAS-DIB-IAF-DISMF.pm-cpu_gnu.wnid004324ed pass nodes= 1 mins= 1.4 state= COMPLETED notes=
PET_Ln5.T62_oQU240wLI.GMPAS-DIB-IAF-DISMF.pm-cpu_intel.wnid004324ed fail COMPARE_base_sing nodes= 1 mins= 0.8 state= COMPLETED notes=
PET_Ln5.T62_oQU240wLI.GMPAS-DIB-IAF-PISMF.pm-cpu_gnu.wnid004324ed pass nodes= 1 mins= 1.2 state= COMPLETED notes=
PET_Ln5.T62_oQU240wLI.GMPAS-DIB-IAF-PISMF.pm-cpu_intel.wnid004324ed fail COMPARE_base_sing nodes= 1 mins= 1.1 state= COMPLETED notes=
SEQ.f19_g16.X.pm-cpu_gnu.wnid004324ed pass nodes= 1 mins= 2.5 state= COMPLETED notes=
SEQ.f19_g16.X.pm-cpu_intel.wnid004324ed fail COMPARE_base_seq nodes= 1 mins= 2.6 state= COMPLETED notes=
SMS.MOS_USRDAT.RMOSGPCC.pm-cpu_gnu.mosart-unstructure.wnid004324ed pass nodes= 1 mins= 0.9 state= COMPLETED notes=
SMS.MOS_USRDAT.RMOSGPCC.pm-cpu_intel.mosart-unstructure.wnid004324ed pass nodes= 1 mins= 0.9 state= COMPLETED notes=
SMS.ne30_f19_g16_rx1.A.pm-cpu_gnu.wnid004324ed pass nodes= 1 mins= 0.9 state= COMPLETED notes=
SMS.ne30_f19_g16_rx1.A.pm-cpu_intel.wnid004324ed pass nodes= 1 mins= 0.7 state= COMPLETED notes=
SMS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-cosplite.wnid004324ed fail RUN nodes= 1 mins= 1.0 state= FAILED notes=
SMS.ne4pg2_oQU480.F2010.pm-cpu_intel.eam-cosplite.wnid004324ed fail RUN nodes= 1 mins= 0.8 state= FAILED notes=ERROR: sum of areas on globe does not equal 4*pi
SMS.r05_r05.I1850ELMCN.pm-cpu_gnu.elm-qian_1948.wnid004324ed fail RUN nodes= 1 mins= 1.2 state= FAILED notes=SNICAR ERROR: negative absoption
SMS.r05_r05.I1850ELMCN.pm-cpu_intel.elm-qian_1948.wnid004324ed fail RUN nodes= 1 mins= 1.2 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
SMS.r05_r05.IELM.pm-cpu_gnu.elm-topounit.wnid004324ed fail RUN nodes= 1 mins= 2.0 state= FAILED notes=SNICAR ERROR: negative absoption
SMS.r05_r05.IELM.pm-cpu_intel.elm-topounit.wnid004324ed fail RUN nodes= 1 mins= 0.8 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
SMS_D_Ld1.TL319_IcoswISC30E3r5.DTESTM-JRA1p5.pm-cpu_gnu.mpassi-jra_1958.wnid004324ed pass nodes= 1 mins= 3.6 state= COMPLETED notes=
SMS_D_Ld1.TL319_IcoswISC30E3r5.DTESTM-JRA1p5.pm-cpu_intel.mpassi-jra_1958.wnid004324ed pass nodes= 1 mins= 6.9 state= COMPLETED notes=
SMS_D_Ld1.TL319_IcoswISC30E3r5.GMPAS-JRA1p5-DIB-PISMF.pm-cpu_gnu.mpaso-jra_1958.wnid004324ed pass nodes= 1 mins= 7.8 state= COMPLETED notes=
SMS_D_Ld1.TL319_IcoswISC30E3r5.GMPAS-JRA1p5-DIB-PISMF.pm-cpu_intel.mpaso-jra_1958.wnid004324ed pass nodes= 1 mins= 17.3 state= COMPLETED notes=
SMS_D_Ld20.f45_f45.IELMFATES.pm-cpu_gnu.elm-fates_rd.wnid004324ed pass nodes= 1 mins= 2.3 state= COMPLETED notes=
SMS_D_Ld20.f45_f45.IELMFATES.pm-cpu_intel.elm-fates_rd.wnid004324ed pass nodes= 1 mins= 3.7 state= COMPLETED notes=
SMS_D_Ln5.ne4pg2_oQU480.F2010.pm-cpu_gnu.wnid004324ed pass nodes= 1 mins= 1.2 state= COMPLETED notes=
SMS_D_Ln5.ne4pg2_oQU480.F2010.pm-cpu_intel.wnid004324ed pass nodes= 1 mins= 1.2 state= COMPLETED notes=
SMS_Ld20.f45_f45.IELMFATES.pm-cpu_gnu.elm-fates_eca.wnid004324ed pass nodes= 1 mins= 1.6 state= COMPLETED notes=
SMS_Ld20.f45_f45.IELMFATES.pm-cpu_intel.elm-fates_eca.wnid004324ed fail RUN nodes= 1 mins= 0.9 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
SMS_Ld5_PS.f19_g16.IELMFATES.pm-cpu_gnu.elm-fates_cold.wnid004324ed fail RUN nodes= 1 mins= 1.2 state= FAILED notes=SNICAR ERROR: negative absoption
SMS_Ld5_PS.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.wnid004324ed fail RUN nodes= 1 mins= 1.1 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
SMS_Ln5.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-thetahy_pg2.wnid004324ed fail RUN nodes= 1 mins= 1.9 state= FAILED notes=bad state in EOS
SMS_Ln5.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-thetahy_sl_pg2.wnid004324ed pass nodes= 1 mins= 1.6 state= COMPLETED notes=
SMS_Ln5.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-thetahy_sl_pg2_ftype0.wnid004324ed pass nodes= 1 mins= 1.0 state= COMPLETED notes=
SMS_Ln5.ne4pg2_oQU480.F2010.pm-cpu_gnu.wnid004324ed pass nodes= 1 mins= 1.7 state= COMPLETED notes=
SMS_Ln5.ne4pg2_oQU480.F2010.pm-cpu_intel.eam-thetahy_pg2.wnid004324ed fail RUN nodes= 1 mins= 0.7 state= FAILED notes=ERROR: sum of areas on globe does not equal 4*pi
SMS_Ln5.ne4pg2_oQU480.F2010.pm-cpu_intel.eam-thetahy_sl_pg2.wnid004324ed fail RUN nodes= 1 mins= 1.1 state= FAILED notes=ERROR: sum of areas on globe does not equal 4*pi
SMS_Ln5.ne4pg2_oQU480.F2010.pm-cpu_intel.eam-thetahy_sl_pg2_ftype0.wnid004324ed fail RUN nodes= 1 mins= 0.5 state= FAILED notes=ERROR: sum of areas on globe does not equal 4*pi
SMS_Ln5.ne4pg2_oQU480.F2010.pm-cpu_intel.wnid004324ed fail RUN nodes= 1 mins= 0.4 state= FAILED notes=ERROR: sum of areas on globe does not equal 4*pi
SMS_Ln9.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-outfrq9s.wnid004324ed pass nodes= 1 mins= 1.1 state= COMPLETED notes=
SMS_Ln9.ne4pg2_oQU480.F2010.pm-cpu_intel.eam-outfrq9s.wnid004324ed fail RUN nodes= 1 mins= 0.7 state= FAILED notes=ERROR: sum of areas on globe does not equal 4*pi
SMS_Ln9_P24x1.ne4_ne4.FDPSCREAM-ARM97.pm-cpu_gnu.wnid004324ed pass nodes= 1 mins= 1.1 state= COMPLETED notes=
SMS_Ln9_P24x1.ne4_ne4.FDPSCREAM-ARM97.pm-cpu_intel.wnid004324ed fail RUN nodes= 1 mins= 0.8 state= FAILED notes=surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0
SMS_Ly2_P1x1.1x1_smallvilleIA.IELMCNCROP.pm-cpu_gnu.elm-fan.wnid004324ed pass nodes= 1 mins= 7.2 state= COMPLETED notes=
SMS_Ly2_P1x1.1x1_smallvilleIA.IELMCNCROP.pm-cpu_gnu.elm-force_netcdf_pio.wnid004324ed pass nodes= 1 mins= 7.1 state= COMPLETED notes=
SMS_Ly2_P1x1.1x1_smallvilleIA.IELMCNCROP.pm-cpu_gnu.elm-per_crop.wnid004324ed pass nodes= 1 mins= 7.2 state= COMPLETED notes=
SMS_Ly2_P1x1.1x1_smallvilleIA.IELMCNCROP.pm-cpu_intel.elm-fan.wnid004324ed pass nodes= 1 mins= 6.2 state= COMPLETED notes=
SMS_Ly2_P1x1.1x1_smallvilleIA.IELMCNCROP.pm-cpu_intel.elm-force_netcdf_pio.wnid004324ed pass nodes= 1 mins= 6.5 state= COMPLETED notes=
SMS_Ly2_P1x1.1x1_smallvilleIA.IELMCNCROP.pm-cpu_intel.elm-per_crop.wnid004324ed pass nodes= 1 mins= 6.1 state= COMPLETED notes=
SMS_Ly2_P1x1_D.1x1_smallvilleIA.IELMCNCROP.pm-cpu_gnu.elm-lulcc_sville.wnid004324ed pass nodes= 1 mins= 9.7 state= COMPLETED notes=
SMS_Ly2_P1x1_D.1x1_smallvilleIA.IELMCNCROP.pm-cpu_intel.elm-lulcc_sville.wnid004324ed pass nodes= 1 mins= 11.6 state= COMPLETED notes=
SMS_P12x2.ne4pg2_oQU480.WCYCL1850NS.pm-cpu_gnu.allactive-mach_mods.wnid004324ed fail RUN nodes= 1 mins= 0.6 state= FAILED notes=ERROR: sum of areas on globe does not equal 4*pi
SMS_P12x2.ne4pg2_oQU480.WCYCL1850NS.pm-cpu_intel.allactive-mach_mods.wnid004324ed fail RUN nodes= 1 mins= 0.5 state= FAILED notes=ERROR: sum of areas on globe does not equal 4*pi
SMS_R_Ld5.ne4_ne4.FSCM-ARM97.pm-cpu_gnu.eam-scm.wnid004324ed pass nodes= 1 mins= 0.8 state= COMPLETED notes=
SMS_R_Ld5.ne4_ne4.FSCM-ARM97.pm-cpu_intel.eam-scm.wnid004324ed pass nodes= 1 mins= 0.8 state= COMPLETED notes=
For example, the test ERS_Ld5.T62_oQU240wLI.GMPAS-DIB-IAF-DISMF.pm-cpu_intel does fail, but it's not obvious why.
2: MPICH ERROR [Rank 2] [job id 31613879.0] [Tue Oct 8 23:00:23 2024] [nid004324] - Abort(1) (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
2:
2: aborting job:
2: application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
srun: error: nid004324: task 2: Exited with exit code 255
srun: Terminating StepId=31613879.0
0: slurmstepd: error: *** STEP 31613879.0 ON nid004324 CANCELLED AT 2024-10-09T06:00:24 ***
0: forrtl: error (78): process killed (SIGTERM)
0: Image PC Routine Line Source
0: libpthread-2.31.s 00001479B7FEF910 Unknown Unknown Unknown
0: libmpi_intel.so.1 00001479B9F6FB46 Unknown Unknown Unknown
0: libmpi_intel.so.1 00001479B8CF9EE9 Unknown Unknown Unknown
0: libmpi_intel.so.1 00001479B989B926 Unknown Unknown Unknown
0: libmpi_intel.so.1 00001479B989FE29 Unknown Unknown Unknown
0: libmpi_intel.so.1 00001479B97E55AA Unknown Unknown Unknown
0: libmpi_intel.so.1 00001479B831F8FC Unknown Unknown Unknown
0: libmpi_intel.so.1 00001479B9A1E700 Unknown Unknown Unknown
0: libmpi_intel.so.1 00001479B81BE27C PMPI_Allreduce Unknown Unknown
0: libmpigf.so.4 00001479BA843856 mpi_allreduce_ Unknown Unknown
0: e3sm.exe 000000000152B3DD mpas_dmpar_mp_mpa 783 mpas_dmpar.f90
0: e3sm.exe 000000000085C835 seaice_error_mp_s 114 mpas_seaice_error.f90
0: e3sm.exe 00000000007A7105 seaice_icepack_mp 2073 mpas_seaice_icepack.f90
0: e3sm.exe 000000000078F1CE seaice_icepack_mp 1067 mpas_seaice_icepack.f90
0: e3sm.exe 0000000000637773 seaice_time_integ 151 mpas_seaice_time_integration.f90
0: e3sm.exe 0000000000559442 ice_comp_mct_mp_i 1163 ice_comp_mct.f90
0: e3sm.exe 000000000045E8CE component_mod_mp_ 757 component_mod.F90
0: e3sm.exe 00000000004380D9 cime_comp_mod_mp_ 2951 cime_comp_mod.F90
0: e3sm.exe 000000000045E562 MAIN__ 153 cime_driver.F90
Re: ERS_Ld5.T62_oQU240wLI.GMPAS-DIB-IAF-DISMF.pm-cpu_intel, does the MPAS seaice log show anything? There might also be MPAS error files that give details. I base this guess on the stack trace you posted.
Ah, yep, I thought I had checked. Indeed that case has a sea-ice error in its log:
ERROR: (picard_nonconvergence)-------------------------------------
ERROR: (picard_nonconvergence)picard convergence failed!
ERROR: (picard_nonconvergence) 0 -21.8443537466552 -22.1019399998646
ERROR: (picard_nonconvergence) 1 -4.84682916474659 -21.9086932337995 -125428584.244632 -113507475.175262
ERROR: (picard_nonconvergence) 2 -4.84682916474659 -21.5262394677792 -125161364.142522 -113507475.175262
ERROR: (picard_nonconvergence) 3 -4.84682916474659 -21.1477349136505 -124896903.351206 -113507475.175262
ERROR: (picard_nonconvergence) 4 -4.84682916474659 -20.7730899522568 -124635139.253861 -113507475.175262
ERROR: (picard_nonconvergence) 5 -4.84682916474659 -20.4022189507429 -124376012.018887 -113507475.175262
ERROR: (picard_nonconvergence) 1 -18.5703089534524 -18.8159095325175 0.366184490084448 0.366184490084448 1.808809718688001E-003 -342054634.279611 -341576832.929071
ERROR: (picard_nonconvergence) 2 -15.8298794934608 -16.0762390558475 1.48618427448318 1.48618427448318 8.074397048395444E-003 -335024885.309945 -334542275.786110
ERROR: (picard_nonconvergence) 3 -13.1554654273665 -13.3969516893771 2.36552601519627 2.36552601519627 1.431183578826059E-002 -328047941.626283 -327571627.681340
ERROR: (picard_nonconvergence) 4 -10.5490550853654 -10.7786000882985 2.87146716085510 2.87146716085510 1.964945227506791E-002 -321374283.764522 -320918878.605515
ERROR: (picard_nonconvergence) 5 -7.93900204014272 -8.21665519859031 3.49976815271111 3.49976815271111 2.776340862842379E-002 -313952523.194647 -313396811.502956
ERROR: (picard_nonconvergence) 6 -4.83260701839303 -5.68482374194324 6.14450298799563 6.10270310370290 7.494685939048813E-002 -295025826.915040 -293233360.461115
ERROR: (picard_nonconvergence) 7 -2.11794460353807 -3.08830351383280 12.5529104685063 12.3348506476065 0.333274257883525 -212237560.738040 -209655569.373201
ERROR: (picard_nonconvergence)-------------------------------------
ERROR: (picard_solver) picard_solver: Picard solver non-convergence
ERROR: (icepack_warnings_setabort) T
ERROR: (icepack_warnings_setabort) T :file /global/cfs/cdirs/e3sm/ndk/repos/nexty-sep23/components/mpas-framework/src/core_seaice/icepack/columnphysics/icepack_therm_mushy.F90
ERROR: (icepack_warnings_setabort) T :file /global/cfs/cdirs/e3sm/ndk/repos/nexty-sep23/components/mpas-framework/src/core_seaice/icepack/columnphysics/icepack_therm_mushy.F90 :line 1335
ERROR: (icepack_warnings_aborted) ... (picard_solver)
ERROR: (icepack_warnings_aborted) ... (two_stage_solver_snow)
ERROR: (icepack_warnings_aborted) ... (temperature_changes_salinity)
ERROR: (temperature_changes_salinity)temperature_changes_salinity: Picard solver non-convergence (snow)
ERROR: (icepack_warnings_aborted) ... (thermo_vertical)
ERROR: (icepack_warnings_aborted) ... (icepack_step_therm1)
ERROR: (icepack_step_therm1) ice: Vertical thermo error, cat 1
CRITICAL ERROR: icepack aborted
All of these tests are in /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-sep23, but they are mixed in with other things I was doing. You can probably just look at the cases I labelled with something like wnid004324.
Random idea: all GNU cases fail with SNICAR ERROR: negative absoption (there's also a typo in that error message)... could we try running some of these tests with SNICAR AD disabled to see what happens? The namelist parameter is use_snicar_ad for land (there is SNICAR code in MPAS-SI too, but it appears to be governed by a different namelist parameter, config_use_snicar_ad).
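A minimal sketch of how one might try that, assuming the land flag named above can simply be set in the case's user_nl_elm (unverified here; check the generated lnd_in to confirm it takes effect):

! user_nl_elm -- hypothetical test modification to disable SNICAR-AD in ELM
use_snicar_ad = .false.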
Still debugging this. I learned that even F2010-CICE.ne4pg2_oQU480 hits the issue on the affected node. Looking at log files, I see the earliest difference is with Vth. Printing out areas, they differ on certain cores. I can run the case with 4 or fewer MPI ranks. I can adjust flags so the case does not crash, but when I then look at the values, they are not BFB.
I see a gfr%check flag in gfr_init, which I set manually so that subroutine check_areas is called.
Normal node:
gfr> Running with dynamics and physics on separate grids (physgrid).
gfr> init nphys 2 check 1 boost_pg1 F
gfr> area fv raw 12.5663706143592 1.413579858428230E-016
gfr> area fv adj 4.240739575284689E-016 0.000000000000000E+000
gfr> area gll 4.240739575284689E-016
bad node:
gfr> Running with dynamics and physics on separate grids (physgrid).
gfr> init nphys 2 check 1 boost_pg1 F
gfr> area fv raw 12.5663706143592 1.413579858428230E-016
gfr> area fv adj 3.876668690154838E-009 0.000000000000000E+000
gfr> area gll 3.876668690154838E-009
Drilling down a little more, I added writes in this function:
function gfr_f_get_area(ie, i, j) result(area)
  ! Return the area associated with FV point (i,j) in element ie.
  integer, intent(in) :: ie, i, j
  real(kind=real_kind) :: area
  integer :: k
  k = gfr%nphys*(j-1) + i
  write(*,'(a,i8,es28.15)') "ndk gfr_f_get_area gfr%fv_metdet(k,ie)=", &
       k, gfr%fv_metdet(k,ie)
  area = gfr%w_ff(k)*gfr%fv_metdet(k,ie)
end function gfr_f_get_area
A job on the bad node produces different values than a job on a normal node. With 96 MPI ranks, it's always rank 2 that differs. Nothing is obviously wrong in the code -- what I've been trying to do is create a simple stand-alone reproducer, but I have been unable to, as the stand-alone tests always come out fine. I have only seen the issue with the full E3SM app.
I did a little more debugging (and kept trying to make a stand-alone reproducer) before admitting defeat. It looks like something happens to the values in an array when a function is called -- but only in the optimized build, and apparently only on MPI rank 2 (for the 96-way case I was trying).
NERSC has moved node nid004324 to DEBUG state and will either run some more tests themselves or ask whether HPE can.
With the test SMS_Ld2.ne30pg2_r05_IcoswISC30E3r5.BGCEXP_CNTL_CNPRDCTC_1850.pm-cpu_intel.elm-bgcexp, I see the following error, and I'm pretty sure I've seen this same error (with the same or a similar test) before, which may suggest an intermittent issue.
I see Rob also ran into https://github.com/E3SM-Project/E3SM/issues/6192