Open ndkeen opened 1 month ago
After adding a temporary workaround for the GNU issue noted above, I can now run with a GNU-built executable. It suffers the same fate: it hangs in what looks like the same place.
SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.muller-cpu_gnu.allactive-wcprodssp
I can also still see the hang without the test modifier, for both intel and gnu:
SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370
Without DEBUG, the test completes (for both intel and gnu).
Since behavior appears to differ between DEBUG and OPT, I'm trying a few different things. If I stay with DEBUG but simplify the flags to only -O -g, I actually get a different error, which, if real, might be good to track:
213: SHR_REPROSUM_CALC: Input contains 0.10000E+01 NaNs and 0.00000E+00 INFs on MPI task 213
213: ERROR: shr_reprosum_calc ERROR: NaNs or INFs in input
213: #0 0x14891a423372 in ???
213: #1 0x23f19fc in __shr_abort_mod_MOD_shr_abort_backtrace
213: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/share/util/shr_abort_mod.F90:104
213: #2 0x23f1b83 in __shr_abort_mod_MOD_shr_abort_abort
213: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/share/util/shr_abort_mod.F90:61
213: #3 0x24361d5 in __shr_reprosum_mod_MOD_shr_reprosum_calc
213: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/share/util/shr_reprosum_mod.F90:644
213: #4 0xc6f638 in __global_norms_mod_MOD_wrap_repro_sum
213: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/share/global_norms_mod.F90:864
213: #5 0xcc31e5 in __prim_state_mod_MOD_prim_printstate
213: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/prim_state_mod.F90:216
213: #6 0xc8a5e3 in __prim_driver_base_MOD_prim_init2
213: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/share/prim_driver_base.F90:1033
213: #7 0xf3a909 in __dyn_comp_MOD_dyn_init2
213: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/dyn_comp.F90:380
213: #8 0xc352fe in __inital_MOD_cam_initial
213: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/inital.F90:73
213: #9 0x520eb3 in __cam_comp_MOD_cam_init
213: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/control/cam_comp.F90:162
213: #10 0x51aad1 in __atm_comp_mct_MOD_atm_init_mct
213: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/cpl/atm_comp_mct.F90:371
213: #11 0x489151 in __component_mod_MOD_component_init_cc
213: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/component_mod.F90:258
213: #12 0x477ef1 in __cime_comp_mod_MOD_cime_init
213: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_comp_mod.F90:1488
213: #13 0x4866dc in cime_driver
213: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_driver.F90:122
213: #14 0x4866dc in main
213: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_driver.F90:23
213: MPICH ERROR [Rank 213] [job id 692934.0] [Wed May 29 16:50:38 2024] [nid200068] - Abort(1001) (rank 213 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 213
I also tried running with OPT but without -O2, which completed.
This was all with gnu using a test like SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.muller-cpu_gnu
After adjusting compiler flags, I was able to get a stack trace, which may or may not be the same issue:
391: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
391:
391: Backtrace for this error:
391: #0 0x145ddf423372 in ???
391: #1 0x145ddf422505 in ???
391: #2 0x145dde851dbf in ???
391: #3 0xcb15ec in __eos_MOD_pnh_and_exner_from_eos2
391: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/eos.F90:121
391: #4 0xcb238f in __eos_MOD_pnh_and_exner_from_eos
391: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/eos.F90:74
391: #5 0xcaea84 in __element_ops_MOD_tests_finalize
391: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/element_ops.F90:723
391: #6 0xcb068f in __element_ops_MOD_set_thermostate
391: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/element_ops.F90:489
391: #7 0xf32265 in __inidat_MOD_read_inidat
391: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/inidat.F90:674
391: #8 0xd3684e in __startup_initialconds_MOD_initial_conds
391: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/control/startup_initialconds.F90:54
391: #9 0xc34dd7 in __inital_MOD_cam_initial
391: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/inital.F90:67
391: #10 0x5209c9 in __cam_comp_MOD_cam_init
391: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/control/cam_comp.F90:162
391: #11 0x51a5e7 in __atm_comp_mct_MOD_atm_init_mct
391: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/cpl/atm_comp_mct.F90:371
391: #12 0x48903b in __component_mod_MOD_component_init_cc
391: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/component_mod.F90:258
391: #13 0x477dd1 in __cime_comp_mod_MOD_cime_init
391: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_comp_mod.F90:1488
391: #14 0x4865c6 in cime_driver
391: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_driver.F90:122
391: #15 0x4865c6 in main
391: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_driver.F90:23
components/homme/src/theta-l/share/eos.F90
! check for bad state that will crash exponential function below
if (theta_hydrostatic_mode) then
   ierr= any(dp3d(:,:,:) < 0 ) ! <-- line 121
else
   ierr= any(vtheta_dp(:,:,:) < 0 ) .or. &
         any(dp3d(:,:,:) < 0 ) .or. &
         any(dphi(:,:,:) > 0 )
endif
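A note on the check above, offered as a sketch rather than a diagnosis. An ordered comparison such as x < 0 is false when x is NaN, so without trapping this check can miss NaNs. With invalid trapping enabled (for example -ffpe-trap=invalid), the comparison itself can raise SIGFPE, which would be consistent with the trap landing on line 121. A NaN-aware variant tests ieee_is_nan first and only compares when no NaN is present. The function name has_bad_values and the toy array are hypothetical; ieee_is_nan and ieee_value come from the standard ieee_arithmetic module.

```fortran
program nan_aware_check
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan, ieee_value, ieee_quiet_nan
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8) :: dp3d(2,2,3)        ! toy stand-in, not the model array
  logical  :: clean_bad, nan_bad

  dp3d = 5.0_r8
  clean_bad = has_bad_values(dp3d)

  ! Plant a single quiet NaN
  dp3d(1,2,3) = ieee_value(1.0_r8, ieee_quiet_nan)
  nan_bad = has_bad_values(dp3d)

  print *, 'clean array flagged: ', clean_bad
  print *, 'NaN array flagged:   ', nan_bad

  ! Self-check: the clean array must pass and the NaN array must be flagged
  if (clean_bad .or. .not. nan_bad) error stop 'self-check failed'

contains

  ! Hypothetical helper: true if x holds a NaN or a negative value.
  ! The NaN test runs first so that no ordered comparison ever sees a NaN.
  logical function has_bad_values(x)
    real(r8), intent(in) :: x(:,:,:)
    if (any(ieee_is_nan(x))) then
      has_bad_values = .true.
    else
      has_bad_values = any(x < 0.0_r8)
    end if
  end function has_bad_values

end program nan_aware_check
```

Whether a change like this belongs in eos.F90 is a separate question; here it only shows how a bad-state check can be made to report NaNs instead of trapping on them.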
With a slightly different flag variation, I see this error:
2: bad state in EOS, called from: not specified
2: bad i,j,k= 1 4 42
2: vertical column: dphi,dp3d,vtheta_dp
2: 1 NaN 4.7223 15234.5361
2: 2 NaN 6.9384 21262.8696
2: 3 NaN 10.1644 28133.1975
2: 4 NaN 14.8259 35098.1232
2: 5 NaN 21.4903 46351.4278
2: 6 NaN 30.8760 62234.7248
2: 7 NaN 43.8245 79485.2115
2: 8 NaN 61.2040 99465.7967
2: 9 NaN 83.7154 122918.9372
2: 10 NaN 111.5971 147073.4247
2: 11 NaN 144.2831 166819.3864
2: 12 NaN 180.1535 182548.8554
...
2: ERROR: EOS bad state: d(phi), dp3d or vtheta_dp < 0
2: #0 0x1470f0c23372 in ???
2: #1 0x23c0c04 in __shr_abort_mod_MOD_shr_abort_backtrace
2: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/share/util/shr_abort_mod.F90:104
2: #2 0x23c0d8b in __shr_abort_mod_MOD_shr_abort_abort
2: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/share/util/shr_abort_mod.F90:61
2: #3 0x57673b in __cam_abortutils_MOD_endrun
2: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/utils/cam_abortutils.F90:60
2: #4 0xc7f2c5 in __parallel_mod_MOD_abortmp
2: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/share/parallel_mod.F90:278
2: #5 0xcb1923 in __eos_MOD_pnh_and_exner_from_eos2
2: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/eos.F90:140
2: #6 0xcb2325 in __eos_MOD_pnh_and_exner_from_eos
2: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/eos.F90:74
2: #7 0xcaea1a in __element_ops_MOD_tests_finalize
2: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/element_ops.F90:723
2: #8 0xcb0625 in __element_ops_MOD_set_thermostate
2: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/element_ops.F90:489
2: #9 0xf321fb in __inidat_MOD_read_inidat
2: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/inidat.F90:674
2: #10 0xd367e4 in __startup_initialconds_MOD_initial_conds
2: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/control/startup_initialconds.F90:54
2: #11 0xc34d6d in __inital_MOD_cam_initial
2: at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/inital.F90:67
/mscratch/sd/n/ndk/e3sm_scratch/muller-cpu/mullerup/SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.muller-cpu_gnu.NDEBUG-Og/run/e3sm.log.692939.240529-174020
I've been adjusting compiler flags in an attempt to debug. With -g -O -DNDEBUG -ffpe-trap=invalid,zero, I'm able to run a DEBUG case and get a quick FPE. After adding checks on arrays higher up, I see there are issues with the data just after it is read in from the file. The existing test on that data does not catch the problem because a mask is involved. So I don't know whether the issue is with the data, the mask, or how the rest of the code expects/uses the data.
components/eam/src/dynamics/se/inidat.F90
fieldname = 'PS'
tmp(:,:,:) = 0.0_r8 ! ndk try further init
tmp(:,1,:) = 0.0_r8
call t_startf('read_inidat_infld')
if (.not. scm_multcols) then
   call infld(fieldname, ncid_ini, ncol_name, &
        1, npsq, 1, nelemd, tmp(:,1,:), found, gridname=grid_name)
else
   call infld(fieldname, ncid_ini, ncol_name, &
        1, 1, 1, 1, tmp(:,1,:), found, gridname=grid_name)
endif
call t_stopf('read_inidat_infld')
if(.not. found) then
   call endrun('Could not find PS field on input datafile')
end if
! Check read-in data to make sure it is in the appropriate units
allocate(tmpmask(npsq,nelemd))
tmpmask = (reshape(ldof, (/npsq,nelemd/)) /= 0)
if(minval(tmp(:,1,:), mask=tmpmask) < 10000._r8 .and. .not. scm_multcols) then
   call endrun('Problem reading ps field')
end if
ierr= any(tmp(1,1,:) < 0.0 ) !ndk this test will catch FPE's that will later have issues
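The masked range test above can pass while bad values sit at points the mask excludes. A minimal standalone sketch of that effect follows, using made-up sizes and values: npsq, nelemd, ps and mask here are toy stand-ins, not the model's arrays. It also counts NaNs explicitly, both over the whole array and over masked-in points only.

```fortran
program masked_check_misses_nan
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan, ieee_value, ieee_quiet_nan
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  integer, parameter :: npsq = 4, nelemd = 3      ! toy sizes (assumption)
  real(r8) :: ps(npsq, nelemd)
  logical  :: mask(npsq, nelemd)
  logical  :: range_ok
  integer  :: n_nan_all, n_nan_masked

  ps   = 100000.0_r8
  mask = .true.

  ! Plant one NaN at a point the mask excludes
  ps(2,3)   = ieee_value(1.0_r8, ieee_quiet_nan)
  mask(2,3) = .false.

  ! Same form as the range test in read_inidat: only masked-in points are examined
  range_ok = .not. (minval(ps, mask=mask) < 10000._r8)

  ! Explicit NaN counts, with and without the mask
  n_nan_all    = count(ieee_is_nan(ps))
  n_nan_masked = count(ieee_is_nan(ps) .and. mask)

  print *, 'range check passed:    ', range_ok
  print *, 'NaNs in whole array:   ', n_nan_all
  print *, 'NaNs at masked-in pts: ', n_nan_masked

  ! Self-check: the range test passes even though one NaN is present
  if (.not. range_ok .or. n_nan_all /= 1 .or. n_nan_masked /= 0) &
    error stop 'self-check failed'
end program masked_check_misses_nan
```

Whether a NaN at a masked-out point is harmless depends on whether later code reads those points, which this thread has not established. Comparing the two counts right after the read would at least separate "bad data at real points" from "bad data only where the mask is false".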
I'm trying to update module versions on pm-cpu, but I have hit a few issues. One, with intel, is that this test hangs in init:
SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.muller-cpu_intel.allactive-wcprodssp
I'm noting the hang in HOMME, but as I don't know the root cause, the issue may not actually be there. The test works with the current intel version (intel/2023.1.0), and what I'd like to use is the new default for the machine (intel/2023.2.0).
We see this in cpl.log (to indicate it is still in init):
Looking at where the stack is on the compute node:
Above, I pasted results from running on muller-cpu, but I can see the same behavior on pm-cpu (just need to update the module versions).
I made a copy of the case on PSCRATCH in case someone wants to look at the logs:
I would like to try this test with other compilers, but we currently have a segfault with GNU https://github.com/E3SM-Project/E3SM/issues/6428