E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

Hang in `dp_coupling::d_p_coupling` with newer module versions and compilers (Intel version 2023.2.0, GNU version 12.3) #6451

Open ndkeen opened 1 month ago

ndkeen commented 1 month ago

I'm trying to update module versions on pm-cpu and have hit a few issues. One with Intel is that this test hangs in init: SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.muller-cpu_intel.allactive-wcprodssp. I'm noting the hang in HOMME, but since I don't know the root cause, the issue may not actually be there. The test works with the current Intel version (intel/2023.1.0); what I'd like to use is the new default for the machine (intel/2023.2.0).

The last lines in cpl.log show the run is still in init:

(seq_mct_drv) : Calling atm_init_mct phase 2
(component_init_cc:mct) : Initialize component atm

Here is where the stack is on a compute node:

#0  cxi_eq_peek_event (eq=0x22e12dc8) at /usr/include/cxi_prov_hw.h:1531
#1  cxip_ep_ctrl_eq_progress (ep_obj=0x22e25790, ctrl_evtq=0x22e12dc8, tx_evtq=true, ep_obj_locked=true) at prov/cxi/src/cxip_ctrl.c:318
#2  0x00001503828591dd in cxip_ep_progress (fid=<optimized out>) at prov/cxi/src/cxip_ep.c:186
#3  0x000015038285e969 in cxip_util_cq_progress (util_cq=0x22e15220) at prov/cxi/src/cxip_cq.c:112
#4  0x000015038283a301 in ofi_cq_readfrom (cq_fid=0x22e15220, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:232
#5  0x00001503860fa0f2 in MPIR_Wait_impl () from /opt/cray/pe/lib64/libmpi_intel.so.12
#6  0x0000150386c9b926 in MPIC_Wait () from /opt/cray/pe/lib64/libmpi_intel.so.12
#7  0x0000150386ca7685 in MPIC_Sendrecv () from /opt/cray/pe/lib64/libmpi_intel.so.12
#8  0x0000150386bd232d in MPIR_Alltoall_intra_brucks () from /opt/cray/pe/lib64/libmpi_intel.so.12
#9  0x00001503855bee8a in MPIR_Alltoall_intra_auto.part.0 () from /opt/cray/pe/lib64/libmpi_intel.so.12
#10 0x00001503855bf05c in MPIR_Alltoall_impl () from /opt/cray/pe/lib64/libmpi_intel.so.12
#11 0x00001503855bf83f in PMPI_Alltoall () from /opt/cray/pe/lib64/libmpi_intel.so.12
#12 0x0000150387c4364e in pmpi_alltoall__ () from /opt/cray/pe/lib64/libmpifort_intel.so.12
#13 0x0000000000bcad8f in mpialltoallint (sendbuf=..., sendcnt=1, recvbuf=..., recvcnt=1, comm=-1006632954) at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/control/wrap_mpi.F90:1143
#14 0x0000000002b93c02 in phys_grid::transpose_block_to_chunk (record_size=88, block_buffer=<error reading variable: value requires 2509056 bytes, which is more than max-value-size>, chunk_buffer=<error reading variable: value requires 2452032 bytes, which is more than max-value-size>,
    window=<error reading variable: Cannot access memory at address 0x0>) at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/physics/cam/phys_grid.F90:4137
#15 0x0000000005304965 in dp_coupling::d_p_coupling (phys_state=..., phys_tend=..., pbuf2d=0x26500aa0, dyn_out=...) at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/dp_coupling.F90:242
#16 0x0000000003719020 in stepon::stepon_run1 (dtime_out=1800, phys_state=..., phys_tend=..., pbuf2d=0x26500aa0, dyn_in=..., dyn_out=...) at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/stepon.F90:244
#17 0x0000000000948d7c in cam_comp::cam_run1 (cam_in=..., cam_out=...) at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/control/cam_comp.F90:251
#18 0x0000000000905530 in atm_comp_mct::atm_init_mct (eclock=..., cdata_a=..., x2a_a=..., a2x_a=..., nlfilename=..., .tmp.NLFILENAME.len_V$5bab=6) at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/cpl/atm_comp_mct.F90:499
#19 0x00000000004a7045 in component_mod::component_init_cc (eclock=..., comp=..., infodata=..., nlfilename=..., seq_flds_x2c_fluxes=..., seq_flds_c2x_fluxes=..., .tmp.NLFILENAME.len_V$7206=6, .tmp.SEQ_FLDS_X2C_FLUXES.len_V$7209=4096, .tmp.SEQ_FLDS_C2X_FLUXES.len_V$720c=4096)
    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/component_mod.F90:257
#20 0x000000000045d9d6 in cime_comp_mod::cime_init () at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_comp_mod.F90:2370
#21 0x000000000049dfc2 in cime_driver () at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_driver.F90:122

Above, I pasted results from running on muller-cpu, but I can see the same behavior on pm-cpu (it just needs the updated module versions).

I made a copy of the case on PSCRATCH in case someone wants to look at the logs:

/pscratch/sd/n/ndk/SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.muller-cpu_intel.allactive-wcprodssp.20240529_092039_q82rr

I would like to try this test with other compilers, but we currently have a segfault with GNU: https://github.com/E3SM-Project/E3SM/issues/6428

ndkeen commented 1 month ago

After adding a temporary workaround for the GNU issue noted above, I can now run with a GNU-built exe. It suffers the same fate: it hangs in what looks like the same place. SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.muller-cpu_gnu.allactive-wcprodssp

Also, I can still see the hang without the test modifier, for both intel and gnu:

SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370

Without DEBUG, the test completes (for both intel and gnu).

ndkeen commented 1 month ago

Since there appears to be a difference in behavior between DEBUG and OPT, I'm trying a few different things. If I stay with DEBUG but simplify the flags to only use -O -g, I actually get a different error, which, if real, might be good to track:

213: SHR_REPROSUM_CALC: Input contains  0.10000E+01 NaNs and  0.00000E+00 INFs on MPI task     213
213:  ERROR: shr_reprosum_calc ERROR: NaNs or INFs in input
213: #0  0x14891a423372 in ???
213: #1  0x23f19fc in __shr_abort_mod_MOD_shr_abort_backtrace
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/share/util/shr_abort_mod.F90:104
213: #2  0x23f1b83 in __shr_abort_mod_MOD_shr_abort_abort
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/share/util/shr_abort_mod.F90:61
213: #3  0x24361d5 in __shr_reprosum_mod_MOD_shr_reprosum_calc
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/share/util/shr_reprosum_mod.F90:644
213: #4  0xc6f638 in __global_norms_mod_MOD_wrap_repro_sum
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/share/global_norms_mod.F90:864
213: #5  0xcc31e5 in __prim_state_mod_MOD_prim_printstate
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/prim_state_mod.F90:216
213: #6  0xc8a5e3 in __prim_driver_base_MOD_prim_init2
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/share/prim_driver_base.F90:1033
213: #7  0xf3a909 in __dyn_comp_MOD_dyn_init2
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/dyn_comp.F90:380
213: #8  0xc352fe in __inital_MOD_cam_initial
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/inital.F90:73
213: #9  0x520eb3 in __cam_comp_MOD_cam_init
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/control/cam_comp.F90:162
213: #10  0x51aad1 in __atm_comp_mct_MOD_atm_init_mct
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/cpl/atm_comp_mct.F90:371
213: #11  0x489151 in __component_mod_MOD_component_init_cc
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/component_mod.F90:258
213: #12  0x477ef1 in __cime_comp_mod_MOD_cime_init
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_comp_mod.F90:1488
213: #13  0x4866dc in cime_driver
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_driver.F90:122
213: #14  0x4866dc in main
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_driver.F90:23
213: MPICH ERROR [Rank 213] [job id 692934.0] [Wed May 29 16:50:38 2024] [nid200068] - Abort(1001) (rank 213 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 213

I also tried running with OPT but without -O2, which completed. This was all with gnu, using a test like SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.muller-cpu_gnu

ndkeen commented 1 month ago

Adjusting compiler flags, I was able to get a stack trace, which may or may not be the same issue.

391: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
391: 
391: Backtrace for this error:
391: #0  0x145ddf423372 in ???
391: #1  0x145ddf422505 in ???
391: #2  0x145dde851dbf in ???
391: #3  0xcb15ec in __eos_MOD_pnh_and_exner_from_eos2
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/eos.F90:121
391: #4  0xcb238f in __eos_MOD_pnh_and_exner_from_eos
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/eos.F90:74
391: #5  0xcaea84 in __element_ops_MOD_tests_finalize
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/element_ops.F90:723
391: #6  0xcb068f in __element_ops_MOD_set_thermostate
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/element_ops.F90:489
391: #7  0xf32265 in __inidat_MOD_read_inidat
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/inidat.F90:674
391: #8  0xd3684e in __startup_initialconds_MOD_initial_conds
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/control/startup_initialconds.F90:54
391: #9  0xc34dd7 in __inital_MOD_cam_initial
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/inital.F90:67
391: #10  0x5209c9 in __cam_comp_MOD_cam_init
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/control/cam_comp.F90:162
391: #11  0x51a5e7 in __atm_comp_mct_MOD_atm_init_mct
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/cpl/atm_comp_mct.F90:371
391: #12  0x48903b in __component_mod_MOD_component_init_cc
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/component_mod.F90:258
391: #13  0x477dd1 in __cime_comp_mod_MOD_cime_init
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_comp_mod.F90:1488
391: #14  0x4865c6 in cime_driver
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_driver.F90:122
391: #15  0x4865c6 in main
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_driver.F90:23

components/homme/src/theta-l/share/eos.F90

  ! check for bad state that will crash exponential function below
  if (theta_hydrostatic_mode) then
    ierr= any(dp3d(:,:,:) < 0 ) ! <-- line 121
  else
    ierr= any(vtheta_dp(:,:,:) < 0 )  .or. &
          any(dp3d(:,:,:) < 0 ) .or. &
          any(dphi(:,:,:) > 0 )
  endif

ndkeen commented 1 month ago

With a slightly different flag variation I see this error:

  2:  bad state in EOS, called from: not specified
  2:  bad i,j,k=           1           4          42
  2:  vertical column: dphi,dp3d,vtheta_dp
  2:   1           NaN        4.7223    15234.5361
  2:   2           NaN        6.9384    21262.8696
  2:   3           NaN       10.1644    28133.1975
  2:   4           NaN       14.8259    35098.1232
  2:   5           NaN       21.4903    46351.4278
  2:   6           NaN       30.8760    62234.7248
  2:   7           NaN       43.8245    79485.2115
  2:   8           NaN       61.2040    99465.7967
  2:   9           NaN       83.7154   122918.9372
  2:  10           NaN      111.5971   147073.4247
  2:  11           NaN      144.2831   166819.3864
  2:  12           NaN      180.1535   182548.8554

...

  2:  ERROR: EOS bad state: d(phi), dp3d or vtheta_dp < 0
  2: #0  0x1470f0c23372 in ???
  2: #1  0x23c0c04 in __shr_abort_mod_MOD_shr_abort_backtrace
  2:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/share/util/shr_abort_mod.F90:104
  2: #2  0x23c0d8b in __shr_abort_mod_MOD_shr_abort_abort
  2:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/share/util/shr_abort_mod.F90:61
  2: #3  0x57673b in __cam_abortutils_MOD_endrun
  2:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/utils/cam_abortutils.F90:60
  2: #4  0xc7f2c5 in __parallel_mod_MOD_abortmp
  2:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/share/parallel_mod.F90:278
  2: #5  0xcb1923 in __eos_MOD_pnh_and_exner_from_eos2
  2:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/eos.F90:140
  2: #6  0xcb2325 in __eos_MOD_pnh_and_exner_from_eos
  2:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/eos.F90:74
  2: #7  0xcaea1a in __element_ops_MOD_tests_finalize
  2:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/element_ops.F90:723
  2: #8  0xcb0625 in __element_ops_MOD_set_thermostate
  2:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/element_ops.F90:489
  2: #9  0xf321fb in __inidat_MOD_read_inidat
  2:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/inidat.F90:674
  2: #10  0xd367e4 in __startup_initialconds_MOD_initial_conds
  2:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/control/startup_initialconds.F90:54
  2: #11  0xc34d6d in __inital_MOD_cam_initial
  2:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/inital.F90:67

/mscratch/sd/n/ndk/e3sm_scratch/muller-cpu/mullerup/SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.muller-cpu_gnu.NDEBUG-Og/run/e3sm.log.692939.240529-174020

ndkeen commented 4 weeks ago

I've been adjusting compiler flags in an attempt to debug. With -g -O -DNDEBUG -ffpe-trap=invalid,zero, I'm able to run a DEBUG case and get a quick FPE. Adding checks on arrays higher up, I see there are issues with the data just after it is read in from file. The existing test on that data does not catch the issue because a mask is involved. So I don't know if the issue is with the data, the mask, or how the rest of the code expects/uses the data.

components/eam/src/dynamics/se/inidat.F90

    fieldname = 'PS'
    tmp(:,:,:) = 0.0_r8 ! ndk try further init
    tmp(:,1,:) = 0.0_r8
    call t_startf('read_inidat_infld')
    if (.not. scm_multcols) then
      call infld(fieldname, ncid_ini, ncol_name,      &
           1, npsq, 1, nelemd, tmp(:,1,:), found, gridname=grid_name)
    else
      call infld(fieldname, ncid_ini, ncol_name,      &
           1, 1, 1, 1, tmp(:,1,:), found, gridname=grid_name)
    endif
    call t_stopf('read_inidat_infld')
    if(.not. found) then
       call endrun('Could not find PS field on input datafile')
    end if

    ! Check read-in data to make sure it is in the appropriate units
    allocate(tmpmask(npsq,nelemd))
    tmpmask = (reshape(ldof, (/npsq,nelemd/)) /= 0)

    if(minval(tmp(:,1,:), mask=tmpmask) < 10000._r8 .and. .not. scm_multcols) then
       call endrun('Problem reading ps field')
    end if

    ierr= any(tmp(1,1,:) < 0.0 ) !ndk  this test will catch FPE's that will later have issues
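
To illustrate why a masked range check like the one above can pass while bad values are still present, here is a minimal standalone sketch. It is not code from inidat.F90: the program name, the toy array sizes, and the planted NaN are all made up for illustration, and it assumes (without evidence from the logs) that the bad values sit at points the mask excludes. It compares the masked minval test with an explicit NaN count using the standard ieee_arithmetic module.

```fortran
program masked_nan_sketch
  ! Hypothetical standalone example; names and sizes are illustrative only.
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan, ieee_value, ieee_quiet_nan
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  integer, parameter :: npsq = 4, nelemd = 2   ! toy sizes, not the model's
  real(r8) :: ps(npsq, nelemd)
  logical  :: mask(npsq, nelemd)

  ! Fill with a plausible surface pressure (Pa) and include every point.
  ps   = 100000.0_r8
  mask = .true.

  ! Assumption for this sketch: one NaN at a point the mask excludes.
  ps(1, 2)   = ieee_value(ps(1, 2), ieee_quiet_nan)
  mask(1, 2) = .false.

  ! Same shape as the masked range check in the thread: only masked-in
  ! points are examined, so the planted NaN is never looked at.
  if (minval(ps, mask=mask) < 10000.0_r8) then
     print *, 'masked minval check: problem detected'
  else
     print *, 'masked minval check: passed'
  end if

  ! Explicit NaN counts, with and without the mask.
  print *, 'NaNs at masked-in points:', count(ieee_is_nan(ps) .and. mask)
  print *, 'NaNs anywhere in array:  ', count(ieee_is_nan(ps))
end program masked_nan_sketch
```

In this toy case the masked check passes and only the unmasked count reports the NaN. Comparing the two counts on the real PS data right after the infld call might help separate "bad data at points the mask skips" from "bad data at points the model actually uses", though that is only a suggestion for narrowing things down.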