E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

Runtime error with nvidia compiler on pm-cpu after recent PR #6350

Closed: ndkeen closed this issue 2 weeks ago

ndkeen commented 1 month ago

We had some build errors in the last week or so with the nvidia compiler that were corrected (https://github.com/E3SM-Project/E3SM/issues/6332), but now we see runtime errors during init with several tests on pm-cpu. For example:

SMS_D.ne4pg2_oQU480.F2010.pm-cpu_nvidia.eam-cosplite
SMS.ne4pg2_oQU480.F2010.pm-cpu_nvidia.eam-cosplite
SMS_D.ne4pg2_oQU480.F2010.pm-cpu_nvidia
SMS.ne4pg2_oQU480.F2010.pm-cpu_nvidia

The same happens with 1 MPI task:
SMS_D_P1x1.ne4pg2_oQU480.F2010.pm-cpu_nvidia

Checking out different hashes, I see this issue started happening after https://github.com/E3SM-Project/E3SM/pull/6311

The error message is not very useful, even with DEBUG:

 0: (seq_comm_printcomms)    38     0    96     1  CPLIAC:
 1: MPIIO WARNING: DVS stripe width of 24 was requested but DVS set it to 1
 1: See MPICH_MPIIO_DVS_MAXNODES in the intro_mpi man page.
 1: MPIIO WARNING: DVS stripe width of 24 was requested but DVS set it to 1
 1: See MPICH_MPIIO_DVS_MAXNODES in the intro_mpi man page.
srun: error: nid004342: tasks 0-95: Bus error
srun: Terminating StepId=24439248.0

I also tried the default version of the nvidia compiler and still see the same issue: we currently use 22.7, and I just tried 23.9.

ndkeen commented 1 month ago

If I let it write core files, the backtrace shows:

#0  0x000000000330e2f0 in phys_grid_ctem::phys_grid_ctem_reg () at /global/cfs/cdirs/e3sm/ndk/repos/me25-apr15/components/eam/src/physics/cam/phys_grid_ctem.F90:137
#1  0x0000000003761032 in inital::cam_initial (dyn_in=..., dyn_out=..., nlfilename=...) at /global/cfs/cdirs/e3sm/ndk/repos/me25-apr15/components/eam/src/dynamics/se/inital.F90:47
#2  0x00000000027fb070 in cam_comp::cam_init (cam_out=0x0, cam_in=0x0, stop_ymd=10106, stop_tod=0) at /global/cfs/cdirs/e3sm/ndk/repos/me25-apr15/components/eam/src/control/cam_comp.F90:162
#3  0x00000000027dd6bc in atm_comp_mct::atm_init_mct (eclock=..., cdata_a=..., x2a_a=..., a2x_a=..., nlfilename=...) at /global/cfs/cdirs/e3sm/ndk/repos/me25-apr15/components/eam/src/cpl/atm_comp_mct.F90:369
#4  0x0000000000936996 in component_mod::component_init_cc (eclock=..., comp=..., comp_init=-443987883, infodata=..., nlfilename=..., seq_flds_x2c_fluxes=..., seq_flds_c2x_fluxes=...) at /global/cfs/cdirs/e3sm/ndk/repos/me25-apr15/driver-mct/main/component_mod.F90:257
#5  0x00000000008fcaba in cime_comp_mod::cime_init () at /global/cfs/cdirs/e3sm/ndk/repos/me25-apr15/driver-mct/main/cime_comp_mod.F90:1488
#6  0x000000000093394f in cime_driver () at /global/cfs/cdirs/e3sm/ndk/repos/me25-apr15/driver-mct/main/cime_driver.F90:122
ndkeen commented 3 weeks ago

I have been debugging this some more. In the newly added subroutine, we see:

  subroutine phys_grid_ctem_reg
    !...
    real(r8) :: zalats(nzalat)
    !....
    if (.not. do_tem_diags) return
    !...  zalats array actually used

where nzalat is initialized to -huge(1), so this routine is handed that value to size the automatic array zalats even though it returns right away.

Even though this is a bit awkward, it should still be OK: the standard says an automatic array with a negative extent gets size 0, and other compilers are fine with it. I then noticed that with nvidia we set -Mstack_arrays; without this flag, the failing cases are able to run, so it is likely a compiler bug. One fairly easy work-around is to disable that flag for the one Fortran unit. Since E3SM can only add flags conditionally, not remove them, I can add -Mnostack_arrays. Other work-arounds would be to only call these routines when do_tem_diags=.true., or to set the initial value of nzalat to something like 0.
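For illustration, here is a minimal sketch of what those last two ideas could look like combined (the module name phys_grid_ctem_fix and its body are hypothetical, loosely modeled on the reproducer below, not on the actual phys_grid_ctem.F90): nzalat starts at 0, and the automatic array is only sized after the do_tem_diags check by declaring it inside a BLOCK construct.

module phys_grid_ctem_fix
  implicit none
  private
  integer, parameter :: r8 = selected_real_kind(12) ! mirrors shr_kind_r8
  ! Start from 0 rather than -huge(1) so any array sized by nzalat has a
  ! non-negative extent even if the routine is entered before setup.
  integer :: nzalat = 0
  logical :: do_tem_diags = .false.
  public :: phys_grid_ctem_reg
contains
  subroutine phys_grid_ctem_reg
    ! Bail out before any automatic array exists; the array is declared
    ! inside a BLOCK, so it is only created when diagnostics are enabled.
    if (.not. do_tem_diags) return
    block
      real(r8) :: zalats(nzalat)
      zalats = 0.0_r8
      ! ... compute the zonal-mean latitudes and register the grid here ...
    end block
  end subroutine phys_grid_ctem_reg
end module phys_grid_ctem_fix

With either change, the stack-allocated automatic array never sees the -huge(1) extent on the early-return path, which is the case that appears to trip -Mstack_arrays.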

I also created a reproducer.

MODULE shr_kind_mod
   public
   integer,parameter :: SHR_KIND_R8 = selected_real_kind(12) ! 8 byte real
   integer,parameter :: SHR_KIND_R4 = selected_real_kind( 6) ! 4 byte real
   integer,parameter :: SHR_KIND_RN = kind(1.0)              ! native real
   integer,parameter :: SHR_KIND_I8 = selected_int_kind (13) ! 8 byte integer
   integer,parameter :: SHR_KIND_I4 = selected_int_kind ( 6) ! 4 byte integer
   integer,parameter :: SHR_KIND_IN = kind(1)                ! native integer

END MODULE shr_kind_mod

module phys_grid_ctem
  use shr_kind_mod, only : r8 => shr_kind_r8

  implicit none
  private
  integer :: nzalat = -huge(1)
  logical :: do_tem_diags = .false.

  public :: phys_grid_ctem_reg
contains
  subroutine phys_grid_ctem_reg
    real(r8) :: zalats(nzalat)
    real(r8) :: z1(nzalat), z2(nzalat), z3(nzalat), z4(nzalat), z5(nzalat)
    integer :: j
    print*, "nzalat=", nzalat
    print*, "size(zalats)", size(zalats)
    if (.not. do_tem_diags) return
    ! actually use zalats
    do j = 1,nzalat
       zalats(j) = 1.0+zalats(j)
    enddo
  end subroutine phys_grid_ctem_reg
end module phys_grid_ctem

program boop
  use phys_grid_ctem
  implicit none

  call phys_grid_ctem_reg
  print*, "Done"
end program boop

Compiling with the following should fail:

nvfortran -i4 -Mstack_arrays -Mextend -byteswapio -Mflushz -Kieee -Mallocatable=03 -traceback -O0 -g -Ktrap=fp -Mbounds -Kieee -Mfree arrayallocate-oddvalue.f90
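For comparison, and consistent with the observation above that the failing cases run once -Mstack_arrays is dropped, the same command with that flag removed (or, presumably, with -Mnostack_arrays appended after it, assuming the later flag wins) is expected to run and print a zero size for zalats:

nvfortran -i4 -Mextend -byteswapio -Mflushz -Kieee -Mallocatable=03 -traceback -O0 -g -Ktrap=fp -Mbounds -Kieee -Mfree arrayallocate-oddvalue.f90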