E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

NaNs in ieflx_gmean calc on KNLs with Intel compiler #2178

Closed amametjanov closed 6 years ago

amametjanov commented 6 years ago

Logging an issue to track down the location of this error. The NaN is in column 8 of chunk 127455.

SHR_REPROSUM_CALC: Input contains  0.10000E+01 NaNs and  0.00000E+00 INFs on process   20527
 ERROR: shr_reprosum_calc ERROR: NaNs or INFs in input
Image              PC                Routine            Line        Source
e3sm.exe           000000000312B45F  shr_abort_mod_mp_         114  shr_abort_mod.F90
e3sm.exe           000000000324F04E  shr_reprosum_mod_         428  shr_reprosum_mod.F90
e3sm.exe           00000000005BBA2B  phys_gmean_mp_gme         417  phys_gmean.F90
e3sm.exe           00000000010594C0  check_energy_mp_i         709  check_energy.F90
e3sm.exe           0000000000623B8F  physpkg_mp_phys_r        1210  physpkg.F90
e3sm.exe           0000000000505626  cam_comp_mp_cam_r         285  cam_comp.F90
e3sm.exe           00000000004F2287  atm_comp_mct_mp_a         501  atm_comp_mct.F90
e3sm.exe           0000000000429664  component_mod_mp_         728  component_mod.F90
e3sm.exe           000000000040F172  cime_comp_mod_mp_        3371  cime_comp_mod.F90
e3sm.exe           0000000000429370  MAIN__                    103  cime_driver.F90

Source of the NaN is in one of

singhbalwinder commented 6 years ago

Thanks @amametjanov for reporting this. I have seen this error before, but with a different E3SM configuration. Which compset and resolution are you using? Is this reproducible? If so, do you also get it with debug flags turned on? The hope is that the debug flags might reveal where the NaN first originates.

amametjanov commented 6 years ago

Yes, this is the second time I am seeing this with SMS_PXL.ne120_oRRS18v3_ICG.A_WCYCL1950S_CMIP6_HR.theta_intel.cam-cosplite. Debug runs are also in progress but have not run into this yet.

amametjanov commented 6 years ago

Looking through the history file (*.cam.rh0.0001-01-31-00000.nc), fields like cam_out(lchnk)%precsc(:ncol) have very small values: e.g.

    1.88079096131566e-40, 7.7582627154271e-40, 2.1084395886461e-84, 0,
    4.17619485951906e-56, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

Theta's machine precision is

 Single precision =  0.11920929E-06 or 2^-23
 Double precision =  0.22204460E-15 or 2^-52

Arithmetic operations on such small numbers could be what is producing the NaNs.
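
Machine epsilon (the quantity printed above) measures the relative spacing of numbers near 1; it is not the smallest magnitude a type can hold. A minimal Fortran sketch, not taken from the E3SM source, compares epsilon() with tiny() and checks how two of the values listed above behave. The program name is made up for illustration, and the sketch assumes the history values are stored in double precision.

```fortran
program tiny_vs_epsilon
  use, intrinsic :: iso_fortran_env, only: real32, real64
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan, ieee_is_normal
  implicit none
  real(real64) :: x, y

  ! epsilon() is the spacing of representable values near 1.0;
  ! tiny() is the smallest positive normal magnitude of the type.
  print *, 'real32: epsilon =', epsilon(1.0_real32), ' tiny =', tiny(1.0_real32)
  print *, 'real64: epsilon =', epsilon(1.0_real64), ' tiny =', tiny(1.0_real64)

  ! Two values copied from the precsc dump (assumed to be 64-bit reals)
  x = 1.88079096131566e-40_real64
  y = 2.1084395886461e-84_real64

  print *, 'x is a normal double:', ieee_is_normal(x)
  print *, 'x + y is NaN:', ieee_is_nan(x + y)
  print *, 'x * y is NaN:', ieee_is_nan(x * y)
end program tiny_vs_epsilon
```

If the values really are 64-bit, both lie far above tiny(1.0_real64) (about 2.2E-308), so sums and products of them are not expected to produce NaNs on their own.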

rljacob commented 6 years ago

Could those numbers be junk from an array that wasn't initialized to zero? I doubt small numbers like that are computed by the model.

amametjanov commented 6 years ago

It looks like they are being computed :) cam_in and cam_out arrays are initialized to 0 in components/cam/src/control/camsrfexch.F90. History files are here: e.g. /projects/ClimateEnergy_2/azamatm/SMS_Ld31_PXL.ne120_oRRS18v3_ICG.A_WCYCL1950S_CMIP6_HR.theta_intel.cam-cosplite.20180313_181914_l5lwlk/precsc.out (ncdump -t -v var1 *.nc).

singhbalwinder commented 6 years ago

We can catch the places where these small numbers are generated using the compiler's underflow flags. I have never seen them cause any issues in the past. I remember using such a flag a long time ago, and the code would crash very early on with underflow detection. The underflows always seemed harmless to me, but it may depend on the compiler, since some compilers automatically flush underflows to zero while others don't.
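
Independently of any compiler option, standard Fortran lets a program query the underflow flag directly. A minimal sketch using the intrinsic ieee_exceptions module follows; the program and variable names are illustrative, and whether the flag is supported (and whether subnormal results are flushed to zero) depends on the compiler and platform.

```fortran
program underflow_flag_demo
  use, intrinsic :: ieee_exceptions, only: ieee_underflow, ieee_support_flag, &
       ieee_set_flag, ieee_get_flag
  implicit none
  real, volatile :: a   ! volatile keeps the multiply from being folded at compile time
  real :: b
  logical :: raised

  if (.not. ieee_support_flag(ieee_underflow, b)) then
     print *, 'underflow flag not supported for default real'
     stop
  end if

  call ieee_set_flag(ieee_underflow, .false.)   ! clear any earlier underflow
  a = tiny(1.0)
  b = a * 1.0e-3   ! result falls below the smallest normal single-precision value
  call ieee_get_flag(ieee_underflow, raised)
  print *, 'b =', b, ' underflow raised:', raised
end program underflow_flag_demo
```

Wrapping a suspect block of model code between the ieee_set_flag and ieee_get_flag calls would be one way to narrow down where underflows occur without making them fatal.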

amametjanov commented 6 years ago

An additional data point about initialization. After the 4th step in the continued run, ATM has:

   Current step number:              3841
 ...
 nstep, te     3841   0.33411446843176217E+10   0.33411468040667338E+10   0.23448612056170461E-03   0.98507177590207240E+05
 nstep, te     3842   0.33411357703415484E+10   0.33411378602920594E+10   0.23118982348604186E-03   0.98507170438543981E+05
 nstep, te     3843   0.33411305601672096E+10   0.33411319091116810E+10   0.14921992066070427E-03   0.98507165635626530E+05
 nstep, te     3844   0.33411247440697255E+10   0.33411262608280902E+10   0.16778346050163977E-03   0.98507161897076017E+05

And these checks:

@@ -702,7 +702,29 @@ subroutine ieflx_gmean(state, tend, pbuf2d, cam_in, cam_out, nstep)
        case default
           call endrun('*** incorrect ieflx_opt ***')
        end select
-
+       do i=1,ncol
+         if (cam_in(lchnk)%cflx(i,1) /= cam_in(lchnk)%cflx(i,1)) then
+           write(iulog,*) 'NaN in clfx',i,lchnk,ncol,ieflx_opt,cam_in(lchnk)%cflx(:,1)
+         endif
+         if (cam_in(lchnk)%ts(i) /= cam_in(lchnk)%ts(i)) then
+           write(iulog,*) 'NaN in ts',i,lchnk,ncol,ieflx_opt,cam_in(lchnk)%ts(:)
+         endif
+         if (cam_out(lchnk)%precc(i) /= cam_out(lchnk)%precc(i)) then
+           write(iulog,*) 'NaN in precc',i,lchnk,ncol,ieflx_opt,cam_out(lchnk)%precc(:),'rain',rain(:,lchnk)
+         endif
+         if (cam_out(lchnk)%precl(i) /= cam_out(lchnk)%precl(i)) then
+           write(iulog,*) 'NaN in precl',i,lchnk,ncol,ieflx_opt,cam_out(lchnk)%precl(:),'rain',rain(:,lchnk)
+         endif
+         if (cam_out(lchnk)%precsc(i) /= cam_out(lchnk)%precsc(i)) then
+           write(iulog,*) 'NaN in precsc',i,lchnk,ncol,ieflx_opt,cam_out(lchnk)%precsc(:),'snow',snow(:,lchnk)
+         endif
+         if (cam_out(lchnk)%precsl(i) /= cam_out(lchnk)%precsl(i)) then
+           write(iulog,*) 'NaN in precsl',i,lchnk,ncol,ieflx_opt,cam_out(lchnk)%precsl(:),'snow',snow(:,lchnk)
+         endif
+         if (ienet(i,lchnk) /= ienet(i,lchnk)) then
+           write(iulog,*) 'NaN in ienet',i,lchnk,ncol,ieflx_opt,ienet(i,lchnk)
+         endif
+       enddo

show in e3sm.log:

(seq_domain_areafactinit) : min/max drv2mdl   0.999999923354153       1.00000007509469    areafact_o_OCN
 NaN in clfx           9      114015           9           2
  3.971921184855017E-005  1.642525045843797E-004  2.947733991644233E-005
  4.784900201901546E-005  7.489103276731714E-005  9.108095325323595E-006
  4.129502134087975E-006  2.144881251408060E-004                     NaN
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000
 NaN in ts           9      114015           9           2
   300.460276761804        295.163237366316        294.255842339761
   301.902920996472        299.266058591620        286.463401697433
   252.900251885251        288.404882510593                          NaN
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000
 NaN in precl           9      114015           9           2
  8.721439523793770E-010  6.797591931365890E-013  7.631128467863829E-011
  5.912707206395664E-020  3.992641589920015E-007  3.504465270537491E-010
  5.558511966756323E-011  1.501216548661270E-008                     NaN
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000 rain  3.849603926957925E-009  6.797591931365890E-013
  7.631128467863829E-011  5.912707206395664E-020  5.036713899589927E-007
  5.794703939578485E-008  0.000000000000000E+000  5.997597268333353E-010
                     NaN  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000
 NaN in precsl           9      114015           9           2
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  4.504836162751876E-021
  5.558511966756323E-011  1.441240575977937E-008                     NaN
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000 snow  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  4.504836162751876E-021  5.558511966756323E-011  1.635619849632815E-008
                     NaN  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000
 NaN in ienet           9      114015           9           2
                     NaN

So the arrays of size 16 are initialized to 0, but get NaNs in fields

  162 PRECL                            m/s                 1 A  Large-scale (stable) precipitation rate (liq + ice)
  168 PRECSL                           m/s                 1 A  Large-scale (stable) snow rate (water equivalent)
  174 QFLX                             kg/m2/s             1 A  Surface water flux
  188 TS                               K                   1 A  Surface temperature (radiative)

May need to check these fields in history files.

PeterCaldwell commented 6 years ago

@amametjanov - are you saying that these NaNs show up for the first time after step 4, or are they formed upon initialization and persist through step 4? Also, it looks like the problem is at the 9th level from the top of the domain... which is kind of weird. I could understand a problem at the top level, but how can levels above and below get assigned and an intermediate level have problems? One possibility is that cloud physics isn't applied at the top few levels of the domain. Last I checked, cloud physics in the top 7 levels is ignored: https://acme-climate.atlassian.net/wiki/spaces/ATM/pages/129511233/Does+trop+cloud+top+press+have+any+impact . I was thinking there could be a problem with stitching the cloudy and cloud-free parts of the model together. Another possibility is that there's a problem with the bottom of the sponge layer at the top of the model. @mt5555 - do you know how many layers are part of the sponge? I'm also curious whether these NaNs show up in low-res simulations. Could you re-run A_WCYCL1850S at ne30 with your code that has the added print statements, @amametjanov?

amametjanov commented 6 years ago

Yes, they show up for the first time after step 4 (twice in these restart runs, and once after step 1 in a 145-node startup run). The reprosum calculation checks for NaNs and INFs and will abort via endrun if there is any such value in the summation. If the ne30 runs did not encounter the SHR_REPROSUM_CALC endrun, then they never had a NaN/INF in their calculations. I am turning to compiler flags to catch the NaNs earlier.
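
The kind of pre-summation check described here can be sketched as follows. This is an illustration only, not the actual shr_reprosum_mod code: the program name, array contents, and messages are made up, and it relies only on the standard ieee_arithmetic module.

```fortran
program check_before_sum
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan, &
       ieee_positive_inf, ieee_is_nan, ieee_is_finite
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8) :: field(5)
  integer :: nan_count, inf_count

  ! Hypothetical input with one NaN and one INF planted in it
  field = [1.0_r8, 2.0_r8, 3.0_r8, 4.0_r8, 5.0_r8]
  field(2) = ieee_value(1.0_r8, ieee_quiet_nan)
  field(4) = ieee_value(1.0_r8, ieee_positive_inf)

  ! Count bad values before attempting the sum
  nan_count = count(ieee_is_nan(field))
  inf_count = count(.not. ieee_is_finite(field) .and. .not. ieee_is_nan(field))

  if (nan_count > 0 .or. inf_count > 0) then
     print *, 'input contains', nan_count, 'NaNs and', inf_count, 'INFs'
     error stop 'NaNs or INFs in input'
  end if

  print *, 'sum =', sum(field)
end program check_before_sum
```

Because the check runs on every summation, a run that never trips it has not passed a NaN or INF into the global mean, which is the reasoning applied to the ne30 runs above.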

PeterCaldwell commented 6 years ago

Hmm. I think we can conclude from the fact that NaNs show up on step 4 that this is not an initialization problem. Does this problem always show up on step 4, or does the timestep it shows up on vary? If always step 4, is there something special about step 4 (e.g. radiation is called every hour = every 4 steps)? Is this error reproducible in the sense that all identical simulations fail, or do some get past step 4? If the latter, then we have a reproducibility problem (which my analysis today of my current movie run also suggests).

singhbalwinder commented 6 years ago

> Also, it looks like the problem is at the 9th level from the top of the domain... which is kind of weird.

I think cflx is dimensioned (pcols, pcnst), so the index being printed is a column, not a level: the issue is in column 9. ncol is also 9, so either something is not assigned in the last column of the chunk and ends up as NaN, or ncol should be 8 instead of 9 for this chunk. It would be clearer if @amametjanov changed all his "if" conditions to look like:

if (cam_in(lchnk)%cflx(i,1) /= cam_in(lchnk)%cflx(i,1)) then
   write(iulog,*) 'NaN in clfx',i,lchnk,ncol,ieflx_opt,cam_in(lchnk)%cflx(i,1)
endif

singhbalwinder commented 6 years ago

Sorry, I forgot to mention: I changed the : in cflx(:,1) to i.

singhbalwinder commented 6 years ago

> ncol should be 8 instead of 9 for this chunk

Sorry, I don't think ncol should be 8 instead of 9. One reason for this behavior might be that cam_in is passed to some subroutine with intent(out) and only 8 columns are updated in that subroutine. This would leave the 9th column with undefined values, which can be NaN.

worleyph commented 6 years ago

Chunks can have different numbers of columns (and need not use all of the available space in a chunk). If some routine is using pcols instead of ncol, this could cause the situation @singhbalwinder is conjecturing about.
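
The scenario conjectured in the last two comments can be reproduced in a few lines. The sketch below is not E3SM code: update_surface is a hypothetical routine, and the array is deliberately filled with NaN first so that any column the routine skips is easy to spot. The dummy argument is declared intent(inout) to keep the demo well defined; with intent(out), the skipped entries would be formally undefined rather than guaranteed NaN.

```fortran
program partial_update_demo
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan, ieee_is_nan
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  integer, parameter :: pcols = 16   ! allocated columns per chunk (as in this thread)
  integer, parameter :: ncol  = 9    ! active columns in this particular chunk
  real(r8) :: ts(pcols)
  integer :: i

  ! Poison the whole array so that untouched entries stand out
  ts = ieee_value(1.0_r8, ieee_quiet_nan)

  ! Hypothetical bug: the routine is told to fill one column too few
  call update_surface(ts, ncol - 1)

  ! Check only the active columns, one element at a time
  do i = 1, ncol
     if (ieee_is_nan(ts(i))) print *, 'NaN in ts at column', i, 'of', ncol
  end do

contains

  subroutine update_surface(field, n)
    real(r8), intent(inout) :: field(:)
    integer, intent(in) :: n
    field(1:n) = 300.0_r8   ! only the first n columns are assigned
  end subroutine update_surface

end program partial_update_demo
```

Here the loop reports column 9 only, the same pattern as in the log above, where the first eight columns hold physical values and the ninth is NaN. Using ieee_is_nan instead of the x /= x comparison is a choice made for this sketch; the self-comparison can be optimized away under aggressive floating-point optimization settings.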

amametjanov commented 6 years ago

This should be fixed by ACME-Climate/ACME#2208. I was not able to reproduce this after that PR.

rljacob commented 6 years ago

Az, if these are KNL-specific bugs with KNL-specific solutions, please change the title to include "on KNL".

rljacob commented 6 years ago

To help searches by future users.