Closed amametjanov closed 6 years ago
Thanks @amametjanov for reporting this. I have seen this error before, but with a different e3sm configuration. Which compset and resolution are you using? Is this reproducible? If so, do you also get it with debug flags turned on? The hope is that the debug flags might reveal where the NaN first originates.
Yes, this is the second time I am seeing this with SMS_PXL.ne120_oRRS18v3_ICG.A_WCYCL1950S_CMIP6_HR.theta_intel.cam-cosplite. Debug runs are also continuing but have not run into this yet.
Looking through the history file (*.cam.rh0.0001-01-31-00000.nc), fields like cam_out(lchnk)%precsc(:ncol) have very small values, e.g.:
1.88079096131566e-40, 7.7582627154271e-40, 2.1084395886461e-84, 0,
4.17619485951906e-56, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
Theta's machine precision is:
Single precision = 0.11920929E-06 or 2^-23
Double precision = 0.22204460E-15 or 2^-52
Arithmetic operations on such small numbers could be what is producing the NaNs.
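For context, machine epsilon (the numbers quoted above) is the relative spacing near 1.0, not the smallest representable magnitude. The following is a minimal sketch, not taken from the model code, that prints both limits with standard Fortran intrinsics and compares one of the history-file values against them. Whether the history file stores these fields in single or double precision is not stated in this thread, so both are shown.

```fortran
program precision_limits
  use, intrinsic :: iso_fortran_env, only: real32, real64
  implicit none
  real(real32) :: x32
  real(real64) :: x64

  ! Machine epsilon: gap between 1.0 and the next representable value.
  print *, 'epsilon(real32) =', epsilon(x32)   ! 2**(-23)
  print *, 'epsilon(real64) =', epsilon(x64)   ! 2**(-52)

  ! Smallest positive normalized value; anything smaller is subnormal.
  print *, 'tiny(real32)    =', tiny(x32)
  print *, 'tiny(real64)    =', tiny(x64)

  ! One of the precsc values reported above, held as a double.
  x64 = 1.88079096131566e-40_real64
  print *, 'below single-precision tiny? ', x64 < real(tiny(x32), real64)
  print *, 'below double-precision tiny? ', x64 < tiny(x64)
end program precision_limits
```

A value near 1e-40 is subnormal in single precision but an ordinary normalized number in double precision, so its effect on arithmetic depends on the kind in which it is actually used.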
Could those numbers be junk from an array that wasn't initialized to zero? I doubt small numbers like that are computed by the model.
It looks like they are being computed :) The cam_in and cam_out arrays are initialized to 0 in components/cam/src/control/camsrfexch.F90. History files are here: e.g. /projects/ClimateEnergy_2/azamatm/SMS_Ld31_PXL.ne120_oRRS18v3_ICG.A_WCYCL1950S_CMIP6_HR.theta_intel.cam-cosplite.20180313_181914_l5lwlk/precsc.out (ncdump -t -v var1 *.nc).
We can catch instances where these small numbers are generated by using the compiler's underflow flags. I have never seen them cause any issues in the past. I remember using such a flag a long time ago, and the code would crash very early on with underflow detection. Underflows have always seemed harmless to me, but that may depend on the compiler too: some compilers automatically flush underflows to zero while others do not.
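Independently of compiler-specific flags, the standard ieee_exceptions module can query the underflow flag around a suspect computation. The sketch below is an illustration only and is not part of the model: the variable names are made up, and the volatile attribute is used here just to discourage the compiler from folding the multiplication at compile time. Whether the flag is raised can still depend on the compiler and on flush-to-zero settings.

```fortran
program underflow_check
  use, intrinsic :: iso_fortran_env, only: real64
  use, intrinsic :: ieee_exceptions, only: ieee_underflow, ieee_get_flag, &
                                           ieee_set_flag, ieee_support_flag
  implicit none
  real(real64), volatile :: x   ! hypothetical input to a suspect expression
  real(real64) :: y
  logical :: underflowed

  if (.not. ieee_support_flag(ieee_underflow, x)) then
    print *, 'underflow flag not supported for this kind'
    stop
  end if

  ! Clear the flag, run the computation, then read the flag back.
  call ieee_set_flag(ieee_underflow, .false.)
  x = tiny(x)
  y = x * 1.0e-10_real64        ! result lies below the normalized range
  call ieee_get_flag(ieee_underflow, underflowed)

  print *, 'y =', y
  print *, 'underflow flag raised: ', underflowed
end program underflow_check
```

Wrapping individual physics calls this way narrows down where an underflow first appears without aborting the whole run.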
Additional data point about initialization. ATM after 4th step in the continued run has:
Current step number: 3841
...
nstep, te 3841 0.33411446843176217E+10 0.33411468040667338E+10 0.23448612056170461E-03 0.98507177590207240E+05
nstep, te 3842 0.33411357703415484E+10 0.33411378602920594E+10 0.23118982348604186E-03 0.98507170438543981E+05
nstep, te 3843 0.33411305601672096E+10 0.33411319091116810E+10 0.14921992066070427E-03 0.98507165635626530E+05
nstep, te 3844 0.33411247440697255E+10 0.33411262608280902E+10 0.16778346050163977E-03 0.98507161897076017E+05
And these checks:
@@ -702,7 +702,29 @@ subroutine ieflx_gmean(state, tend, pbuf2d, cam_in, cam_out, nstep)
case default
call endrun('*** incorrect ieflx_opt ***')
end select
-
+ do i=1,ncol
+ if (cam_in(lchnk)%cflx(i,1) /= cam_in(lchnk)%cflx(i,1)) then
+ write(iulog,*) 'NaN in clfx',i,lchnk,ncol,ieflx_opt,cam_in(lchnk)%cflx(:,1)
+ endif
+ if (cam_in(lchnk)%ts(i) /= cam_in(lchnk)%ts(i)) then
+ write(iulog,*) 'NaN in ts',i,lchnk,ncol,ieflx_opt,cam_in(lchnk)%ts(:)
+ endif
+ if (cam_out(lchnk)%precc(i) /= cam_out(lchnk)%precc(i)) then
+ write(iulog,*) 'NaN in precc',i,lchnk,ncol,ieflx_opt,cam_out(lchnk)%precc(:),'rain',rain(:,lchnk)
+ endif
+ if (cam_out(lchnk)%precl(i) /= cam_out(lchnk)%precl(i)) then
+ write(iulog,*) 'NaN in precl',i,lchnk,ncol,ieflx_opt,cam_out(lchnk)%precl(:),'rain',rain(:,lchnk)
+ endif
+ if (cam_out(lchnk)%precsc(i) /= cam_out(lchnk)%precsc(i)) then
+ write(iulog,*) 'NaN in precsc',i,lchnk,ncol,ieflx_opt,cam_out(lchnk)%precsc(:),'snow',snow(:,lchnk)
+ endif
+ if (cam_out(lchnk)%precsl(i) /= cam_out(lchnk)%precsl(i)) then
+ write(iulog,*) 'NaN in precsl',i,lchnk,ncol,ieflx_opt,cam_out(lchnk)%precsl(:),'snow',snow(:,lchnk)
+ endif
+ if (ienet(i,lchnk) /= ienet(i,lchnk)) then
+ write(iulog,*) 'NaN in ienet',i,lchnk,ncol,ieflx_opt,ienet(i,lchnk)
+ endif
+ enddo
show in e3sm.log:
(seq_domain_areafactinit) : min/max drv2mdl 0.999999923354153 1.00000007509469 areafact_o_OCN
NaN in clfx 9 114015 9 2
3.971921184855017E-005 1.642525045843797E-004 2.947733991644233E-005
4.784900201901546E-005 7.489103276731714E-005 9.108095325323595E-006
4.129502134087975E-006 2.144881251408060E-004 NaN
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
0.000000000000000E+000
NaN in ts 9 114015 9 2
300.460276761804 295.163237366316 294.255842339761
301.902920996472 299.266058591620 286.463401697433
252.900251885251 288.404882510593 NaN
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
0.000000000000000E+000
NaN in precl 9 114015 9 2
8.721439523793770E-010 6.797591931365890E-013 7.631128467863829E-011
5.912707206395664E-020 3.992641589920015E-007 3.504465270537491E-010
5.558511966756323E-011 1.501216548661270E-008 NaN
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
0.000000000000000E+000 rain 3.849603926957925E-009 6.797591931365890E-013
7.631128467863829E-011 5.912707206395664E-020 5.036713899589927E-007
5.794703939578485E-008 0.000000000000000E+000 5.997597268333353E-010
NaN 0.000000000000000E+000 0.000000000000000E+000
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
0.000000000000000E+000 0.000000000000000E+000
NaN in precsl 9 114015 9 2
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
0.000000000000000E+000 0.000000000000000E+000 4.504836162751876E-021
5.558511966756323E-011 1.441240575977937E-008 NaN
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
0.000000000000000E+000 snow 0.000000000000000E+000 0.000000000000000E+000
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
4.504836162751876E-021 5.558511966756323E-011 1.635619849632815E-008
NaN 0.000000000000000E+000 0.000000000000000E+000
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
0.000000000000000E+000 0.000000000000000E+000
NaN in ienet 9 114015 9 2
NaN
So the arrays of size 16 are initialized to 0, but get NaNs in fields
162 PRECL m/s 1 A Large-scale (stable) precipitation rate (liq + ice)
168 PRECSL m/s 1 A Large-scale (stable) snow rate (water equivalent)
174 QFLX kg/m2/s 1 A Surface water flux
188 TS K 1 A Surface temperature (radiative)
May need to check these fields in history files.
@amametjanov - are you saying that these NaNs show up for the first time after step 4, or are they formed upon initialization and persist through step 4? Also, it looks like the problem is at the 9th level from the top of the domain... which is kind of weird. I could understand a problem at the top level, but how can levels above and below get assigned and an intermediate level have problems? One possibility is that cloud physics isn't applied at the top few levels of the domain. Last I checked, cloud physics in the top 7 levels are ignored: https://acme-climate.atlassian.net/wiki/spaces/ATM/pages/129511233/Does+trop+cloud+top+press+have+any+impact . I was thinking there could be a problem with stitching the cloudy and cloud-free parts of the model together. Another possibility is that there's a problem with the bottom of the sponge layer at the top of the model. @mt5555 - do you know how many layers are part of the sponge? I'm also curious whether these NaNs show up in low-res simulations. Could you re-run A_WCYCL1850S at ne30 with your print-statemented code, @amametjanov ?
Yes, showing up for the first time after step 4 (twice in these restart runs and once after step 1 in a 145-node startup run). The reprosum calculation has a check for NaNs and INFs and will abort/endrun if there is any such value in the summation. If ne30 runs did not encounter the SHR_REPROSUM_CALC endrun, then they never had a NaN/INF in their calculations. Turning to compiler flags to catch NaNs earlier.
Hmm. I think we can conclude from the fact that NaNs show up on step 4 that this is not an initialization problem. Does this problem always show up on step 4, or does the timestep it shows up on vary? If always step 4, is there something special about step 4 (e.g. radiation is called every hour = every 4 steps)? Is this error reproducible in the sense that all identical simulations fail, or do some get past step 4? If the latter, then we have a reproducibility problem (which my analysis today of my current movie run also suggests).
> Also, it looks like the problem is at the 9th level from the top of the domain... which is kind of weird.
I think cflx is dimensioned (pcols, pcnst), so the issue is in column 9, not level 9. ncol is also 9, so either something is not being assigned to the last column of the chunk, which then becomes NaN, or ncol should be 8 instead of 9 for this chunk. It will be clearer if @amametjanov changes all his "if" conditions to look like:
if (cam_in(lchnk)%cflx(i,1) /= cam_in(lchnk)%cflx(i,1)) then
write(iulog,*) 'NaN in clfx',i,lchnk,ncol,ieflx_opt,cam_in(lchnk)%cflx(i,1)
endif
Sorry, I forgot to mention: I changed the : in cflx to i in the write statement.
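The per-column check can also be factored into one helper so that each field is reported the same way. This is a sketch under assumptions, not code from the model: the module name nan_check and the routine report_nans are hypothetical, r8 is assumed to be a double-precision kind, and ieee_is_nan from the standard ieee_arithmetic module is swapped in for the x /= x test, which some aggressive floating-point optimization settings may remove.

```fortran
module nan_check
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)  ! assumed double kind
contains
  ! Hypothetical helper: report each NaN among the first ncol columns.
  subroutine report_nans(name, field, ncol, lchnk, iulog)
    character(len=*), intent(in) :: name
    real(r8),         intent(in) :: field(:)
    integer,          intent(in) :: ncol, lchnk, iulog
    integer :: i
    do i = 1, ncol
      if (ieee_is_nan(field(i))) then
        write(iulog,*) 'NaN in ', name, ': col=', i, ' chunk=', lchnk, ' ncol=', ncol
      end if
    end do
  end subroutine report_nans
end module nan_check

program demo_report
  use, intrinsic :: iso_fortran_env, only: output_unit
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan
  use nan_check
  implicit none
  real(r8) :: ts(16)

  ! Mimic the logged chunk: 8 valid columns, NaN in column 9, zeros after.
  ts = 0.0_r8
  ts(1:8) = 290.0_r8
  ts(9) = ieee_value(ts(9), ieee_quiet_nan)
  call report_nans('ts', ts, 9, 114015, output_unit)
end program demo_report
```

Printing only the offending index keeps the log short and makes it obvious that the index is a column, not a vertical level.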
> ncol should be 8 instead of 9 for this chunk
Sorry, I don't think ncol should be 8 instead of 9. One reason for this behavior might be that cam_in is passed to some subroutine as an intent(out) argument and only 8 columns are updated in that subroutine. This would leave the 9th column with undefined values, which can be NaN.
Chunks can have different numbers of columns (and need not use all of the available space in a chunk). If some routine is using pcols instead of ncol, this could cause the situation @singhbalwinder is conjecturing about.
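One way to test this conjecture is to poison an array with NaN before a call and then scan the active columns afterwards. The sketch below is purely illustrative: fill_partial is a hypothetical stand-in for whichever routine updates the field, and pcols = 16 and ncol = 9 are taken from the log output above. Strictly speaking, elements of an intent(out) argument that the routine never assigns are undefined by the standard. In practice most compilers leave the memory untouched, which is the behavior this check relies on.

```fortran
module chunk_demo
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan, ieee_is_nan
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)  ! assumed double kind
  integer, parameter :: pcols = 16                   ! allocated columns per chunk
contains
  ! Hypothetical routine: output is intent(out), but only the
  ! first nfilled columns are ever assigned.
  subroutine fill_partial(nfilled, ts)
    integer,  intent(in)  :: nfilled
    real(r8), intent(out) :: ts(pcols)
    integer :: i
    do i = 1, nfilled
      ts(i) = 280.0_r8 + real(i, r8)
    end do
  end subroutine fill_partial
end module chunk_demo

program find_unset_column
  use chunk_demo
  implicit none
  real(r8) :: ts(pcols)
  integer  :: i, ncol

  ncol = 9                               ! active columns in this chunk
  ts = ieee_value(ts, ieee_quiet_nan)    ! poison every element before the call
  call fill_partial(ncol - 1, ts)        ! the routine only touches 8 columns

  ! Any NaN left within 1..ncol marks a column nobody assigned.
  do i = 1, ncol
    if (ieee_is_nan(ts(i))) print *, 'column', i, 'was never assigned'
  end do
end program find_unset_column
```

Applied around successive calls that write to cam_in or cam_out, the same pattern would show which call leaves the last active column unassigned.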
This should be fixed by ACME-Climate/ACME#2208. I was not able to reproduce the error after that PR.
Az, if these are KNL-specific bugs with KNL-specific solutions, please change the title to include "on KNL", to help searches by future users.
Logging an issue to track down the location of this error. The NaN is in column 8 of chunk 127455.
Source of the NaN is in one of
cam_in(lchnk)%cflx(:ncol,1)
cam_out(lchnk)%precsc(:ncol)
cam_out(lchnk)%precsl(:ncol)
cam_out(lchnk)%precc(:ncol)
cam_out(lchnk)%precl(:ncol)