E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM
Other
346 stars 353 forks source link

calculation with NaN in CLM4_5 with PGI compiler on Titan #165

Closed worleyph closed 9 years ago

worleyph commented 9 years ago

Recent experiments with CLM4_5 (both I and B case) using pgi/14.10 on Titan are failing with "urban net longwave radiation error: no convergence". Identical experiments using the Intel compiler (with all of the recent Intel-specific fixes) does not exhibit this problem.

I tracked this down to computation with NaNs. The field in question is initialized to NaN, but, at first glance, it appears to later be set to a non-NaN value. I am still investigating, but may hand this off to someone if I run out of time.

bishtgautam commented 9 years ago

The possibly culprit could be the values returned by BandDiagonal() in SoilTemperatureMod.F90. Can @worleyph or @daliwang or @acme-y9s provide notes on how to create a case (I or B ) to reproduce this error?

worleyph commented 9 years ago

This "works" (i.e. doesn't work) for me

create_newcase -case ne30_ICLM45BGC_pgi_test -compset ICLM45BGC -res ne30_g16 -mach titan -compiler pgi -project cli115

worleyph commented 9 years ago

@bishtgautam - it appears that the values coming out of BandDiagonal do have problems - t_soisno(c,j) contains NaNs (level 1, so 'soil'). Endrun is called after the first appearance, so others may be NaNs as well.

bishtgautam commented 9 years ago

My guess is that diagonal entries of bmatrix need to be initialize as 1.d0. And, diagonal entries of bmatrix are assembled from bamatrix_snow, bamatrix_ssw, and bamatrix_soil. Possible initialization modifications should happen at line no. 3128, 3777, and 4522

While we are at this, possibly initialization of tvector should be set to 0.d0 at line no. 387

(I'm on travel today, but can take a look at it tomorrow).

worleyph commented 9 years ago

I traced it back to (at least) rt_snow in SetRHSVec (rt_ssw and rt_soil seem okay?) rt_snow is initialized to nan. I checked within the routines that calculate rt_snow and do not see any problems there. However, rt_snow is set only when a number of if-tests are satisfied. It is almost as if rt_snow is not being calcualted for all of the indices that it is referenced for. If so, this would be a real bug, not a pgi problem. I've run out of time and simple ideas at the moment. If you start in on this tomorrow, please send me a note to that effect.

worleyph commented 9 years ago

My latest NaN tests are also reporting NaNs when running with the Intel compiler, but not the ones that "matter". So my diagnosis as to when the NaNs first show up is not accurate. I'll communicate with Gautam directly until we can get this worked out.

rljacob commented 9 years ago

I tried the gnu compiler on blues with CLM45 and it worked (although I haven't tried recently).

bishtgautam commented 9 years ago

I suspect the issue is the product division or/and multiplication of NaN and zero, which is being handled differently by different compilers within the BandDiagonal() solve. I believe ideally for an inactive layer (corresponding to snow or standing water layer), the diagonal term in the matrix should be set to 1._r8 and off-diagonal term set to 0._r8; while corresponding entry in the rvector should be set to 0._r8.

Instead of debugging this error on a global CLM grid, I'm trying to see if I can reproduce this issue on a smaller 1x1 grid (-res 1x1_urbanc_alpha).

worleyph commented 9 years ago

What I am finding at the moment is that the PGI compiler is finding NaNs in result(:)=r(ci,jtop(ci):jbot(ci)) in BandDiagonal while the Intel compiler is not. (Here r(:,:) is an unmodified input parameter. )This seems to imply that there are more NaNs coming out of the call to SetRHSVec using PGI than using Intel. Unfortunately there are NaNs for both (since this is how the fields are initialized) and I don't know which indices are valid and which are not.

bishtgautam commented 9 years ago

(1) Is jtop(ci):jbot(ci) same for PGI and Intel? (2) Even if answer to (1) is yes, how can difference in number of NaNs from the output of BandDiagonal() be attributed to differences in rvector from SetRHSVec(). Maybe the bmatrix from SetMatrix() also has different number of 0._r8 for PGI and Intel.

singhbalwinder commented 9 years ago

I will give it a try with the NAG compiler to see if it provides us with more info.

On Wed, Apr 1, 2015 at 9:07 PM, Gautam Bisht notifications@github.com wrote:

(1) Is jtop(ci):jbot(ci) same for PGI and Intel? (2) Even if answer to (1) is yes, how can difference in number of NaNs from the output of BandDiagonal() be attributed to differences in rvector from SetRHSVec(). Maybe the bmatrix from SetMatrix() also has different number of 0._r8 for PGI and Intel.

— Reply to this email directly or view it on GitHub https://github.com/ACME-Climate/ACME/issues/165#issuecomment-88720523.

worleyph commented 9 years ago

(1) Is jtop(ci):jbot(ci) same for PGI and Intel? It clearly is not exactly the same, since PGI has some NaNs and Intel does not.

worleyph commented 9 years ago

With help from @bishtgautam and @singhbalwinder I found a workaround for the problem (for this one case - it will need to be tested more extensively to determine whether it is sufficient). It is a small change, but does appear to be a PGI compiler bug. Someone else will have to decide whether it is worth reporting. According to @bishtgautam , the clm4_5_r097 tag, being brought in for consideration for V2, works fine for this case with PGI.

In the routine InitCold in the file TemperatureType.F90, the input parameters are declared

real(r8)          , intent(in) :: em_roof_lun(bounds%begl:)
real(r8)          , intent(in) :: em_wall_lun(bounds%begl:)
real(r8)          , intent(in) :: em_improad_lun(bounds%begl:)
real(r8)          , intent(in) :: em_perroad_lun(bounds%begl:)

If these are instead declared

real(r8)          , intent(in) :: em_roof_lun(bounds%begl:bounds%endl)
real(r8)          , intent(in) :: em_wall_lun(bounds%begl:bounds%endl)
real(r8)          , intent(in) :: em_improad_lun(bounds%begl:bounds%endl)
real(r8)          , intent(in) :: em_perroad_lun(bounds%begl:bounds%endl)

then the code works. Note that immediately following this are statements to the effect

SHR_ASSERT_ALL((ubound(em_roof_lun) == (/bounds%endl/)), errMsg(FILE, LINE)) SHR_ASSERT_ALL((ubound(em_wall_lun) == (/bounds%endl/)), errMsg(FILE, LINE)) SHR_ASSERT_ALL((ubound(em_improad_lun) == (/bounds%endl/)), errMsg(FILE, LINE)) SHR_ASSERT_ALL((ubound(em_perroad_lun) == (/bounds%endl/)), errMsg(FILE, LINE))

These assert statements are tested only when compiled in DEBUG mode. @singhbalwinder trying doing just this with the NAG compiler, but got a seg. fault, so that wasn't very informative.

This style of code is used through this part of CLM45 - leave off upper bounds on input paramters and immediately following test, using SHR_ASSERT_ALL, that the upper bounds are the expected value. I have no idea what the history of this coding style is.

In this particular failure mode, if you do not specify the upper bound, 'ubound{em_roof_lun)' returns 1701611158 . Probing what is going on later in the code leads to a seg. fault. Not 'probing' results in certain values not being set (leaving the original NaN initialization).

I am reassigning this to @bishtgautam to test and fix.

worleyph commented 9 years ago

Also, our version of InitCold appears identical to that in clm4_5_r097, so the source of the problem is elsewhere. This workaround is just that. A further investigation might find a better fix, but I'll leave that to the Land Group to decide if they want to pursue this.

bishtgautam commented 9 years ago

I verified that clm4_5_r088 works with PGI on Titan for the same ICLM45BGC case that fails with ACME master.

Tag name: clm4_5_1_r088 Originator(s): muszala (Stefan Muszala) Date: Wed Oct 1 09:24:43 MDT 2014 One-line Summary: Pull out ED deps. in TemperatureTypeMod, can now compile with pgi 14.7

Purpose of changes: Pull out the dependency on EDBioType in TemperatureType.F90. The ED variables related to phenology now reside in EDPhenologyMod.F90. This refactor also had the effect of getting past a PGI 14.7 ICE which looks like it was due to the use of EDbio_vars in TemperatureType.F90. When I pulled out lines 1227 and 1226 of biogeophys/TemperatureType.F90 (in clm4_5_1_r087) and passed the two EDbio_vars variables through the argument list the ICE went away.

This tag breaks ED restart tests. We went ahead with the tag because we had to fix a more general problem with the CESM and CAM builds and PGI 14.7. the ED v0.1.0 branch does not have these modifications and may be used as an alternative. A new clm tag will shortly follow that addresses any remaining problems.

After adding #if 0 and #endif for code between line no. 1227-1267 of TemperatureType.F90, the ICLM45BGC ran successfully on Titan.

Q) Is it worth making code changes in ACME Land Model (ALM) that would also work for ED code given that we are going create a fork for ED development? If yes, there were additional ED related changes made in clm4_5_r086 that we may have to bring in ALM. (@climate-dude, others: Any thoughts?)

climate-dude commented 9 years ago

I think these fixes are worth bringing into the mainline ALM development. I thought @thorntonpe was making some major changes in TemperatureType.F90, but I have not seen these come in yet. To get past compiler issues, we should bring in these workarounds.

worleyph commented 9 years ago

This decision is clearly the Land Group's. I just want to note that the NCAR commit that eliminated the problem was "inadvertent". It was not the result of a diagnosis and targeted change. We similarly have a modification that is equally effective (and equally a workaround). The decision should be based on whether any of the mods in r88 are justified on their own. Perhaps @thorntonpe 's changes will inadvertently eliminate this problem as well? Or we can rip out any 'use_ed' code blocks, since that worked for @bishtgautam, as I don't think that we are using this in V1?

bishtgautam commented 9 years ago

After further testing I realized that #ifdef 0 doesn't fixes the problem and Pat's solution of adding bounds%endl is the only thing that does work.

worleyph commented 9 years ago

I enabled all tests of the form SHR_ASSERT_ALL((ubound(em_roof_lun) == (/bounds%endl/)), errMsg(FILE, LINE)) throughout the code and the only failures were in InitCold . (It died with the check on em_roof_lun - I haven't checked the others yet.) So, good news I guess. Making the change that @bishtgautam submitted in his PR addresses the only known impact of this compiler bug.