Closed worleyph closed 9 years ago
The possibly culprit could be the values returned by BandDiagonal() in SoilTemperatureMod.F90. Can @worleyph or @daliwang or @acme-y9s provide notes on how to create a case (I or B ) to reproduce this error?
This "works" (i.e. doesn't work) for me
create_newcase -case ne30_ICLM45BGC_pgi_test -compset ICLM45BGC -res ne30_g16 -mach titan -compiler pgi -project cli115
@bishtgautam - it appears that the values coming out of BandDiagonal do have problems - t_soisno(c,j) contains NaNs (level 1, so 'soil'). Endrun is called after the first appearance, so others may be NaNs as well.
My guess is that diagonal entries of bmatrix
need to be initialize as 1.d0
. And, diagonal entries of bmatrix
are assembled from bamatrix_snow
, bamatrix_ssw
, and bamatrix_soil
. Possible initialization modifications should happen at line no. 3128, 3777, and 4522
While we are at this, possibly initialization of tvector
should be set to 0.d0
at line no. 387
(I'm on travel today, but can take a look at it tomorrow).
I traced it back to (at least) rt_snow in SetRHSVec (rt_ssw and rt_soil seem okay?) rt_snow is initialized to nan. I checked within the routines that calculate rt_snow and do not see any problems there. However, rt_snow is set only when a number of if-tests are satisfied. It is almost as if rt_snow is not being calcualted for all of the indices that it is referenced for. If so, this would be a real bug, not a pgi problem. I've run out of time and simple ideas at the moment. If you start in on this tomorrow, please send me a note to that effect.
My latest NaN tests are also reporting NaNs when running with the Intel compiler, but not the ones that "matter". So my diagnosis as to when the NaNs first show up is not accurate. I'll communicate with Gautam directly until we can get this worked out.
I tried the gnu compiler on blues with CLM45 and it worked (although I haven't tried recently).
I suspect the issue is the product division or/and multiplication of NaN and zero, which is being handled differently by different compilers within the BandDiagonal() solve. I believe ideally for an inactive layer (corresponding to snow or standing water layer), the diagonal term in the matrix should be set to 1._r8
and off-diagonal term set to 0._r8
; while corresponding entry in the rvector should be set to 0._r8
.
Instead of debugging this error on a global CLM grid, I'm trying to see if I can reproduce this issue on a smaller 1x1 grid (-res 1x1_urbanc_alpha
).
What I am finding at the moment is that the PGI compiler is finding NaNs in result(:)=r(ci,jtop(ci):jbot(ci)) in BandDiagonal while the Intel compiler is not. (Here r(:,:) is an unmodified input parameter. )This seems to imply that there are more NaNs coming out of the call to SetRHSVec using PGI than using Intel. Unfortunately there are NaNs for both (since this is how the fields are initialized) and I don't know which indices are valid and which are not.
(1) Is jtop(ci):jbot(ci)
same for PGI and Intel?
(2) Even if answer to (1) is yes, how can difference in number of NaNs from the output of BandDiagonal()
be attributed to differences in rvector
from SetRHSVec()
. Maybe the bmatrix
from SetMatrix()
also has different number of 0._r8
for PGI and Intel.
I will give it a try with the NAG compiler to see if it provides us with more info.
On Wed, Apr 1, 2015 at 9:07 PM, Gautam Bisht notifications@github.com wrote:
(1) Is jtop(ci):jbot(ci) same for PGI and Intel? (2) Even if answer to (1) is yes, how can difference in number of NaNs from the output of BandDiagonal() be attributed to differences in rvector from SetRHSVec(). Maybe the bmatrix from SetMatrix() also has different number of 0._r8 for PGI and Intel.
— Reply to this email directly or view it on GitHub https://github.com/ACME-Climate/ACME/issues/165#issuecomment-88720523.
(1) Is jtop(ci):jbot(ci) same for PGI and Intel? It clearly is not exactly the same, since PGI has some NaNs and Intel does not.
With help from @bishtgautam and @singhbalwinder I found a workaround for the problem (for this one case - it will need to be tested more extensively to determine whether it is sufficient). It is a small change, but does appear to be a PGI compiler bug. Someone else will have to decide whether it is worth reporting. According to @bishtgautam , the clm4_5_r097 tag, being brought in for consideration for V2, works fine for this case with PGI.
In the routine InitCold in the file TemperatureType.F90, the input parameters are declared
real(r8) , intent(in) :: em_roof_lun(bounds%begl:)
real(r8) , intent(in) :: em_wall_lun(bounds%begl:)
real(r8) , intent(in) :: em_improad_lun(bounds%begl:)
real(r8) , intent(in) :: em_perroad_lun(bounds%begl:)
If these are instead declared
real(r8) , intent(in) :: em_roof_lun(bounds%begl:bounds%endl)
real(r8) , intent(in) :: em_wall_lun(bounds%begl:bounds%endl)
real(r8) , intent(in) :: em_improad_lun(bounds%begl:bounds%endl)
real(r8) , intent(in) :: em_perroad_lun(bounds%begl:bounds%endl)
then the code works. Note that immediately following this are statements to the effect
SHR_ASSERT_ALL((ubound(em_roof_lun) == (/bounds%endl/)), errMsg(FILE, LINE)) SHR_ASSERT_ALL((ubound(em_wall_lun) == (/bounds%endl/)), errMsg(FILE, LINE)) SHR_ASSERT_ALL((ubound(em_improad_lun) == (/bounds%endl/)), errMsg(FILE, LINE)) SHR_ASSERT_ALL((ubound(em_perroad_lun) == (/bounds%endl/)), errMsg(FILE, LINE))
These assert statements are tested only when compiled in DEBUG mode. @singhbalwinder trying doing just this with the NAG compiler, but got a seg. fault, so that wasn't very informative.
This style of code is used through this part of CLM45 - leave off upper bounds on input paramters and immediately following test, using SHR_ASSERT_ALL, that the upper bounds are the expected value. I have no idea what the history of this coding style is.
In this particular failure mode, if you do not specify the upper bound, 'ubound{em_roof_lun)' returns 1701611158 . Probing what is going on later in the code leads to a seg. fault. Not 'probing' results in certain values not being set (leaving the original NaN initialization).
I am reassigning this to @bishtgautam to test and fix.
Also, our version of InitCold appears identical to that in clm4_5_r097, so the source of the problem is elsewhere. This workaround is just that. A further investigation might find a better fix, but I'll leave that to the Land Group to decide if they want to pursue this.
I verified that clm4_5_r088
works with PGI on Titan for the same ICLM45BGC case that fails with ACME master.
Tag name: clm4_5_1_r088 Originator(s): muszala (Stefan Muszala) Date: Wed Oct 1 09:24:43 MDT 2014 One-line Summary: Pull out ED deps. in TemperatureTypeMod, can now compile with pgi 14.7
Purpose of changes: Pull out the dependency on EDBioType in TemperatureType.F90. The ED variables related to phenology now reside in EDPhenologyMod.F90. This refactor also had the effect of getting past a PGI 14.7 ICE which looks like it was due to the use of EDbio_vars in TemperatureType.F90. When I pulled out lines 1227 and 1226 of biogeophys/TemperatureType.F90 (in clm4_5_1_r087) and passed the two EDbio_vars variables through the argument list the ICE went away.
This tag breaks ED restart tests. We went ahead with the tag because we had to fix a more general problem with the CESM and CAM builds and PGI 14.7. the ED v0.1.0 branch does not have these modifications and may be used as an alternative. A new clm tag will shortly follow that addresses any remaining problems.
After adding #if 0
and #endif
for code between line no. 1227-1267 of TemperatureType.F90, the ICLM45BGC ran successfully on Titan.
Q) Is it worth making code changes in ACME Land Model (ALM) that would also work for ED code given that we are going create a fork for ED development? If yes, there were additional ED related changes made in clm4_5_r086
that we may have to bring in ALM. (@climate-dude, others: Any thoughts?)
I think these fixes are worth bringing into the mainline ALM development. I thought @thorntonpe was making some major changes in TemperatureType.F90, but I have not seen these come in yet. To get past compiler issues, we should bring in these workarounds.
This decision is clearly the Land Group's. I just want to note that the NCAR commit that eliminated the problem was "inadvertent". It was not the result of a diagnosis and targeted change. We similarly have a modification that is equally effective (and equally a workaround). The decision should be based on whether any of the mods in r88 are justified on their own. Perhaps @thorntonpe 's changes will inadvertently eliminate this problem as well? Or we can rip out any 'use_ed' code blocks, since that worked for @bishtgautam, as I don't think that we are using this in V1?
After further testing I realized that #ifdef 0
doesn't fixes the problem and Pat's solution of adding bounds%endl
is the only thing that does work.
I enabled all tests of the form SHR_ASSERT_ALL((ubound(em_roof_lun) == (/bounds%endl/)), errMsg(FILE, LINE)) throughout the code and the only failures were in InitCold . (It died with the check on em_roof_lun - I haven't checked the others yet.) So, good news I guess. Making the change that @bishtgautam submitted in his PR addresses the only known impact of this compiler bug.
Recent experiments with CLM4_5 (both I and B case) using pgi/14.10 on Titan are failing with "urban net longwave radiation error: no convergence". Identical experiments using the Intel compiler (with all of the recent Intel-specific fixes) does not exhibit this problem.
I tracked this down to computation with NaNs. The field in question is initialized to NaN, but, at first glance, it appears to later be set to a non-NaN value. I am still investigating, but may hand this off to someone if I run out of time.