Closed peterdschwartz closed 1 year ago
If I try SMS_D.f09_g16.I1850ELMCN
on chrysalis, I do get FP issue as suggested.
Looking closer into this one, it looks like the array holding index values is what might be the problem col_pp%landunit
.
At least when I add lines like this below to components/elm/src/main/subgridAveMod.F90
.
Note I'm setting g=0 (which should not affect anything here) just as a no-op so that can see exactly what compiler is trapping on.
do c = bounds%begc,bounds%endc
if (col_pp%active(c) .and. col_pp%wtgcell(c) /= 0._r8) then
l = col_pp%landunit(c)
if (spval > 0) g=0 ! ndk
if (carr(c) /= spval) g=0 ! ndk
if (scale_c2l(c) /= spval) g=0 ! ndk
write(*,*) "ndk c=", c, " l=", l !<-- where fail is now -- ie i think it's the l value
if (col_pp%landunit(c) > 0) g=0 ! ndk
if (scale_l2g(l) /= spval) g=0 ! ndk
if (carr(c) /= spval .and. scale_c2l(c) /= spval .and. scale_l2g(l) /= spval) then !<-- where initial fail shows
g = col_pp%gridcell(c)
if (sumwt(g) == 0._r8) garr(g) = 0._r8
garr(g) = garr(g) + carr(c) * scale_c2l(c) * scale_l2g(l) * col_pp%wtgcell(c)
sumwt(g) = sumwt(g) + col_pp%wtgcell(c)
end if
end if
end do
! 66: ndk c= 130560 l= 39814
! 66: forrtl: error (78): process killed (SIGTERM)
! 66: Image PC Routine Line Source
! 66: libpnetcdf.so.3.0 000015555406F6BC for__signal_handl Unknown Unknown
! 66: libpthread-2.28.s 0000155551A79B20 Unknown Unknown Unknown
! 66: libpthread-2.28.s 0000155551A78845 __write Unknown Unknown
! 66: libpnetcdf.so.3.0 000015555407A60E for__write_output Unknown Unknown
! 66: libpnetcdf.so.3.0 000015555407B70E for__put_sf Unknown Unknown
! 66: e3sm.exe 00000000063D9CB4 Unknown Unknown Unknown
! 66: e3sm.exe 0000000000E8F7D5 subgridavemod_mp_ 1078 subgridAveMod.F90
Thanks for following up @ndkeen . Something to try is to remove -fpe0
that way the bounds and pointer check can work better and provide the source of error (in my experience). This run may be due to a typo in CNPBudgetMod.F90 that I have already fixed, but fixing that by itself will give an fpe in a different spot.
I hope that makes sense -- if not I can try looking at this SMS test specifically rather than the ERS test I used.
I think in general, we benefit from having floating-point traps. If I understand, it sounds like the fp trapping did catch this error? I would just like to see example of a floating-point trap that you do not want to stop on. So can I run with updated code to correct said typo? Or a different test perhaps?
And yes, I do think it's a very bad idea to set arrays to nan on purpose (which I know land does), but that may or may not be the issue here.
So what I mean by it hindering debugging is that it raises this error but gives a pretty useless backtrace. Removing the fpe flag will give you the exact line number of the bug. If you try what I suggest, you may see what I mean for your SMS test case. Moreover, resolving the bug will still give a fpe in the ELM because it has so many initializations to NaNs combined with inconsistent checking of NaNs (edit: my working hypothesis here that I believe has solid testing evidence so far). In other words, you will get fpe errors with no real bugs, and so fpe checking is not good until this inconsistency can be resolved.
@ndkeen Sorry I just realized I missed your direct question, I can make a branch this afternoon with the typo fixed to aid you in either confirming or refuting my tests
Can you paste an example here of "NaN evaluation in logical expression" in ELM ?
@rljacob I worded that incorrectly: it should be "evaluation of a logical expression involving NaNs"
So a floating point exception will get triggered on if statements like this:
if (carr(c) /= spval .and. scale_c2l(c) /= spval .and. scale_l2g(l) /= spval) then
My surface understanding is that for the IEEE standard x == y
is equivalent to x - y == 0
so there are implicit floating point operations that will trigger the signaling NaN with the fpe0
flag.
It's also why filtering NaNs with an if statement like if(.not. (x .ne. x) )
will trigger the fpe and it is necessary to really use a compiler intrinsic isNaN(x)
to test.
Could ELM just set spval to a large non-physical number instead of NaN?
@rljacob It would require some additional changes to filter out by spval or huge
in some loops, but that may be the best option. I did that for the initial GPU port some time ago and got non-BFB answers because some routines (consciously or unconsciously I don't know) rely on the fact that NaNs evaluate to false
in if statements. I think this is a question about deciding on a coding practice and sticking to it.
@ndkeen The branch peterdschwartz/lnd/cnpbudget-typo-fix fixes the typo that you can use. I guess I didn't explicitly mention it in the original post, but every test I listed there is one that FAILs in debug mode with fpe0 turned on.
Also, the traceback pointing to the index l
couldn't be a fpe since l
is an integer. But, this was a memory out of bounds bug which probably corrupted the memory and is giving wonky results. Probably worth running a memcheck on my branch to see if there are any other corruption issues.
With your branch, I still see the same error.
And you're right -- doesnt make sense to see FP error on line with integer array.
When I add the debug lines above and try again, it fails here:
if (carr(c) /= spval) g=0 ! ndk
suggesting that maybe carr(c=126349) has an issue?
I also tried on cori-knl with Intel SMS_D.f09_g16.I1850ELMCN
and it fails as well in same way.
However, I know that at some point, DEBUG cases were working with Intel and fp traps on.
Trying with a few older repos, I see that a repo from April 30th 2021 works with this test.
But a repo dated July 16th 2021 (and several later repos) does not work (same error).
I just wonder if there really is an error somewhere.
What is ELM trying to do by initializing values this way? Yes there should be an E3SM programming standard for this and it would start with never purposefully initializing variables to NaN.
I don't know. They also overload the assignment operation for entire modules for data types with some special NaN assignment elemental function.
use shr_infnan_mod , only : isnan => shr_infnan_isnan,nan => shr_infnan_nan, assignment(=)
Seems like something that could easily be abused and have unintended consequences as some of these modules do also have science routines in them.
I also tried on cori-knl with Intel
SMS_D.f09_g16.I1850ELMCN
and it fails as well in same way. However, I know that at some point, DEBUG cases were working with Intel and fp traps on. Trying with a few older repos, I see that a repo from April 30th 2021 works with this test. But a repo dated July 16th 2021 (and several later repos) does not work (same error).I just wonder if there really is an error somewhere.
@ndkeen This is promising that it may just be a few places to fix. A few major land PRs went in around then, but so did the initial implementation of carbon budgets so I will focus in on that first.
I reminded myself that I created an issue quite a while ago that does mention these fails. https://github.com/E3SM-Project/E3SM/issues/3123
I see that for GNU builds, we have removed the trap on "invalid". If I put that back, then SMS_D.f09_g16.I1850ELMCN
also fails for GNU.
While it may be bad idea to init arrays to nan, it doesn't necessarily mean fp-traps will balk -- I could write a simple test, but can you have array set to nan, but then not evaluated/used until after it is reset to a value? I'm just suggesting that it might be good to track/fix this issue -- or at least verify the fp-trap is triggered simply because a land array is set to nan and there's a good reason why it wasn't reset to value.
Ha, hopefully we can get some closure on this 2.5 year issue. Also looks like your last test in November shows the memory corruption/out-of-bounds that I just fixed
At line 1051 of file /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_add-compiler-flag-for-one-mpas-source-with-gnu/components/elm/src/biogeochem/CNPBudgetMod.F90
0: Fortran runtime error: Index '20' of dimension 1 of array 'budg_statel' above upper bound of 19
I struggling to imagine exactly what you are saying code wise. IEEE standard does allow quiet NaNs and signaling NaNs and each compiler has different levels of fpe detecting. What exactly -fpe0
is doing isn't something I can speak to with confidence and may differ between compilers.
One of the crop smallville tests raises the exception at this line L373 in FireMod.F90
if( .not. (btran2(p) .ne. btran2(p)))then !?shr_infnan_isnan(btran2(p))) then
Using the special intrinsic (commented out here), won't raise an exception and why ELM overloads the assignment operation to initialize NaNs too. This is one of the main reasons I think it's NaN triggering the exceptions. But, as you have mentioned, each test may not be failing for the same reason and so it's definitely worth further investigation.
This initialize-to-NaN issue came up in atmospheric chemistry too (i think). There was some logic to first initialize to nan, then later the variable is initialized to a correct value and then still later there is a check to make sure it was initialized correctly by checking for NaN. Its a built-in debugging system for adding new chemistry reactions.
That's reasonable but here's what should be done in the program: Initialize to a variable that can be set at compile time to NaN but by default is just a large number. That way the programmer can activate it when checking a new chemistry reaction but the default is to NOT initialize to NaN so debugging can work.
@rljacob does E3SM have any examples of this strategy employed yet? if so we can try to reproduce this in FATES
I don't think so.
With current master, I ran many of the land developer tests in debug mode ( case directory is here: /lcrc/group/e3sm/ac.schwartzpd/master-debug-tests/cases) and it seems any that use the chemistry of ELM all fail. I believe this has been an issue for a while so I can't say how many code changes this will require and not each test failed in the same place. Almost all of them are due to the triggering of a floating point exception edit: due to evaluation of logical expressions involving NaNs.
A potential fix is to add
shr_infnan_isnan
consistently as that will not trigger an fpe error. An potential downside ofshr_infnan_isnan
is I am not sure how to readily port them to use in GPU regions -- a GPU friendly check that I implemented is to use if statements likeif(.not. (x .ne. x) )
which trigger a fpe (the FireMod backtraces below). I can make a elm wrapper function that switches between the two at compilation but will have to check how adding an extra function call impacts performance.Currently, many new development runs cannot be run in DEBUG mode without removal of
-fpe0
flag, and since I do not believe it is a good policy that an atm developer should have to contact an elm developer just to enable debug mode, then the-fpe0
should be temporarily removed for elm files. And tests will need to be added to catch these issues sooner as the current ELM debug tests are clearly inadequate.@rljacob @bishtgautam