E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM
Other
351 stars 360 forks source link

E3SM Land Developer Tests Failing in DEBUG Mode #4820

Closed peterdschwartz closed 1 year ago

peterdschwartz commented 2 years ago

With current master, I ran many of the land developer tests in debug mode ( case directory is here: /lcrc/group/e3sm/ac.schwartzpd/master-debug-tests/cases) and it seems any that use the chemistry of ELM all fail. I believe this has been an issue for a while so I can't say how many code changes this will require and not each test failed in the same place. Almost all of them are due to the triggering of a floating point exception edit: due to evaluation of logical expressions involving NaNs.

A potential fix is to add shr_infnan_isnan consistently as that will not trigger an fpe error. An potential downside of shr_infnan_isnan is I am not sure how to readily port them to use in GPU regions -- a GPU friendly check that I implemented is to use if statements like if(.not. (x .ne. x) ) which trigger a fpe (the FireMod backtraces below). I can make a elm wrapper function that switches between the two at compilation but will have to check how adding an extra function call impacts performance.

Currently, many new development runs cannot be run in DEBUG mode without removal of -fpe0 flag, and since I do not believe it is a good policy that an atm developer should have to contact an elm developer just to enable debug mode, then the -fpe0 should be temporarily removed for elm files. And tests will need to be added to catch these issues sooner as the current ELM debug tests are clearly inadequate.

ERS.f09_g16.I1850ELMCN.chrysalis_intel
ERS.f09_g16.I1850GSWCNPRDCTCBC.chrysalis_intel.elm-vstrd (different traceback)
ERS.f19_f19.I20TRELMCN.chrysalis_intel
ERS.f09_g16.I1850ELMCN.chrysalis_intel.elm-bgcinterface
ERS.r05_r05.IELM.chrysalis_intel.elm-V2_ELM_MOSART_features
ERS.f19_g16.I1850CNECACNTBC.chrysalis_intel.elm-eca
ERS.f19_g16.I1850GSWCNPECACNTBC.chrysalis_intel.elm-eca_f19_g16_I1850GSWCNPECACNTBC (no traceback just seg fault)
ERS.f19_g16.I20TRGSWCNPECACNTBC.chrysalis_intel.elm-eca_f19_g16_I20TRGSWCNPECACNTBC
ERS.f19_g16.I20TRGSWCNPRDCTCBC.chrysalis_intel.elm-ctc_f19_g16_I20TRGSWCNPRDCTCBC
ERS.f19_g16.I1850CNRDCTCBC.chrysalis_intel.elm-rd (no traceback just seg fault)
ERS.f19_g16.I1850ELM.chrysalis_intel.elm-betr (betr traceback)
SMS_Ld1.hcru_hcru.I1850CRUELMCN.chrysalis_intel
SMS_Ly2_P1x1.1x1_smallvilleIA.IELMCNCROP.chrysalis_intel.elm-lulcc_sville (FireMod traceback)
SMS_Ly2_P1x1.1x1_smallvilleIA.IELMCNCROP.chrysalis_intel.elm-per_crop
SMS.r05_r05.I1850ELMCN.chrysalis_intel.elm-qian_1948

@rljacob @bishtgautam

ndkeen commented 2 years ago

If I try SMS_D.f09_g16.I1850ELMCN on chrysalis, I do get FP issue as suggested. Looking closer into this one, it looks like the array holding index values is what might be the problem col_pp%landunit. At least when I add lines like this below to components/elm/src/main/subgridAveMod.F90.

Note I'm setting g=0 (which should not affect anything here) just as a no-op so that can see exactly what compiler is trapping on.

    do c = bounds%begc,bounds%endc
       if (col_pp%active(c) .and. col_pp%wtgcell(c) /= 0._r8) then
          l = col_pp%landunit(c)
          if (spval > 0) g=0 ! ndk                                                                                                                                                                                       
          if (carr(c) /= spval) g=0 ! ndk                                                                                                                                                                                
          if (scale_c2l(c) /= spval) g=0 ! ndk                                                                                                                                                                           
          write(*,*) "ndk c=", c, " l=", l   !<-- where fail is now -- ie i think it's the l value
          if (col_pp%landunit(c) > 0) g=0 ! ndk                                                                                                                                                                          
          if (scale_l2g(l) /= spval) g=0 ! ndk                                                                                                                                                                           
          if (carr(c) /= spval .and. scale_c2l(c) /= spval .and. scale_l2g(l) /= spval) then  !<-- where initial fail shows
             g = col_pp%gridcell(c)
             if (sumwt(g) == 0._r8) garr(g) = 0._r8
             garr(g) = garr(g) + carr(c) * scale_c2l(c) * scale_l2g(l) * col_pp%wtgcell(c)
             sumwt(g) = sumwt(g) + col_pp%wtgcell(c)
          end if
       end if
    end do

!     66:  ndk c=      130560  l=       39814                                                                                                                                                                            
! 66: forrtl: error (78): process killed (SIGTERM)                                                                                                                                                                       
! 66: Image              PC                Routine            Line        Source                                                                                                                                         
! 66: libpnetcdf.so.3.0  000015555406F6BC  for__signal_handl     Unknown  Unknown                                                                                                                                        
! 66: libpthread-2.28.s  0000155551A79B20  Unknown               Unknown  Unknown                                                                                                                                        
! 66: libpthread-2.28.s  0000155551A78845  __write               Unknown  Unknown                                                                                                                                        
! 66: libpnetcdf.so.3.0  000015555407A60E  for__write_output     Unknown  Unknown                                                                                                                                        
! 66: libpnetcdf.so.3.0  000015555407B70E  for__put_sf           Unknown  Unknown                                                                                                                                        
! 66: e3sm.exe           00000000063D9CB4  Unknown               Unknown  Unknown                                                                                                                                        
! 66: e3sm.exe           0000000000E8F7D5  subgridavemod_mp_        1078  subgridAveMod.F90                                                                                                                              
peterdschwartz commented 2 years ago

Thanks for following up @ndkeen . Something to try is to remove -fpe0 that way the bounds and pointer check can work better and provide the source of error (in my experience). This run may be due to a typo in CNPBudgetMod.F90 that I have already fixed, but fixing that by itself will give an fpe in a different spot.

I hope that makes sense -- if not I can try looking at this SMS test specifically rather than the ERS test I used.

ndkeen commented 2 years ago

I think in general, we benefit from having floating-point traps. If I understand, it sounds like the fp trapping did catch this error? I would just like to see example of a floating-point trap that you do not want to stop on. So can I run with updated code to correct said typo? Or a different test perhaps?

And yes, I do think it's a very bad idea to set arrays to nan on purpose (which I know land does), but that may or may not be the issue here.

peterdschwartz commented 2 years ago

So what I mean by it hindering debugging is that it raises this error but gives a pretty useless backtrace. Removing the fpe flag will give you the exact line number of the bug. If you try what I suggest, you may see what I mean for your SMS test case. Moreover, resolving the bug will still give a fpe in the ELM because it has so many initializations to NaNs combined with inconsistent checking of NaNs (edit: my working hypothesis here that I believe has solid testing evidence so far). In other words, you will get fpe errors with no real bugs, and so fpe checking is not good until this inconsistency can be resolved.

peterdschwartz commented 2 years ago

@ndkeen Sorry I just realized I missed your direct question, I can make a branch this afternoon with the typo fixed to aid you in either confirming or refuting my tests

rljacob commented 2 years ago

Can you paste an example here of "NaN evaluation in logical expression" in ELM ?

peterdschwartz commented 2 years ago

@rljacob I worded that incorrectly: it should be "evaluation of a logical expression involving NaNs" So a floating point exception will get triggered on if statements like this: if (carr(c) /= spval .and. scale_c2l(c) /= spval .and. scale_l2g(l) /= spval) then My surface understanding is that for the IEEE standard x == y is equivalent to x - y == 0 so there are implicit floating point operations that will trigger the signaling NaN with the fpe0 flag.
It's also why filtering NaNs with an if statement like if(.not. (x .ne. x) ) will trigger the fpe and it is necessary to really use a compiler intrinsic isNaN(x) to test.

rljacob commented 2 years ago

Could ELM just set spval to a large non-physical number instead of NaN?

peterdschwartz commented 2 years ago

@rljacob It would require some additional changes to filter out by spval or huge in some loops, but that may be the best option. I did that for the initial GPU port some time ago and got non-BFB answers because some routines (consciously or unconsciously I don't know) rely on the fact that NaNs evaluate to false in if statements. I think this is a question about deciding on a coding practice and sticking to it.

peterdschwartz commented 2 years ago

@ndkeen The branch peterdschwartz/lnd/cnpbudget-typo-fix fixes the typo that you can use. I guess I didn't explicitly mention it in the original post, but every test I listed there is one that FAILs in debug mode with fpe0 turned on.

Also, the traceback pointing to the index l couldn't be a fpe since l is an integer. But, this was a memory out of bounds bug which probably corrupted the memory and is giving wonky results. Probably worth running a memcheck on my branch to see if there are any other corruption issues.

ndkeen commented 2 years ago

With your branch, I still see the same error. And you're right -- doesnt make sense to see FP error on line with integer array. When I add the debug lines above and try again, it fails here: if (carr(c) /= spval) g=0 ! ndk suggesting that maybe carr(c=126349) has an issue?

I also tried on cori-knl with Intel SMS_D.f09_g16.I1850ELMCN and it fails as well in same way. However, I know that at some point, DEBUG cases were working with Intel and fp traps on. Trying with a few older repos, I see that a repo from April 30th 2021 works with this test. But a repo dated July 16th 2021 (and several later repos) does not work (same error).

I just wonder if there really is an error somewhere.

rljacob commented 2 years ago

What is ELM trying to do by initializing values this way? Yes there should be an E3SM programming standard for this and it would start with never purposefully initializing variables to NaN.

peterdschwartz commented 2 years ago

I don't know. They also overload the assignment operation for entire modules for data types with some special NaN assignment elemental function.
use shr_infnan_mod , only : isnan => shr_infnan_isnan,nan => shr_infnan_nan, assignment(=) Seems like something that could easily be abused and have unintended consequences as some of these modules do also have science routines in them.

peterdschwartz commented 2 years ago

I also tried on cori-knl with Intel SMS_D.f09_g16.I1850ELMCN and it fails as well in same way. However, I know that at some point, DEBUG cases were working with Intel and fp traps on. Trying with a few older repos, I see that a repo from April 30th 2021 works with this test. But a repo dated July 16th 2021 (and several later repos) does not work (same error).

I just wonder if there really is an error somewhere.

@ndkeen This is promising that it may just be a few places to fix. A few major land PRs went in around then, but so did the initial implementation of carbon budgets so I will focus in on that first.

ndkeen commented 2 years ago

I reminded myself that I created an issue quite a while ago that does mention these fails. https://github.com/E3SM-Project/E3SM/issues/3123

I see that for GNU builds, we have removed the trap on "invalid". If I put that back, then SMS_D.f09_g16.I1850ELMCN also fails for GNU.

While it may be bad idea to init arrays to nan, it doesn't necessarily mean fp-traps will balk -- I could write a simple test, but can you have array set to nan, but then not evaluated/used until after it is reset to a value? I'm just suggesting that it might be good to track/fix this issue -- or at least verify the fp-trap is triggered simply because a land array is set to nan and there's a good reason why it wasn't reset to value.

peterdschwartz commented 2 years ago

Ha, hopefully we can get some closure on this 2.5 year issue. Also looks like your last test in November shows the memory corruption/out-of-bounds that I just fixed

 At line 1051 of file /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_add-compiler-flag-for-one-mpas-source-with-gnu/components/elm/src/biogeochem/CNPBudgetMod.F90
  0: Fortran runtime error: Index '20' of dimension 1 of array 'budg_statel' above upper bound of 19

I struggling to imagine exactly what you are saying code wise. IEEE standard does allow quiet NaNs and signaling NaNs and each compiler has different levels of fpe detecting. What exactly -fpe0 is doing isn't something I can speak to with confidence and may differ between compilers.

One of the crop smallville tests raises the exception at this line L373 in FireMod.F90

if( .not. (btran2(p) .ne. btran2(p)))then !?shr_infnan_isnan(btran2(p))) then

Using the special intrinsic (commented out here), won't raise an exception and why ELM overloads the assignment operation to initialize NaNs too. This is one of the main reasons I think it's NaN triggering the exceptions. But, as you have mentioned, each test may not be failing for the same reason and so it's definitely worth further investigation.

rljacob commented 1 year ago

This initialize-to-NaN issue came up in atmospheric chemistry too (i think). There was some logic to first initialize to nan, then later the variable is initialized to a correct value and then still later there is a check to make sure it was initialized correctly by checking for NaN. Its a built-in debugging system for adding new chemistry reactions.

That's reasonable but here's what should be done in the program: Initialize to a variable that can be set at compile time to NaN but by default is just a large number. That way the programmer can activate it when checking a new chemistry reaction but the default is to NOT initialize to NaN so debugging can work.

rgknox commented 1 year ago

@rljacob does E3SM have any examples of this strategy employed yet? if so we can try to reproduce this in FATES

rljacob commented 1 year ago

I don't think so.