EDmodel / ED2

Ecosystem Demography Model
78 stars 112 forks source link

SIGFPE fault in twostream_rad.f90 #55

Closed crollinson closed 6 years ago

crollinson commented 9 years ago

2 out of sites just got this error about 65 years into my full runs (post-spin up from bare ground). Haven't tracked it down yet, but wanted to throw it out there incase anybody had any immediate thoughts. It's reproducible, so not an initialization error.


Traceback:

Program received signal 8 (SIGFPE): Floating-point exception.

Backtrace for this error:


Cited Lines of code in twostream_rad.f90

     !---------------------------------------------------------------------------------!
     !     We find the inverse optical depth of the direct radiation (mu0), following  !
     ! CLM10 (equation 3.3 and text after equation 3.3).                               !
     !---------------------------------------------------------------------------------!
     proj_area(i) = phi1(ipft) + phi2(ipft) * czen
     mu0      (i) = - etai(i)                                                          &
                    / log( ( 1.d0 - cai(i) )                                           &
                         + cai(i) * exp( - proj_area(i) * etai(i)                      &
                                         / ( cai(i) * czen ) ) )
     !---------------------------------------------------------------------------------!


Link to ED2IN & Species Parameterization Settings

ED2IN: https://github.com/crollinson/ED_Processing/blob/master/runs_ED2IN/ED2IN.PHO

PFT Configuration: https://github.com/crollinson/ED_Processing/blob/master/PalEON_Phase1a.v2.xml

I'll take a look at the output sometime in the next couple days and let folks know if it's a problem with my settings and/or if there's a bug that needs to be fixed. I haven't looked at the radiation dynamics though, so of somebody has any thoughts, I'd be really grateful.

mpaiao commented 9 years ago

Christy, I can think of two possibilities, though none of them should happen...

  1. Somehow the denominator is becoming zero. This could only happen if either CAI or ETAI is zero. It shouldn't even enter in the routine if ETAI is close to zero because leaf_resolvable/wood_resolvable would be false, and I think CAI should never be zero.
  2. The term inside the exponent is very small, and you are finding underflow. Not very likely because this uses double precision, but if this is the case, you could bound the term inside the exponent by lnexp_min/lnexp_max:
use consts_coms, only : lnexp_min8, lnexp_max8
...
lnexp_now8 = max(lnexp_min8,min(lnexp_max8,-proj_area(i)*etai(i)/(cai(i)*czen)))
mu0(i) = -etai(i) / log( 1.d0 - cai(i) + cai(i) * exp(lnexp_now8))
crollinson commented 9 years ago

@mpaiao Just tried your overflow fix and it made things much, much worse. It wouldn't even start running and I got the following error:


Program received signal 8 (SIGFPE): Floating-point exception.

Program received signal 8 (SIGFPE): Floating-point exception.

Program received signal 8 (SIGFPE): Floating-point exception.

Program received signal 8 (SIGFPE): Floating-point exception.

Program received signal 8 (SIGFPE): Floating-point exception. Program received signal 8 (SIGFPE): Floating-point exception.

Program received signal 8 (SIGFPE): Floating-point exception.

Program received signal 8 (SIGFPE): Floating-point exception.

Backtrace for this error:

Backtrace for this error:

Backtrace for this error:

Backtrace for this error: fn.1 (0xAA4 at line 952 of file radiate_driver.f90 at line 602 of file two at line 952 of file rad at line 952 of file radiate_driver.f90

Backtrace for this error: ate_driver.f90 he core sizelimit \ is currently zero.

Backtrace for this error: ate_driver.f90 Backtrace for this error: ate_driver.f90 he cor Backtrace for this error:


mpaiao commented 9 years ago

@crollinson I just checked your code and it seems you commented out the line that defines proj_area(i).

crollinson commented 9 years ago

Well that would do it. That's what happens when I debug before having coffee. I switched over to the multiple scatter scheme for the time being, but I'll run some tests later to see if the underflow fix works when I don't comment out important parts.

mpaiao commented 9 years ago

@crollinson I'm thinking more about this, and the term inside the exponential should be always less than zero, and because cai is between 0 and 1 (but never 0), the term inside the log should be always between 0 and 1. If this is not happening, you may get a SIGFPE message, but then the problem is in the input, not in the subroutine itself.

Instead of my previous suggestion It may make more sense to add some temporary sanity check to understand what is causing the FPE:

lnexp_now8 = -proj_area(i)*etai(i)/(cai(i)*czen)
if (cai(i) == 0.0 .or. cai(i) > 1.0 .or. lnexp_now8 >= 0.0 .or. lnexp_now8 < lnexp_min8) then
   write(unit=*,fmt='(a,1x,i12)'   ) ' i                     = ',i
   write(unit=*,fmt='(a,1x,i12)'   ) ' ipft                  = ',ipft
   write(unit=*,fmt='(a,1x,es12.5)') ' phi1(ipft)            = ',phi1(ipft)
   write(unit=*,fmt='(a,1x,es12.5)') ' phi2(ipft)            = ',phi2(ipft)
   write(unit=*,fmt='(a,1x,es12.5)') ' lai(i)                = ',lai(i)
   write(unit=*,fmt='(a,1x,es12.5)') ' wai(i)                = ',wai(i)
   write(unit=*,fmt='(a,1x,es12.5)') ' cai(i)                = ',cai(i)
   write(unit=*,fmt='(a,1x,es12.5)') ' clumping_factor(ipft) = ',clumping_factor(ipft)
   write(unit=*,fmt='(a,1x,es12.5)') ' etai(i)               = ',etai(i)
   write(unit=*,fmt='(a,1x,es12.5)') ' proj_area(i)          = ',proj_area(i)
   call fatal_error('Invalid values','sw_two_stream','twostream_rad.f90')
end if
crollinson commented 9 years ago

@mpaiao it looks like the problem probably was from the input. Overnight I had more crashes stemming from the radiation while running the multiple scatter. This time it looks like the traceback goes to the way the code is splitting the radiation.


Backtrace

Program received signal 8 (SIGFPE): Floating-point exception.

Backtrace for this error:



Relevant Code:

  !------------------------------------------------------------------------------------!
  !     Find the fraction of PAR and NIR that stays as beam, using equations 11 and 12 !
  ! of WN85.                                                                           !
  !------------------------------------------------------------------------------------!
  !----- Make sure that the ratio is bounded. -----------------------------------------!
  aux_par       = min(wn85_11(1), max(0., ratio))
  aux_nir       = min(wn85_12(1), max(0., ratio))
  fvis_beam_act = min(1., max(0., par_beam_pot                                         &
                             * (1. - ((wn85_11(1) - aux_par)/wn85_11(2)) ** twothirds) &
                             / par_full_pot ) )
  fnir_beam_act = min(1., max(0., nir_beam_pot                                         &
                             * (1. - ((wn85_12(1) - aux_nir)/wn85_12(2)) ** twothirds) &
                             / nir_full_pot ) )
  fvis_diff_act = 1. - fvis_beam_act
  fnir_diff_act = 1. - fnir_beam_act
  !------------------------------------------------------------------------------------!

The traceback is pointing to the / nir_full_pot which is calculated as: !------------------------------------------------------------------------------------! ! Find the potential direct and diffuse near-infrared radiation, using equations ! ! 4, 5, and 10 of WN85. ! !------------------------------------------------------------------------------------! nir_beam_pot = ( nir_beam_top & * exp ( nir_beam_expext * (atm_prss / prefsea) * secz) - w10 ) * cosz nir_diff_pot = nir2diff_sun * ( nir_beam_top - nir_beam_pot - w10 ) * cosz nir_full_pot = nir_beam_pot + nir_diff_pot !------------------------------------------------------------------------------------!


I haven't looked into our met drivers yet to see if theres a problem there (possible, but hopefully not).

crollinson commented 9 years ago

Update: if I can the shortwave radiation computation to SiB (IMETRAD = 1) instead of Weiss & Norman (IMETRAD = 2) I get through the crash point. I'm under a time constraint and just need things to run for the moment, but it looks like there's some bug or needed sanity check in Weiss & Norman that should get looked into when somebody has the time.

rgknox commented 9 years ago

Thanks for the thread Christy, upgrading this label to bug.