EDmodel / ED2

Ecosystem Demography Model

SIGFPE error: idealdenssh #94

Closed crollinson closed 9 years ago

crollinson commented 9 years ago

I just received the following SIGFPE error at almost all of my sites after turning disturbance on following a 200-year bare-ground spin-up with disturbance off. I was running the current GitHub mainline version. I haven't had a chance to dig into it and might not get one for a while, but I wanted to throw it out there in case anybody knows something about it.

Program received signal 8 (SIGFPE): Floating-point exception.

Backtrace for this error:

mpaiao commented 9 years ago

@crollinson This crash is likely caused by some uninitialised variable in the patch dynamics, not by a problem with the thermodynamic library. When you say disturbance on, do you mean anthropogenic disturbance or treefall/fire? Also, does it crash right at the beginning of the simulation after you turn the disturbance on, or does it run for a few years and then crash?

crollinson commented 9 years ago

@mpaiao The crash happens 10-15 years into the simulation after turning on fire and treefall. It appears to happen in patch fusion during the transition to a new year: it occurs before a history file for January 1 can be written.

mpaiao commented 9 years ago

@crollinson My hunch is that fires are causing the trouble: if they are strong they may disturb the entire area, and the default fire survivorship is 0. Zero area or zero population are also good candidates for FPE...

One suggestion is to save monthly history files (or even daily ones if possible), then restart the simulation from the history file for December 1st or 31st of the year before the crash, this time using ED compiled with the strictest debugging flags. This should reveal the first occurrence of a floating-point exception.
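As a rough illustration of why zero area or zero population is a plausible source of a floating-point exception, here is a minimal Python sketch. It is not ED2 code: the function names, the `tiny` threshold, and the numbers are all hypothetical. It divides by plant density after a fire with survivorship 0 and shows the kind of guard that avoids the failure.

```python
def biomass_per_plant(total_biomass, nplant):
    """Unguarded per-plant biomass: divides by plant density directly."""
    return total_biomass / nplant


def safe_biomass_per_plant(total_biomass, nplant, tiny=1.0e-12):
    """Guarded version: return 0.0 when effectively no plants remain."""
    if nplant <= tiny:
        return 0.0
    return total_biomass / nplant


# Hypothetical cohort before and after a fire with survivorship 0.
survivorship = 0.0
nplant_before = 0.25                      # plants per m2 (made-up value)
nplant_after = nplant_before * survivorship

try:
    biomass_per_plant(0.0, nplant_after)  # 0.0 / 0.0
    unguarded_failed = False
except ZeroDivisionError:
    # Python raises an exception here; a compiled program with
    # floating-point trapping enabled would typically abort with SIGFPE.
    unguarded_failed = True

guarded_value = safe_biomass_per_plant(0.0, nplant_after)
```

Whether ED2 actually hits this path depends on where the zero first appears, which is what the restart-with-debug-flags suggestion above is meant to find.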

crollinson commented 9 years ago

Now that I'm back to debugging ED: similar thermodynamics errors are still occurring with all disturbances turned off:

At line 169 of file ed_therm_lib.f90
Fortran runtime error: Array reference out of bounds for array 'csite', lower bound of dimension 1 exceeded (0 < 1)

Backtrace for this error:

A few sites are still giving the error from #109. I suspect the two are connected and that something is off in dbalive_dt.

mdietze commented 9 years ago

Well, the error itself is an indexing error (csite is being indexed at 0), not a thermodynamic error.

On Wed, Aug 19, 2015 at 2:11 PM, Christy Rollinson <notifications@github.com> wrote:

Now that I'm back to debugging ED: similar thermodynamics errors are still occurring with all disturbances turned off:

At line 169 of file ed_therm_lib.f90
Fortran runtime error: Array reference out of bounds for array 'csite', lower bound of dimension 1 exceeded (0 < 1)

Backtrace for this error:

  • function __ed_therm_lib_MOD_update_veg_energy_cweh (0x822B07) at line 169 of file ed_therm_lib.f90
  • function __growth_balive_MOD_dbalive_dt (0xEE9202) at line 302 of file growth_balive.f90
  • function vegetationdynamics (0xC6FD6F) at line 79 of file vegetation_dynamics.f90
  • function edmodel (0x514F94) at line 401 of file ed_model.F90
  • function eddriver (0x434CD9) at line 297 of file ed_driver.F90
  • in the main program at line 285 of file edmain.F90
  • /lib64/libc.so.6(__libc_start_main+0xfd) [0x2b4541afbd5d]

/var/spool/sge/scc-gb08/job_scripts/7408840: line 14: 4487 Quit

a few sites are still giving the error from #109 https://github.com/EDmodel/ED2/issues/109 and I'm suspecting the two are connected and there's something off in dbalive_dt


crollinson commented 9 years ago

This appears to be a grass-specific problem that occurs when we try to make grasses cold-deciduous when IGRASS=1 (because then bdead=0)... going to double check a few more things, but it looks like this solves both this issue and #109

crollinson commented 9 years ago

A follow-up: While this issue and #109 seem to be grass-specific problems, right now reproduction happens in every month and phenology is based solely on drought. While not necessarily reflective of nature, I think it may make sense to restrict reproduction to when a PFT is leafed out. For temperate regions, this means that it could occur at any time for evergreens, but only during the growing season for deciduous plants. Thoughts?

crollinson commented 9 years ago

closing this & making a separate issue for reproduction phenology #112

crollinson commented 9 years ago

Still getting this error after several fixes to cohorts, and the issue is at the patch fusion level. Based on some recent patterns, I'm thinking this may be a precision error that might be helped by switching the patch fusion thermodynamics to double precision. This is indeed a separate error from #109 (cohort fusion) & #112 (reproduction).

Aside from making the code bulkier to convert everything to double prec., does anybody know a reason I shouldn't try it?
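To make the suspected precision issue concrete, here is a small Python sketch (not ED2 code; the magnitudes are made up) of how a small contribution can vanish when it is added to a much larger value in single precision while surviving in double precision. It uses the standard `struct` module to round values to single precision.

```python
import struct


def to_single(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def add_single(a, b):
    """Add two numbers, rounding the inputs and the result to single precision."""
    return to_single(to_single(a) + to_single(b))


large = 1.0e8   # made-up magnitude, e.g. a large extensive total
small = 1.0     # made-up contribution, e.g. from a very small patch

sum_single = add_single(large, small)   # the small term is rounded away
sum_double = large + small              # the small term survives

print(sum_single - large)  # 0.0
print(sum_double - large)  # 1.0
```

Single precision carries about 7 significant decimal digits, so near 1.0e8 adjacent representable values are 8 apart and an increment of 1 is lost. Whether this is what actually happens in patch fusion is the hypothesis being tested here, not an established cause.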

mdietze commented 9 years ago

I'd go ahead and give it a try in a branch. If you restrict yourself to patch fusion (i.e. keep out of the integrator) this shouldn't get out of hand. Once upon a time double prec was slower, but now almost all machines are 64 bit.

mpaiao commented 9 years ago

I think it's fine to try double precision. I'm just wondering if we should consider turning the entire code to double precision at some point...

crollinson commented 9 years ago

Double precision (or at least my clunky implementation) didn't help. Increasing min_patch_area from 0.01 to ~0.015 or greater seems to work for now.

crollinson commented 9 years ago

So the problem is that the changes @ryankelly-uiuc made in Jan 2015 (https://github.com/EDmodel/ED2/commit/2a5d68ebb291581c932a442e2701e553b24b1170) somehow got reverted to what @mpaiao had in Nov 2014 (https://github.com/EDmodel/ED2/commit/5f27ef308c71f68ccc6e4cfbb350aed98a8596a5).

What changed are the rules for whether the dmean loop etc. actually runs; it now gets triggered seemingly at random and causes problems.

2014 version:

    if (writing_long .and. (.not. fuse_initial) ) then

2015 version:

    if (writing_long) then
       if ( all(csite%dmean_can_prss > 10.0) ) then

I can't find where/when this happened, but I suspect it was an oversight during conflict resolution from this large pull request: #56

Is there a reason the can_prss check disappeared, and is there any reason I should not tack it on after the .not. fuse_initial?
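The combination being asked about can be sketched as follows. This is a hypothetical Python rendering of the logic only, not the Fortran in ED2 and not necessarily what ended up in the mainline; the function name and the `min_prss` parameter are invented for illustration, while `writing_long`, `fuse_initial`, `dmean_can_prss`, and the 10.0 threshold come from the two versions quoted above.

```python
def should_run_dmean_block(writing_long, fuse_initial, dmean_can_prss, min_prss=10.0):
    """Return True only when the conditions from both versions hold.

    writing_long, fuse_initial: boolean flags named in the thread.
    dmean_can_prss: per-patch daily-mean canopy pressure values.
    min_prss: threshold from the 2015 version (10.0).
    """
    # 2015 check: every patch must have a daily-mean pressure above the threshold.
    prss_ok = all(p > min_prss for p in dmean_can_prss)
    # 2014 check (writing_long and not fuse_initial) combined with the 2015 check.
    return writing_long and (not fuse_initial) and prss_ok


# Made-up pressure values for three scenarios.
print(should_run_dmean_block(True, False, [101325.0, 99000.0]))  # True
print(should_run_dmean_block(True, True, [101325.0, 99000.0]))   # False
print(should_run_dmean_block(True, False, [0.0, 101325.0]))      # False
```

The third case is the one the pressure check is meant to catch: a patch whose daily mean was never filled in keeps the block from running.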

crollinson commented 9 years ago

Submitted pull request #122 to re-implement Ryan's fixes in the ED mainline.