NGEET / fates

repository for the Functionally Assembled Terrestrial Ecosystem Simulator (FATES)

Bug when merging patches #54

Closed ckoven closed 8 years ago

ckoven commented 8 years ago

Summary of Issue:

There appears to be a crashing bug in EDPatchDynamicsMod. The proximal issue is that the model crashes when fuse_2_patches is called. The underlying issue is that the same patch was being deallocated twice.

Expected behavior and actual behavior:

I tried to run a global 4x5 run and it crashed after 74 years. Expected behavior is that it not crash; actual behavior is that it does. I was able to repeat the crash, both with restarts and with a second identical case. The error log pointed to an issue in EDPatchDynamicsMod:

end run_mct
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line     Source
cesm.exe           0000000000FE6511  Unknown            Unknown  Unknown
cesm.exe           0000000000FE49C7  Unknown            Unknown  Unknown
cesm.exe           0000000000F86A54  Unknown            Unknown  Unknown
cesm.exe           0000000000F86866  Unknown            Unknown  Unknown
cesm.exe           0000000000F1BBF6  Unknown            Unknown  Unknown
cesm.exe           0000000000F26F2D  Unknown            Unknown  Unknown
libpthread.so.0    00002B1F65EC5710  Unknown            Unknown  Unknown
cesm.exe           00000000007D5127  edpatchdynamicsmo  1267     EDPatchDynamicsMod.F90
cesm.exe           00000000007CAC61  edmainmod_mp_ed_d  209      EDMainMod.F90
cesm.exe           0000000000504FEC  clm_driver_mp_clm  811      clm_driver.F90
cesm.exe           00000000004F5862  lnd_comp_mct_mp_l  436      lnd_comp_mct.F90
cesm.exe           000000000042AA8D  component_modmp    1044     component_mod.F90
cesm.exe           00000000004143AF  cesm_comp_modmp    2415     cesm_comp_mod.F90
cesm.exe           000000000042838D  MAIN__             93       cesm_driver.F90
cesm.exe           000000000041208E  Unknown            Unknown  Unknown
libc.so.6          00002B1F660F2D5D  Unknown            Unknown  Unknown
cesm.exe           0000000000411F99  Unknown            Unknown  Unknown

Steps to reproduce the problem (should include create_newcase or create_test command along with any user_nl or xml changes):

standard 4x5 global ED run, run for at least 74 years.

What is the changeset ID of the code, and the machine you are using:

Have you modified the code? If so, it must be committed and available for testing:

Yes: https://github.com/ckoven/ed-clm/commit/59e6e6e90dee77cceb37e7ebc31be0fb6fa216bd

The original code in fuse_2_patches had a warning in it from Stefan that that bit of code seemed fragile, so Ryan rewrote it. The rewritten code is a bit clearer, in that it has fewer inline pointers to pointers, and it is better documented, but the functionality is the same, so the rewrite alone didn't fix the problem. What did fix the problem was removing the line immediately following the call to fuse_2_patches, which tried to deallocate a piece of memory that is already deallocated inside fuse_2_patches. With that line removed, the crash goes away and the code proceeds onwards with its business.
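
For context, here is a minimal standalone toy (not FATES code; the type and variable names are purely illustrative) showing the double-deallocate pattern described above, where the callee frees its pointer argument and the caller then frees the same memory again through another pointer:

program double_dealloc_demo
  implicit none

  type :: patch_t
     real :: area
  end type patch_t

  type(patch_t), pointer :: keep, alias

  allocate(keep)
  keep%area = 1.0
  alias => keep             ! two pointers to the same patch, as in caller and callee

  call consume_patch(keep)  ! like fuse_2_patches: frees its argument internally

  deallocate(alias)         ! second deallocate of the already-freed memory:
                            ! invalid, and in practice a SIGSEGV or heap corruption

contains

  subroutine consume_patch(q)
    type(patch_t), pointer :: q
    deallocate(q)           ! the patch is freed here
  end subroutine consume_patch

end program double_dealloc_demo

Removing the caller-side deallocate, as described above, leaves exactly one owner of the memory.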

Screen output or output files showing the error message and context:

ckoven commented 8 years ago

Note that fuse_2_patches is called in two places in the code, and the other call is not followed by a deallocate, so we are reasonably confident that removing that line is in fact the right thing to do, as it makes the two call sites consistent. Note that @rgknox still thinks it's weird to deallocate the argument to a subroutine within the subroutine itself, and I tend to agree. So the alternate strategy is to reinstate the deallocate line that follows the call to fuse_2_patches in terminate_patches, add a line to deallocate the patch after the call to fuse_2_patches in fuse_patches, and delete the deallocate call inside fuse_2_patches.
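
A rough sketch of that alternate convention, with hypothetical placeholder patch names and an assumed argument order for fuse_2_patches (this is not the actual FATES code):

! fuse_2_patches would no longer deallocate the patch it absorbs; instead,
! each caller frees the merged-away patch itself.
! (placeholder names; argument order is an assumption)

! in terminate_patches:
call fuse_2_patches(tinypatch, neighborpatch)
deallocate(tinypatch)

! in fuse_patches:
call fuse_2_patches(donorpatch, currentPatch)
deallocate(donorpatch)

Either convention is fine as long as exactly one piece of code owns the deallocation.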

ckoven commented 8 years ago

OK, so perhaps that solution does not in fact work. The model now hangs in year 77, which seems like the kind of thing it would do if not-deallocated patches were building up...

ckoven commented 8 years ago

OK, so now I'm confused.

It turns out that there are some write statements already around the offending piece of code: https://github.com/NGEET/ed-clm/blob/master/components/clm/src/ED/biogeochem/EDPatchDynamicsMod.F90#L1305

Looking at the log files, the output from the first one of these is the following:

fusing patches because one is too small 9.989301881637238E-004 5.38576249731163 6483.79440730654 1.67309893708168
1.084576068712460E-026

Note that this is the first time I see this phrase in any of my log files, so this is definitely an infrequent occurrence.

Before the code changes in 59e6e6e, it would crash before reaching the next write statement. After the code changes in 59e6e6e, it did reach that write statement and reported the following:

after fusion 6483.79540623672 1.129761864022804E-026

but it seems that something is wrong, as the model then hangs two years later (unless the hanging is completely unrelated and is just the next bug it would have encountered anyway, which I won't rule out completely). Since this fusion only happens once, it doesn't seem like it could be a memory buildup issue, but I don't really understand the logic here.

@rosiealice, I'm guessing this bit of code is wrapped in write statements because it has misbehaved in the past. Any insight into what's going on here?

ckoven commented 8 years ago

OK, I did another test and now I think the subsequent hang is unrelated. The test that I did to arrive at this conclusion was to go back to the master code, and modify only min_patch_area in EDTypesMod.F90 to be smaller by two orders of magnitude, and then re-run from the restart before the original crash occurred. What this meant is that the relevant code in terminate_patches was never triggered and the patches were both kept as-is without fusing. I confirmed this by searching for the string "fusing patches because one is too small" in the log file and it did not appear. The result of the test is that the code proceeds past the original crash point encountered with unmodified master code (in year 74, month 12), and then still hangs at the same timestep (# 1340545, in year 77, month 06) as it does with the changes in 59e6e6e. This would seem highly unlikely to occur if the bug were being encountered in the same gridcell, since the patch structure, and therefore everything's internal state, is different as a result of changing min_patch_area. So, the hang must therefore be happening in a different gridcell.

So: what this means is that:

(a) the original code change in 59e6e6e is likely not the culprit, and may be the correct solution. I believe it is in fact correct, but I have no idea how to confirm that, other than by staring at the code. It will most definitely not be encountered in any of our standard tests, since it requires an older ecosystem than is possible by running for a short period starting from bare ground.

(b) we have another, unrelated bug that is causing the model to hang, with no other diagnostic clues, 77 years into a 4x5 simulation. This one would seem to be quite difficult to track down, as I have no idea how to go about identifying such a bug.

(c) this all underscores the fact that our current testing infrastructure is not detecting these edge cases that occur under slightly obscure, yet still regularly encountered, situations. I think we need to think about adding some sort of long-term stability test that may not be in the regular suite but that we still run with some regularity to find these things. 200 years at global 4x5 without crashing would seem like a decent criterion to me, and one that we cannot yet reach.

rosiealice commented 8 years ago

Hi Charlie,

Thanks for this. I agree on the need for long-term tests, given how common a problem it is that the test suite doesn't catch anything that happens further out than 5 days, while lots of different ED states are only reached well beyond 5 days. The hanging bug is a nasty thing. I never really figured out how to debug those either. Stefan thought that he had fixed the previous one, but it might have been that a CLM tag update meant that the new ED state was different, and so didn't trigger the hanging circumstances. I've chased these around in the past with write statements, but they always lead me to some obscure, non-ED part of the code that is precompiled, or something like that.

Apologies that this isn't more helpful, but nice work on eliminating the small patch issue. Nonetheless, the hang might still be caused by something related to being in a state where we have such small patches.

-Rosie

ckoven commented 8 years ago

Rosie,

Thanks, that's a good point that we can't completely rule out the possibility of it being in the same patch, if it's just a matter of something happening due to the small patch size. I guess the other thing I could try would be to increase the value of min_patch_area by an order of magnitude or so from the standard case, and then run a new case from year 1 to see if it still hangs at the same point. Presumably, if the bug is related to something in that gridcell, either in an unmerged patch or in a post-merged patch, then it ought to trigger at some earlier date if the merge happened earlier due to the patch crossing the min_patch_area threshold earlier? I can try that, but it still strikes me as pretty unlikely.

re: the hanging bugs; yeah, this seems pretty difficult. In your experience, do these tend to be associated with infinite while loops over the patch/cohort structure? I'm quite used to making CLM crash, but not really hang, so it seems to me that this kind of thing might be a unique feature of the while loops in the ED code. One possibility (inelegant, but it may get the job done) would be to put some sort of extreme-debug-only write statement inside every single one of these. I count about 100 or so of these loops (142 hits for "while" when grepping through the ED subdirectory, but many of those are comments on the enddo statements) spread across 10 or so files: so a lot, but not more than an hour or so of work with an emacs macro. That way we could bring the model just up to the edge of the hang, and then turn on the write statements to see if we are caught in such a loop. This would seem like the kind of thing that CPP tokens are actually kosher for, as they would add some debuggability to the heavy use of while loops that the ED structure requires, at no runtime cost...
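
To make that concrete, here is a minimal standalone toy (not FATES code; the CPP token name ED_EXTREME_DEBUG and the tag string are hypothetical) showing the idea of a debug-only, uniquely tagged write statement inside a while loop:

program while_tag_demo
  implicit none
  integer :: i

  i = 0
  do while (i < 3)
#ifdef ED_EXTREME_DEBUG
     ! prints a unique per-loop tag every iteration, so an infinite loop
     ! identifies itself in the log; compiled out when the token is not defined
     write(*,*) 'in while loop Qw3rT9, iteration ', i
#endif
     i = i + 1
  end do
end program while_tag_demo

Compiled with, e.g., gfortran -cpp -DED_EXTREME_DEBUG, the tag prints every iteration; without the flag the loop runs silently, so there is no runtime cost in normal builds.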

rosiealice commented 8 years ago

Hi.

So, my experience was that the hang happens outside the ED code. To figure this out, you need to add 'leaving ED' and 'starting ED' write statements, ideally with something to distinguish the processors. That way you can tell whether it's a debuggable ED thing (as above, for infinite loops) or something else.

ckoven commented 8 years ago

OK, so I put in write statements at the ED entry and exit points (i.e. where they currently are, but not just for masterproc), and based on my count of entry and exit stamps in the output log, the hang is happening within the ED daily timestep. So on to plan B: assign a random string to every while loop and output that string every loop iteration to see if one of them is infinite.

rosiealice commented 8 years ago

That's good, actually, since it means that it's likely not the same as the hanging bug I had before, and that it might thus be a tractable thing.

ckoven commented 8 years ago

Agreed. For my own future reference, and possibly as a team resource for the next time this kind of thing crops up, here is an emacs function I found via google for inserting a random string, since there is (slightly shockingly) no such feature by default. Put it in ~/.emacs.d/init.el ...

(defun insert-random-string ()
  "Insert a random alphanumeric string of length 6."
  ;; code copied from here: https://groups.google.com/forum/#!topic/comp.lang.lisp/iza6WR4VJNY
  (interactive)
  (let (mycharset (ii 0))
    (setq mycharset ["a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
                     "n" "o" "p" "q" "r" "s" "t" "y" "v" "w" "x" "y" "z"
                     "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M"
                     "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
                     "1" "2" "3" "4" "5" "6" "7" "8" "9" "0"])
    (while (< ii 6)
      (insert (elt mycharset (random (length mycharset))))
      (setq ii (1+ ii)))))
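
For the record, once this is loaded (or the function is evaluated), M-x insert-random-string inserts a six-character alphanumeric tag at point, which can then be pasted into each loop's debug write statement.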

ckoven commented 8 years ago

Well, that appears to have worked. It seems to be caught in an infinite loop within this area loop: https://github.com/NGEET/ed-clm/blob/master/components/clm/src/ED/biogeochem/EDCanopyStructureMod.F90#L130

Looks like there are a bunch of cohort while loops within that, and none of them are triggering write statements, so perhaps it is in a patch with no cohorts...

rosiealice commented 8 years ago

nice job!

ckoven commented 8 years ago

OK, so I think I sorted this out. The error is in the canopy structure code, and it happens when the patch area is too small. So I put in some logic to only do the canopy structure calculation when patch%area is greater than min_patch_area; this is now in tag 9d0dd5d.
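
Roughly, the guard looks something like the following sketch (not the literal 9d0dd5d change; the variable names follow the ones quoted elsewhere in this thread):

! Skip the canopy structure calculation for patches below the size threshold.
if (currentPatch%area > min_patch_area) then
   ! ... existing canopy layering calculation ...
end if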

Also, this is unrelated to the first bug I encountered because I checked and they are happening in different gridcells, though both are consequences of small patch areas when things age.

So two questions remain: (1) why is this happening? Shouldn't the min_patch_area logic prevent small patch areas from occurring? (2) do I need to set any of the canopy structure variables by hand to give reasonable numbers when I skip the calculation for small patch areas? The current code proceeds beyond the original hang point now, but I would like to confirm that nothing weird will happen from skipping these calculations.

ckoven commented 8 years ago

OK, now the model crashes again a few years later. I suspect that what's going on here is that the min_patch_area criterion is not being obeyed. To me the key line is this:

https://github.com/NGEET/ed-clm/blob/master/components/clm/src/ED/biogeochem/EDPatchDynamicsMod.F90#L1302

The hedge in that line is that it doesn't force the merge if either (a) the tiny patch is the oldest, or (b) the tiny patch is the youngest. I get why it needs to not force the youngest patches to merge (since they will have area added to them via disturbance, whereas the old patches ought only to get smaller). But I suggest we more strongly enforce the min_patch_area threshold at the old end. I will try out some logic to force the merge of a tiny oldest patch into its younger neighbor as a first step, and see if that helps the stability of the model as things get older.
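
Something along the following lines, assuming the patch linked list exposes %older/%younger pointers and that fuse_2_patches retains its second argument (a sketch of the intent only, not actual FATES code):

! If the oldest patch has shrunk below min_patch_area, fuse it into its
! younger neighbor rather than letting it persist.
if (currentPatch%area <= min_patch_area .and. &
    .not. associated(currentPatch%older) .and. &   ! this is the oldest patch
    associated(currentPatch%younger)) then
   call fuse_2_patches(currentPatch, currentPatch%younger)
end if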

ckoven commented 8 years ago

OK, the change to better enforce the min_patch_area criterion (bb8ac55) avoids the subsequent crash that I had run into. The model fused a bunch of patches in the first timestep after restart, and for now it appears to be chugging along past where it crashed before.

ckoven commented 8 years ago

With the most recent code change in place (bb8ac55), the model now seems to be stable on long-term runs. It has run to 120 years now and things still seem to be running smoothly.

@rosiealice: I am not sure whether to undo the changes in 9d0dd5d or leave them in. There is still the potential for small patches to go through the canopy construction step if they are the youngest patch on a site, as the fusion logic still lets those persist. Checking the log files, this does indeed still happen, though after a while a younger patch appears, and once that younger neighbor is in place the tiny patch immediately fuses into it. But it is an unlikely condition: it has only happened once in the 50 years at global 4x5 since I fixed the fusion logic. My guess is that the canopy structure code is acting as the canary in the coal mine for small-patch numerical weirdness because of the construction of the while loop, which has a small-but-not-that-small number in its logic: do while((arealayer(i)-currentPatch%area) > 0.000001_r8). So once these sorts of things do occur, this is a likely place for them to show up, but if a youngest patch were sufficiently small it could still crash things elsewhere. So I sort of think that leaving the 9d0dd5d changes in place at least leaves evidence in the log file that things are headed in a bad direction, which might make debugging other problems easier.

But if I do leave the logic in place, I am still worried that by skipping that whole block of code, key variables are not being set, which could produce other types of weirdness. Is there anything that must be set there, and if so, what is the least dangerous value to set it to?

rgknox commented 8 years ago

Back from panel, trying to catch up to you guys again! @ckoven, I don't know if you have answered your earlier question: "why is this happening? shouldn't the min_patch_area logic prevent small patch areas from occurring?"

The parameter min_patch_area is newly introduced. It is possible that there are locations in the code where patch area is compared to a hard-coded threshold that should be replaced with min_patch_area, but that I did not catch when adding the parameter.

I think we also need a FATES constants file for stuff like this.

ckoven commented 8 years ago

@rgknox the reason it was happening was the logic that I changed in bb8ac55. It mostly isn't happening anymore with that change in place.

rgknox commented 8 years ago

Sorry, but I have to ask: what about the "mostly" part?

ckoven commented 8 years ago

Like I said above, if the youngest patch on a site is tiny, it still doesn't get merged. In practice this doesn't appear to occur very often, though in theory it can. Hence my inclination to leave that check in place in the canopy construction code, to avoid issues there.