geco-bern / rsofun

Implements the Simulating Optimal FUNctioning framework for site-scale simulations of ecosystem processes, including model calibration. It contains Fortran 90 modules for the P-model, SPLASH, and BiomeE models.
https://geco-bern.github.io/rsofun/
GNU General Public License v3.0
25 stars 28 forks source link

Out of bound parameters cause memory leak #69

Closed khufkens closed 2 years ago

khufkens commented 2 years ago

Back to square one on debugging, might be more than a tf_base issue:

khufkens commented 2 years ago

Long short, LM3PPA isn't numerically stable. When you call the same model run x-thousand times (outside an optimization routine, with stable parameters) you will see fluctuations at times (due to random numbers being called somwhere? @stineb Ensheng?)

The issue still revolves around relayering the cohorts, where this generates negative individuals or crown areas. Although the NA values have been addressed, this still gives issues with other "strange" values. The outputs between the two runs below should be identical, but aren't. Resulting in a faulty increment on this line.

https://github.com/computationales/rsofun/blob/c5dd1712d04872a76530102da5469184b4f33278/src/vegetation_lm3ppa.mod.f90#L1286

Not sure where the error orginates, but it is clear that negative crown areas should not exist.

[RUN ONE] "-----"
           1           1   5.00000007E-02   6.13197908E-02
           1           1
   0.00000000       3.06598959E-03
           1           1   5.00000007E-02   6.13197908E-02
           1           1
   0.00000000       3.06598959E-03
           1           1   4.94724736E-02   8.25790912E-02
           1           1
   0.00000000       4.08539176E-03
[RUN TWO] "-----"
           1           1   5.00000007E-02   6.13197908E-02
           1           1
   0.00000000       3.06598959E-03
           1           1   5.00000007E-02   6.13197908E-02
           1           1
   0.00000000       3.06598959E-03
           1           1   4.94999737E-02  -1.36484974E-03
           1           1
   1.00006759       1.00000000    
           1           1
           2           1
wengensheng commented 2 years ago

Thank you, Koen! LM3-PPA has chaotic behavior. So, you can see the oscillations of population, partly because of the likely "prey-predator" relationship between top layer and understory layer trees and the logistic growth (maybe, not sure yet). As I can recall, it happened when top layer trees have relatively low mortality rate and there is a large gap between top layer trees and understory trees. This behavior is interesting actually, though I never got time to figure what parameter ranges lead to it. I didn't expect the negative crown area issue. I will check why it happens.

I really appreciate it. Koen. Thank you very much! Ensheng

On Tue, Mar 29, 2022 at 10:14 AM Koen Hufkens @.***> wrote:

Long short, LM3PPA isn't numerically stable. When you call the same model run x-thousand times (outside an optimization routine, with stable parameters) you will see fluctuations at times (due to random numbers being called somwhere? @stineb https://github.com/stineb Ensheng?)

The issue still revolves around relayering the cohorts, where this generates negative individuals or crown areas. Although the NA values have been addressed, this still gives issues with other "strange" values. The outputs between the two runs below should be identical, but aren't. Resulting in a faulty increment on this line.

https://github.com/computationales/rsofun/blob/c5dd1712d04872a76530102da5469184b4f33278/src/vegetation_lm3ppa.mod.f90#L1286

Not sure where the error orginates, but it is clear that negative crown areas should not exist.

[RUN ONE] "-----" 1 1 5.00000007E-02 6.13197908E-02 1 1 0.00000000 3.06598959E-03 1 1 5.00000007E-02 6.13197908E-02 1 1 0.00000000 3.06598959E-03 1 1 4.94724736E-02 8.25790912E-02 1 1 0.00000000 4.08539176E-03 [RUN TWO] "-----" 1 1 5.00000007E-02 6.13197908E-02 1 1 0.00000000 3.06598959E-03 1 1 5.00000007E-02 6.13197908E-02 1 1 0.00000000 3.06598959E-03 1 1 4.94999737E-02 -1.36484974E-03 1 1 1.00006759 1.00000000 1 1 2 1

— Reply to this email directly, view it on GitHub https://github.com/computationales/rsofun/issues/69#issuecomment-1081925457, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABWR5WWWIF5W7HTYA3QXEHDVCMF27ANCNFSM5R2PQWRQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>

khufkens commented 2 years ago

Hi @wengensheng (your handle didn't pop up earlier but you caught it anyway),

You can use my personal fork (not in the CES project but in khufkens). If you compile / install the script after alterations you can see the behaviour if you uncomment the code to remove the negative crown area. There are two scripts calib_model.R (which is an optimization run) and calib_model_2.R which is just a loop calling the same model run over and over (using the same data and parameters). You can use these for testing.

Given the segfault RStudio wil crash so best to run this on a terminal using:

Rscript -e "devtools::install()"; Rscript analysis/calib_model_2.R

This will compile a fresh package (after changes to the fortran code) and run the calibration. I removed all verbose output so you might want to add some of this back in (printing crown area and individuals in the cohort re-layering is most informative).

khufkens commented 2 years ago

Hi @wengensheng,

This line: https://github.com/computationales/rsofun/blob/6f48cdc52bbc45574365c3b6c4a0ced47c8891c0/src/vegetation_lm3ppa.mod.f90#L2209

Is the probable culprit I think.

You define r as random number from a fixed seed above, but the line calls plain old rand() without a fixed seed. This will cause computational variability between runs despite the same drivers and parameters on initiation of the cohorts.

I think the lines should read:

        cx%species = INT(r*5)+1
        cx%nindivs = r / 10. ! trees/m2
        btotal     = r * 100.0  ! kgC /tree

or (with the fixed seed)

        cx%species = INT(rand(rand_seed)*5)+1
        cx%nindivs = rand(rand_seed) / 10. ! trees/m2
        btotal     = rand(rand_seed) * 100.0  ! kgC /tree
wengensheng commented 2 years ago

Thank you, Koen. This section (Lines 2196~2240) can be deleted. This was a test to randomly initialize vegetation if the initial vegetation was not defined. This should not happen. That is, a user must define initial vegetation in the parameter file. In line 2140, we can set "if (read_from_parameter_file) then" must be true, otherwise stop the model.

Ensheng

On Wed, Mar 30, 2022 at 4:52 AM Koen Hufkens @.***> wrote:

Hi @wengensheng https://github.com/wengensheng,

This line:

https://github.com/computationales/rsofun/blob/6f48cdc52bbc45574365c3b6c4a0ced47c8891c0/src/vegetation_lm3ppa.mod.f90#L2209

Is the probable culprit I think.

You define r as random number from a fixed seed above, but the line calls plain old rand() without a fixed seed. This will cause computational variability between runs despite the same drivers and parameters on initiation of the cohorts.

I think the lines should read:

    cx%species = INT(r*5)+1
    cx%nindivs = r / 10. ! trees/m2
    btotal     = r * 100.0  ! kgC /tree

or (with the fixed seed)

    cx%species = INT(rand(rand_seed)*5)+1
    cx%nindivs = rand(rand_seed) / 10. ! trees/m2
    btotal     = rand(rand_seed) * 100.0  ! kgC /tree

— Reply to this email directly, view it on GitHub https://github.com/computationales/rsofun/issues/69#issuecomment-1082803900, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABWR5WVBMAX5G2MEYCLI26DVCQI4VANCNFSM5R2PQWRQ . You are receiving this because you were mentioned.Message ID: @.***>

khufkens commented 2 years ago

This traces back to the natural mortality at times reducing either crown area or individuals below 0

Tracing the error in biosphere_lm3ppa.mod.f90 you get a shift to negative values (for certain parameter specs between C and D. After printing C call vegn_nat_mortality( vegn ) is called.

 A: vegn%cohorts(:)%nindivs   5.00000007E-02
 B: vegn%cohorts(:)%nindivs   5.00000007E-02
 C: vegn%cohorts(:)%nindivs   5.00000007E-02
 D: vegn%cohorts(:)%nindivs  -5.00001013E-04
 E: vegn%cohorts(:)%nindivs  -5.00001013E-04
 F: vegn%cohorts(:)%nindivs  -5.00001013E-04
khufkens commented 2 years ago

Two things are changed to fix this issue:

  1. Tolerance values in relayer_cohorts() is evaluated on the non absolute value.

https://github.com/computationales/rsofun/blob/c5dd1712d04872a76530102da5469184b4f33278/src/vegetation_lm3ppa.mod.f90#L1295

Negative values larger than the tolerance (if taken as abs()) will trigger the if routine.

  1. The remaining instability (due to parameter choices) is addressed by properly constraining mortality rates. Deathrate in the DBH mortality setup can exceed 1 so you will subtract more individuals than present in the pool - resulting in negative individuals which then trickles down to the relayer routine which fails.

https://github.com/computationales/rsofun/blob/c5dd1712d04872a76530102da5469184b4f33278/src/vegetation_lm3ppa.mod.f90#L863

deathrates should always be bound by an upper limit of 1!, i.e.

deathrate = min(1.0, deathrate)

https://github.com/computationales/rsofun/blob/c5dd1712d04872a76530102da5469184b4f33278/src/vegetation_lm3ppa.mod.f90#L826

khufkens commented 2 years ago

Closed as per: https://github.com/computationales/rsofun/pull/71

wengensheng commented 2 years ago

Thank you, Koen. As for the fixation, you can also do (in Line 864): deadtrees = cc%nindivs (1.0 - exp(-deathrate)) This is also the standard way to do continuous mortality. deathrate is a scalar with unit of fraction/time period. If "deathrate" is small, it is very close to "cc%nindivs deathrate". Beni and Laura, Usually, tree mortality rate is small, ranging from 0.01~0.05. Even for the understory seedlings, it is set as around 0.3~0.5/year, which is extremely high. How is it close to 1.0 in here? The line 863, "deathrate = deathrate + 0.01", seems a big hack to me? Why do this? If the "deathrate" was not calculated in those "if-thens", it will be getting greater with cohort numbers

Best, Ensheng

On Thu, Mar 31, 2022 at 9:42 AM Koen Hufkens @.***> wrote:

Two things are changed to fix this issue:

  1. Tolerance values in relayer_cohorts() is evaluated on the non absolute value.

https://github.com/computationales/rsofun/blob/c5dd1712d04872a76530102da5469184b4f33278/src/vegetation_lm3ppa.mod.f90#L1295

Negative values larger than the tolerance (if taken as abs()) will trigger the if routine.

  1. The remaining instability (due to parameter choices) is addressed by properly constraining mortality rates. Deathrate in the DBH mortality setup can exceed 1 so you will subtract more individuals than present in the pool - resulting in negative individuals which then trickles down to the relayer routine which fails.

https://github.com/computationales/rsofun/blob/c5dd1712d04872a76530102da5469184b4f33278/src/vegetation_lm3ppa.mod.f90#L863

deathrates should always be bound by an upper limit of 1!, i.e.

`` deathrate = min(1.0, deathrate)

https://github.com/computationales/rsofun/blob/c5dd1712d04872a76530102da5469184b4f33278/src/vegetation_lm3ppa.mod.f90#L826

— Reply to this email directly, view it on GitHub https://github.com/computationales/rsofun/issues/69#issuecomment-1084595214, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABWR5WW3CLXHXS2C23FIKYLVCWTVPANCNFSM5R2PQWRQ . You are receiving this because you were mentioned.Message ID: @.***>

khufkens commented 2 years ago

Thanks @wengensheng,

Yes, you have a race condition going on indeed, where the increment + 0.01 will get out of hand exceed unity which then implies negative individuals. I've no history on the mortality specifications so I leave sorting this to @lauramarques and @stineb.

stineb commented 2 years ago

Thanks @wengensheng and @khufkens , I have no recollection for why the + 0.01 was introduced. Do you, @lauramarques ? Comparing BiomeE-Allocation, from where we've copied the code, shows that the mortality formulation was changed quite a bit. In particular, the factor deltat/seconds_per_year has been removed? @lauramarques

lauramarques commented 2 years ago

As far as I recall, the deathrate was very low for the first simulations we run and that is why we added the +0.01. Also because of the small deathrate, we simplified the calculation of deadtrees to "cc%nindivs * deathrate".

wengensheng commented 2 years ago

Then, this line (deathrate = deathrate + 0.01) can be removed. The section between lines 724~861 can be cleaned up. There are tests about the ideas of removing the largest trees when total CAI is high.

On Thu, Mar 31, 2022 at 1:16 PM Laura Marqués @.***> wrote:

As far as I recall, the deathrate was very low for the first simulations we run and that is why we added the +0.01. Also because of the small deathrate, we simplified the calculation of deadtrees to "cc%nindivs * deathrate".

— Reply to this email directly, view it on GitHub https://github.com/computationales/rsofun/issues/69#issuecomment-1084874018, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABWR5WTRQF4XXBGDC3NHQQLVCXMWPANCNFSM5R2PQWRQ . You are receiving this because you were mentioned.Message ID: @.***>

stineb commented 2 years ago

I see this has been closed, but have resolved the question

_In particular, the factor deltat/seconds_per_year has been removed? @lauramarques_

?