Logging event code > 10000 is not working due to mass balance error

sharma-bharat commented 1 year ago

We (Dr. Anthony Walker and I) were trying to perform logging on a specific day (icode > 10000) and we are getting a mass balance error, as follows:

Logging Event Enacted on date: 12-01-1855
 mass balance error detected
 element type (see PRTGenericMod.F90):            1
 error fraction relative to biomass stock:    0.0000000000000000
 absolut error (flux in - change):                        NaN
 call index:            3
 Element index (PARTEH global):           1
 net:   -148065.34787676064
 dstock:                        NaN
 seed_in:    5.4794520547945195E-002
 net_root_uptake:   -2.7694997765899812E-016
 gpp_acc:    43.792761064199176
 flux_generic_in:    0.0000000000000000
 wood_product:    148069.15841267366
 error from patch resizing:    0.0000000000000000
 burn_flux_to_atm:    0.0000000000000000
 seed_out:    0.0000000000000000
 flux_generic_out:    0.0000000000000000
 frag_out:    14.546684805935037
 aresp_acc:    25.490334865775818
 error=net_flux-dstock:                       NaN
 biomass   2913.8406328922529
 litter                       NaN
 seeds   237.54743874994409
 total stock                       NaN
 previous total   355119.52203159698
 lat lon   35.899999999999999        275.66669999999999

In the log file, it looks like the abs error is a NaN. That would likely be causing the mass balance error.

Could you please look into this issue and help us resolve it?

Thanks, Bharat

glemieux commented 1 year ago

@sharma-bharat the trigger for this error is due to the fact that the error is NaN which is one of the checks that generates this endrun call. The source of the NaN is the litter_stock which is determined by a call to PatchMassStock. This variable is just a summation of the current patch litter times the patch area. I'm guessing there must be some issue in the litter initialization. @rgknox any initial thoughts?

Can you provide some more details about the case you are running and the particular versions of fates?

rgknox commented 1 year ago

The error triggers immediately after the event, which suggests to me that the problem is directly tied to the event itself.

Here is the call to the balance check, which the log indicated, was call index 3: https://github.com/NGEET/fates/blob/sci.1.61.0_api.25.0.0/main/EDMainMod.F90#L284

Note that the check fails right after spawn_patches. So there must be some problem with the litter (also noting what @glemieux said) being generated from the disturbance event.

Here is the call to litter generation from a logging event: https://github.com/NGEET/fates/blob/sci.1.61.0_api.25.0.0/biogeochem/EDPatchDynamicsMod.F90#L608-L609

I would put print statements in that routine (or run a debugger), wherever we update the new_litter% and cur_litter% terms. The print statement will very often fail if trying to print a nan or inf, so keep that in mind too. Also, updating the xml file in your case director to debug mode might help:

./xmlchange DEBUG=True

Here is the first place that the litter gets updated: https://github.com/NGEET/fates/blob/sci.1.61.0_api.25.0.0/biogeochem/EDLoggingMortalityMod.F90#L607-L610

JoshuaRady commented 1 year ago

I'm not sure if this is relevant but logging events will throw a mass balance error with values of fates_mort_disturb_frac other than 1.0. @jkshuman and I identified this some time ago. Unfortunately, I never completed an issue for this problem as promised in issue #714. I could look back though my notes for more details if you think this could be related to your issue.

ckoven commented 1 year ago

@JoshuaRady thanks, that's an interesting point. In principle the fates_mort_disturb_frac parameter probably shouldn't apply to logging. FATES should probably assume that all logged area is disturbed, just as it also assumes that all burned area is disturbed.

walkeranthonyp commented 1 year ago

A comment in response to immediately above, I don't think fates_mort_disturb_frac is applied to logging, see here. So perhaps someone has changed that in the interim. In any case it was worth looking into and we get logging event code crashes with fates_mort_disturb_frac ranging from 0 to 1 (for certain combinations of logging related parameters).

walkeranthonyp commented 1 year ago

@sharma-bharat has been digging into the logging code and logging event failures that cause the code to crash. Our heads are in knots a little but I think we're getting an idea of what's going on.

The proximal issue is that with a single or annual logging event disturbance_rate here is 1. That means the new patch area is equal to the donor patch area. So the old patch area (remainder_area) becomes zero and when partitioning litter fluxes you get a divide by zero giving a NaN and throwing the error that crashes the model.

This can be fixed by writing a catch into the litter partitioning logic that zeros litter fluxes when remainder_area == 0 rather than calculating them.

I think that's worth doing for robustness but it doesn't answer why we get disturbance_rate == 1.0 when our logging parameter disturbance fractions sum < 1. I think we've traced it to the interface with the hlm/luh harvesting code. If the HLM input data requires logging I think it passes a switch and a harvest_rate parameter that defines the harvested/human disturbed area fraction. We then calculate the direct, collateral, mechanical logged disturbance by multiplying lmort_direct, lmort_collateral, lmort_infra each by harvest_rate and then to be sure the human disturbed area from the HLM is maintained we calculate the degraded fraction as l_degrad = harvest_rate - (lmort_direct + lmort_infra + lmort_collateral).

This is all good but when we initiate an annual logging event from the parameter file, we set harvest rate to 1. Combined with the above disturbance fraction logic, which remains the same, this means that when we initiate an annual logging event all of our area is disturbed.

I don't think that's the logic we want right? I'm not sure what degradation disturbance is supposed to represent in the case where we initiate a logging event from the param file. Seems like collateral and mechanical should cover that?

I think it's a fairly easy fix, I just don't know exactly what logic we want. The fix can go here and look something like:

       if (canopy_layer .eq. 1 .and. hlm_use_lu_harvest == itrue) then
            l_degrad = harvest_rate - (lmort_direct + lmort_infra + lmort_collateral) ! fraction passed to 'degraded' forest.
         else
            l_degrad = 0._r8
         endif

Or add an additional case to represent the degraded disturbance logic for a logging event initiated from the parameter file.

TL;DR: annual logging event from the param file causes all of a patch to be disturbed, resulting in a divide by zero in the new patch generation logic. Can fix this by 1) changing the harvest_rate / degraded land logic for a parameter file initiated logging event and 2) making the new patch creation logic robust to a zero remainder area (poss. only during a logging event).

@ckoven @rgknox @jkshuman @glemieux @JoshuaRady -- let us know what you think and we'll implement the fixes. At the least fix 1) poss 2).

ckoven commented 1 year ago

I don't think that's the logic we want right? I'm not sure what degradation disturbance is supposed to represent in the case where we initiate a logging event from the param file. Seems like collateral and mechanical should cover that?

@walkeranthonyp I think the key thing is to distinguish what is meant by each of the terms here. harvest_rate is meant to be the total area (over the timestep) that the harvest treatment is applied over, lmort_direct is the fraction of trees within the area harvested that are harvested, likewise for lmort_infra and lmort_collateral.

l_degrad is meant to be the remaining part of harvest_rate where the plants aren't killed for any reason during the harvest event, but the land becomes considered as degraded forest. Another way of thinking about this is that a logging treatment shouldn't result in any change in the overall composition of the patches that remain unharvested, their area should just decrease to be 1- harvest_rate times their prior area. The separation of patch area lumped into the l_degrad term is what allows this even in the case of a selective harvest, otherwise all the unharvested trees within the harvest area would still be left behind in the unharvested patches that remain and so the composition of the unharvested patches would change, in addition to their area.

In the case of the event-based logging, since the assumption is that it is applied to the entire area, I can see how the l_degrad logic may be unnecessary: all of the land should then be considered degraded forest because it was all within the harvest treatment area. So I think I support what you are proposing, it should revert the behavior back to what it was in https://bg.copernicus.org/articles/17/4999/2020/ .

ckoven commented 1 year ago

@walkeranthonyp I'm still trying to think this through some more in real time. The one concern that occurs to me is the subsequent dynamics due to patch labeling (which wasn't yet a thing in FATES as of https://bg.copernicus.org/articles/17/4999/2020/ ). If you set l_degrad to zero when the event-based logging is enabled, whatever land is not disturbed by the logging (either because it is underneath the crowns of trees that were not killed by direct + infra + collateral, or because the canopy is not closed by woody plants and so it was ground not occupied by any trees to begin with) will still have a patch label of primaryforest, whereas what is logged will be secondaryforest. So then that would mean those areas cannot subsequently be fused. So we may need to come up with an alternate solution that doesn't do that.

walkeranthonyp commented 1 year ago

For what we want to do with FACE simulations we actually want to clear the whole area, so it would all be labeled secondaryforest.

But for the more broad case where not the whole area is disturbed by logging - would what you suggest be a problem? Would it get messy with successive logging events, the code wouldn't know where to apply them or something.

ckoven commented 1 year ago

@walkeranthonyp so long as your canopy is completely closed, I think it should work fine. And yeah, maybe it shouldn't matter in the general case what the labels are at the site scale anyway, they really only get used in the larger-scale sense through the separate drivers for subsequent logging in order to allow large-scale forest rotations. Maybe give it a try and just make sure nothing weird happens?

JoshuaRady commented 1 year ago

As @ckoven confirmed above the addition of l_degrad a few years back means that any logging event from the parameter file will result in 100% of the patch being disturbed, regardless of logging settings. The logging removals are applied and everything that remains is placed into a new patch (1 old patch -> 1 new patch).

If you want to split the patch with logging, the change you propose should allow this but I would proceed with caution. In addition to the primary vs. secondary issue mentioned by @ckoven, there may be other issues. I reviewed the code at some point in the past and I recall I had some concerns that the mass partitioning code might not work as expected if the patch is split. Sorry to be vague but I haven't found my notes on that. I just know I had to do a lot of careful debugging when I implemented patch splitting in the VM Module.

Additionally, when a patch is split after a harvest the canopy can undergo rearrangement that leads to unpleasant and unrealistic side effects (again see issue #714).

JoshuaRady commented 1 year ago

Sorry, my comment above was too slow and @walkeranthonyp's subsequent post renders part it moot.

@walkeranthonyp, I think for what you want the current behavior is what you want. If you change l_degrad= 0 I think you risk splitting the patch if trees don't occupy 100% of the area, which is basically what @ckoven said in reverse.

Clearly the litter divide by zero issue should be fixed.

walkeranthonyp commented 1 year ago

Thanks for the rapid comments @ckoven @JoshuaRady. I'm somewhat agnostic to what we do with harvest_rate. My goal is to get my head around how this code is supposed to work, especially in relation to logging events specified in the parameter file, and try to avoid messing something up.

From your comments and reading of the code, logging enacted from the param file is supposed to affect 100% of the area so that all of a logged patch becomes a new patch. But damage to plants only happens through lmort_direct, lmort_collateral, lmort_infra. All l_degrad does is to bring the part of the patch not damaged by logging into the new patch. Is that accurate (or close enough)? If so I have two questions:

I guess that means PPA can kick in on the new logged patch so whatever trees are not damaged can fill the canopy rapidly. Is that the desired effect?
What happens in the case of HLM driven harvest if/when l_degrad is high? That seems inconsistent with the intention of the land use dataset which I guess assumes that all of the area defined by harvest_rate is damaged in some way - or is l_degrad always low in these simulations and/or lmort_collateral, lmort_infra and lmort_direct are controlled by the HLM / land use dataset?

walkeranthonyp commented 1 year ago

I have a fix for this that's working by catching when remainder_area is equal to zero in logging_litter_fluxes. Looking at the code now, I see there are similar catches for fire_litter_fluxes and mortality_litter_fluxes. I will align my edits with those and make a PR.

My Q's above still stand. Thinking more on this, by setting harvest_rate = 1 the current logging event behaviour is functionally equivalent to setting fates_mort_disturb_frac = 0. Is that the behaviour we're going for? Wouldn't that mean canopy gaps caused by logging could close v rapidly due to gap filling done according to PPA rather than gap-phase regrowth?

glemieux commented 1 year ago

My Q's above still stand. Thinking more on this, by setting harvest_rate = 1 the current logging event behaviour is functionally equivalent to setting fates_mort_disturb_frac = 0. Is that the behaviour we're going for? Wouldn't that mean canopy gaps caused by logging could close v rapidly due to gap filling done according to PPA rather than gap-phase regrowth?

This discussion will continue in #1003.

JessicaNeedham commented 1 year ago

Hi all, I’m on a recent commit of main branch (7f448b09 sci.1.65.2_api.25.4.0) that includes the fix from PR #996 but I’m seeing (nearly) the same error as described above - NaN in mass balance checks following event based logging.

I’m trying to simulate a one time stand clearing disturbance (all trees in all patches dead) and total logging seemed like the easiest way to go.

My fail has call index 4, rather than 3, so it is being triggered here after patch fusion (the TotalBalanceCheck after spawn_patches seems to pass ok).

Looks like a patch with very small negative area was somehow created?

Logging Event Enacted on date: 01-01-1993
See luc mortalities: 1.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000
See luc mortalities: 1.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000
mass balance error detected
element type (see PRTGenericMod.F90): 1
error fraction relative to biomass stock: 0.0000000000000000
absolut error (flux in - change): NaN
call index: 4
Element index (PARTEH global): 1
net: -22.672758641509233
dstock: NaN
seed_in: 219.17808219178085
net_root_uptake: 0.0000000000000000
gpp_acc: 98.886479060820022
flux_generic_in: 0.0000000000000000
wood_product: 0.0000000000000000
error from patch resizing: 1.9561048570576513E-010
burn_flux_to_atm: 0.0000000000000000
seed_out: 0.0000000000000000
flux_generic_out: 0.0000000000000000
frag_out: 257.97276723005825
aresp_acc: 82.764552664247461
error=net_flux-dstock: NaN
biomass 33859.880998928049
litter NaN
seeds NaN
total stock NaN
previous total 721358.60047075246
lat lon 9.2500000000000000 280.25000000000000

patch area: -6.2527760746888816E-013
AG CWD: NaN
BG CWD (by layer): NaN NaN NaN NaN \ NaN NaN NaN NaN NaN NaN
leaf litter: NaN
root litter (by layer): NaN NaN NaN NaN \ NaN NaN NaN NaN NaN NaN
anthro_disturbance_label: 1
use_this_pft: 1 1 1 1

Maybe I missed something in the discussion above. Is it possible to allow fates_landuse_logging_direct_frac = 1.0 and fates_landuse_logging_dbhmin = 0.0?

sshu88 commented 1 year ago

Hi @JessicaNeedham . These options (fates_landuse_logging_direct_frac = 1.0 and fates_landuse_logging_dbhmin = 0.0) work fine in my cases (global fixed_biogeog with no comp).

JessicaNeedham commented 1 year ago

Thanks @sshu88 , that's good to know. I only see this error with the logging event on certain dates in full FATES. I guess a certain patch structure prior to logging must be triggering it.

NGEET / fates

Logging event code > 10000 is not working due to mass balance error #961