CICE-Consortium / Icepack

Development repository for sea-ice column physics

Divide by zero bug #333

Closed dabail10 closed 3 years ago

dabail10 commented 3 years ago

There is a slight chance that this code in icepack_therm_itd.F90 will result in a divide by zero in the slope calculation when hicen_init(n+1) ~= hicen_init(n). This does not arise very often, but a check is needed before dividing.

     if (hicen_init(n)   > puny .and. &
         hicen_init(n+1) > puny) then
         ! interpolate between adjacent category growth rates
         slope = (dhicen(n+1) - dhicen(n)) / &
             (hicen_init(n+1) - hicen_init(n))
         hbnew(n) = hin_max(n) + dhicen(n) &
                  + slope * (hin_max(n) - hicen_init(n))
     elseif (hicen_init(n) > puny) then ! hicen_init(n+1)=0
         hbnew(n) = hin_max(n) + dhicen(n)
     elseif (hicen_init(n+1) > puny) then ! hicen_init(n)=0
         hbnew(n) = hin_max(n) + dhicen(n+1)
     else
         hbnew(n) = hin_max(n)
     endif
dabail10 commented 3 years ago

Some more context here. This can only happen when hicen_init(n+1) ~= hicen_init(n) ~= hin_max(n), so you are running into a 0 / 0 situation. My proposed fix would be as follows:

     denom = max(puny, hicen_init(n+1) - hicen_init(n))
     slope = (dhicen(n+1) - dhicen(n)) / denom
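For concreteness, this is how the change would slot into the block at the top of the issue (denom would be a new local real; this is a sketch, not tested code):

     if (hicen_init(n)   > puny .and. &
         hicen_init(n+1) > puny) then
         ! interpolate between adjacent category growth rates,
         ! flooring the denominator at puny to avoid a divide by zero
         denom = max(puny, hicen_init(n+1) - hicen_init(n))
         slope = (dhicen(n+1) - dhicen(n)) / denom
         hbnew(n) = hin_max(n) + dhicen(n) &
                  + slope * (hin_max(n) - hicen_init(n))
     elseif (hicen_init(n) > puny) then ! hicen_init(n+1)=0
         hbnew(n) = hin_max(n) + dhicen(n)
     elseif (hicen_init(n+1) > puny) then ! hicen_init(n)=0
         hbnew(n) = hin_max(n) + dhicen(n+1)
     else
         hbnew(n) = hin_max(n)
     endif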

thoughts? @eclare108213 @proteanplanet

proteanplanet commented 3 years ago

I agree this problem should be mitigated, unlikely as it is to occur. The solution suggested above is accuracy degrading, so an alternative needs to be considered.

dabail10 commented 3 years ago

Can you suggest an alternative that is not accuracy degrading? We use puny as the criterion for something being "zero" everywhere else.

dabail10 commented 3 years ago

This is related to https://github.com/CICE-Consortium/CICE/issues/502.

eclare108213 commented 3 years ago

I'm okay with using puny, since this is a very rare situation, but I agree with @proteanplanet that there probably is a more elegant way to fix it by changing the structure of the logic.

proteanplanet commented 3 years ago

I was up late last night working on a solution, and I'm close to posting it here.

dabail10 commented 3 years ago

I have a page of math where I was trying to figure out the limit behaviour as we approach 0 / 0. I even did a calculation in NCL with some typical numbers. For example in the case:

     hin_max = fspan(1,5,5)
     hbnew1 = hin_max * 0.0
     hbnew2 = hin_max * 0.0

     dhice1 = (/0.0,0.1,0.2,0.3,0.4/)
     dhice2 = dhice1

     hicen_init1 = (/0.5,1.5,2.5,3.5,4.5/)
     hicen_init2 = (/0.5,1.9999999,2.0000001,3.5,4.5/)

With hicen_init1, the slope for category two is (0.2-0.1)/(2.5-1.5) or 0.1, so hbnew would be 2 + 0.1 + 0.1x0.5 or 2.15. For category three the slope is (0.3-0.2)/(3.5-2.5) or 0.1 and hbnew is 3 + 0.2 + 0.1x0.5 or 3.25.

With hicen_init2, the category-two denominator (2.0000001 - 1.9999999) is down at the level of single-precision roundoff, and we get 2.2 and 3.2666667 for hbnew, i.e. 2 + 0.1 + slope x (2 - 1.9999999) for category two and 3 + 0.2 + (0.1/1.4999999) x (3 - 2.0000001) for category three.

So, that was why I ended up just using puny. I just couldn't figure out what the formula in the limit of 0 / 0 would be. Hopefully Andrew has better luck.
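For reference, one way to write that limit, assuming (optimistically) that the growth rate varies smoothly with the initial thickness h:

$$\lim_{h_{n+1} \to h_n} \frac{\Delta h_{n+1} - \Delta h_n}{h_{n+1} - h_n} = \left.\frac{\partial (\Delta h)}{\partial h}\right|_{h = h_n}$$

where h_n stands for hicen_init(n) and \Delta h_n for dhicen(n). The slope tends to the derivative of the growth rate with respect to thickness, which is finite only if the two growth rates approach each other along with the thicknesses; if dhicen(n+1) and dhicen(n) stay distinct while the denominator vanishes, there is no finite limiting value to fall back on.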

eclare108213 commented 3 years ago

The formatting in your previous comment is a little messed up, @dabail10 but nevertheless, this may be part of the reason (or just a symptom) of our not being able to get the results to converge with increasing numbers of thickness categories. This is a harder problem to solve -- I worked on it a few years back (that's what the asymptotic itd option was for), and didn't figure it out. I'll alert @whlipscomb who might have a suggestion.

dabail10 commented 3 years ago

My mind is a bit messed up too ... :) I changed the asterisk symbols to x and I believe it looks better.

proteanplanet commented 3 years ago

Part of the problem is that the calculation is being done as a mean thickness. My solution does not make the conversion and therefore deals purely in v and a, rather than h.

proteanplanet commented 3 years ago

I believe the stability criterion is:

     aicen(n+1) * aicen(n) * ( vicen_init(n+1)*aicen_init(n) - vicen_init(n)*aicen_init(n+1) ) > puny

The expression inside the first conditional can be rewritten as:

     a1(n) = aicen_init(n+1) * aicen_init(n) * ( vicen(n+1)*aicen(n) - vicen(n)*aicen(n+1) )

     a2(n) = aicen(n+1) * aicen(n) * ( vicen_init(n+1)*aicen_init(n) - vicen_init(n)*aicen_init(n+1) )

     hbnew(n) = hicen(n) + (a1(n)/a2(n)) * (hin_max(n) - hicen_init(n))

Therefore there is a timestep criterion that is currently not being accounted for, since stability is dependent both on init areas and subsequent areas. I would also not be surprised if this contributes to non-convergence with increasing ncat, since

aicen(n+1)*aicen(n)*aicen_init(n)

and

aicen(n+1)*aicen(n)*aicen_init(n+1)

decrease with increasing ncat.

This is the basis, I believe, for reformulating the numerics.

dabail10 commented 3 years ago

I have to mull this over, but I'm not sure it solves the problem. What happens as a2(n) approaches zero? What should hbnew(n) be when a2(n) < puny?

proteanplanet commented 3 years ago

I have to mull this over, but I'm not sure it solves the problem. What happens as a2(n) approaches zero? What should hbnew(n) be when a2(n) < puny?

I'm thinking on that at the moment. However, the most disturbing part is the timestep dependence that we have been neglecting. I suspect there is a CFL-like criterion that has been missed.

whlipscomb commented 3 years ago

Yes, there is a CFL-like criterion here. To make up some numbers: Say we have category boundaries at 1 m , 2 m, 3 m, etc., and initially the ice thickness lies at the midpoint of each category: 0.5 m, 1.5 m, 2.5 m, etc. Now suppose we take a long time step, say 6 months from September to March, and the new thicknesses in categories 1 and 2 are 1.9 m and 1.8 m, respectively, so that now the category 1 ice is thicker than the category 2 ice. Then the algorithm will no longer work. This is like taking an excessive time step and having trajectories cross in a 1D advection problem.

Maybe the easiest way to state the stability criterion is that for each n up to ncat-1, we have hicen(n+1) - hicen(n) > dhcat_min, where hicen(n) = hicen_init(n) + dhice(n), and dhcat_min is a CFL-related threshold that we choose. You might be able to get away with dhcat_min = puny, but it would be safer to choose something like cfl_frac * (hin_max(n+1) - hin_max(n)), where cfl_frac is a parameter in the range (0, 1).

To be safe, without unnecessarily limiting the time step, you might choose cfl_frac of ~0.1 to 0.5. Say cfl_frac = 0.1. Then if the category boundaries are 1 m apart, the new values of hicen in adjacent categories must differ by at least 0.1 m.
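A minimal sketch of such a check (hicen here means hicen_init + dhicen as above; cfl_frac, dhcat_min and the nu_diag write are placeholders rather than existing Icepack variables):

     ! sketch of a CFL-like check on post-thermodynamics thicknesses
     ! cfl_frac is a tunable parameter in (0,1); names are illustrative only
     do n = 1, ncat-1
        dhcat_min = cfl_frac * (hin_max(n+1) - hin_max(n))
        if (hicen(n+1) - hicen(n) <= dhcat_min) then
           ! thickness trajectories have crossed or nearly crossed:
           ! flag the violation and reduce the thermodynamic time step
           write(nu_diag,*) 'ITD CFL-like violation between categories', n, n+1
        endif
     enddo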

When this stability criterion fails, I don't think there is a good solution other than to reduce the time step.

I should have put such a check in the code a long time ago. Sorry about that.

proteanplanet commented 3 years ago

Thanks @whlipscomb. That also means that the non-convergence with increasing ncat comes into play: the larger ncat, the smaller the required timestep. I appreciate this is stating the obvious, but it doesn't hurt to think out loud on some occasions.

proteanplanet commented 3 years ago

@dabail10, @eclare108213, @whlipscomb: Based on this discussion, I'm going to rewrite the chunk of code at the head of this issue, and post it here this week for consideration and possible testing.

whlipscomb commented 3 years ago

Sounds good. I'm entering this discussion late, but I'm curious as to why the code doesn't converge with increasing ncat. I would expect that it would converge, if the category boundaries are defined sensibly and the algorithms are working as intended, without large changes in behavior associated with small changes in ice thickness.

That's a big 'if', though. Do you suspect that the ITD redistribution algorithm is behaving badly (and in a non-converging way) because there are CFL violations in thickness space when ncat is large? And maybe those CFL violations have gone undetected until now because the code neither crashed nor diagnosed the violations?

dabail10 commented 3 years ago

This started because the NOAA UFS test suite turned up a case of divide by zero where hicen_init(n+1) - hicen_init(n) is going to zero. This can only happen when hicen_init(n+1) ~= hicen_init(n) ~= hin_max(n). That is, they are both very close to the category boundary.

whlipscomb commented 3 years ago

I was thinking that if there are cases where hicen_init(n+1) ~= hicen_init(n) for a particular test, then there might also be cases where hicen_init(n+1) < hicen_init(n). This would be an algorithmic error (i.e., a bug), because the denominator would be negative. Such an error might be missed in testing because there's no logical trap for it. So we would need some additional logic.

I'm not sure I understand what you're saying about both values being close to hin_max(n), because hin_max(n) doesn't appear in the denominator.

proteanplanet commented 3 years ago

I'm entering this discussion late, but I'm curious as to why the code doesn't converge with increasing ncat. I would expect that it would converge, if the category boundaries are defined sensibly and the algorithms are working as intended, without large changes in behavior associated with small changes in ice thickness.

Only if the thermodynamic timestep is decreased sufficiently to match increasing ncat - something that I believe was not included in the convergence tests.

dabail10 commented 3 years ago

No, but the full expression has (hin_max(n) - hicen_init(n)) * slope, where slope = (dhice(n+1) - dhice(n)) / (hicen_init(n+1) - hicen_init(n)). Since this is hicen before the changes from dhice, I don't think you would have the situation where hicen_init(n+1) < hicen_init(n). I guess I was assuming that both hicen_init(n+1) and hicen_init(n) would be converging together and hence very close to hin_max(n). Perhaps I am missing some cases.

proteanplanet commented 3 years ago

@dabail10 According to my algebra, the coefficient (hin_max(n) - hicen_init(n)) does not affect stability. It is the (a1(n)/a2(n) - 1) term that matters here. I'll post what I believe is the solution in the coming days once I've finished exploring it in Mathematica.

whlipscomb commented 3 years ago

Sorry, @dabail10, I've been thinking about this in the wrong way and probably adding confusion. There is a potential CFL issue when there is a long time step with category boundaries close together. But as you say, that's not the problem here, because the denominator contains hicen_init, not hicen after the changes.

Now I'm trying to imagine how hicen(n) and hicen(n+1) could approach hin_max from opposite directions, due to ice thickening in category n and thinning in category n+1. In the case where the divzero occurs, is it possible to write out diagnostics (e.g., aicen, vicen and/or hicen for each n) in the time steps before the divzero?
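Something as simple as the following would do (illustrative only; nu_diag stands for whatever diagnostic unit is handy at that point):

     ! illustrative per-category diagnostic print ahead of the ITD remapping
     do n = 1, ncat
        write(nu_diag,*) 'n, aicen, vicen, hicen =', n, aicen(n), vicen(n), &
                          vicen(n)/max(aicen(n),puny)
     enddo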

whlipscomb commented 3 years ago

@proteanplanet, I'll be interested to see your solution.

dabail10 commented 3 years ago

I will ask the NOAA folks to help me with that. This has only come up in one of their test cases so far. So, it doesn't happen very often.

whlipscomb commented 3 years ago

It would be useful also to see snow depth and dhice for each category. I'm thinking that dhice might be less smooth than I was imagining. Ideally, all ice and snow properties in two adjacent categories would approach each other as ncat becomes large. But there's nothing in the code to guarantee that snow depths and vertical temperature profiles in two neighboring categories are similar. So dh/dt (i.e., the thermodynamic growth/melt rate) might not be a continuous function of h in the limit of large ncat. You could have three adjacent categories, say, where the sign of dh/dt changes from + to - to +. This could be bad for the ITD numerics.

I could go on speculating, but probably better to wait and see some diagnostics.

proteanplanet commented 3 years ago

I will speculate that this issue is the cause of a number of crashes elsewhere in the sea ice code that have been difficult to diagnose in RASM and E3SM. That the denominator rarely equals zero exactly has probably masked the bigger issue.

eclare108213 commented 3 years ago

Now I'm trying to imagine how hicen(n) and hicen(n+1) could approach hin_max from opposite directions, due to ice thickening in category n and thinning in category n+1.

Could this be possible if cat n is growing and cat n+1 doesn't have any ice yet, and then suddenly it does due to redistribution from cat n? Seems like that ought to happen all the time, not rarely.

Regarding lack of convergence as ncat increases, @whlipscomb this was something that I worked on years ago and couldn't figure out - you were still at LANL and providing suggestions to me then. At the time, I considered the problem to be coming from the ridging scheme rather than the general ITD redistribution/remapping. If you (or anyone) wants to explore this issue further, I can get my (hand-written!) notes from my office to share.

whlipscomb commented 3 years ago

@eclare108213, this is looking to be an interesting and dangerous problem. (Dangerous in the sense that I could spend a month on it when I have lots of other things I need to work on). But I'll add a few more thoughts:

eclare108213 commented 3 years ago

I agree with @whlipscomb that diagnostics from the NOAA case would be helpful. Since it's the only confirmed case we have at the moment, it would be nice to see how the code changes suggested by @proteanplanet affect it, and whether implementing some tests that cause aborts, as suggested by @whlipscomb, would have caught this earlier.

dabail10 commented 3 years ago

Thanks to @DeniseWorthen for adding some print statements. This is very eye-opening (the two value columns are categories n and n+1):

     271: hin_max      1  0.6445072      1.391433
     271: aicen_init   1  0.3913089E-02  0.1934461E-01
     271: vicen_init   1  0.7826178E-03  0.3868922E-02
     271: hicen_init   1  0.2000000      0.2000000
     271: dhicen       1  0.1070497E-02  0.1149972E-02
     271: denom        1  0.000000

So, we have a case where vicen_init / aicen_init is precisely the same for category n and n+1, in this case 1 and 2. They are not converging to hin_max as I had speculated. So, hopefully this helps.
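For the record, dividing the printed values confirms it: 0.7826178E-03 / 0.3913089E-02 = 0.2000000 and 0.3868922E-02 / 0.1934461E-01 = 0.2000000, identical to the displayed precision.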

whlipscomb commented 3 years ago

@dabail10, Thanks, that's not what I was expecting. The obvious question is how we wrongly get thin ice in category 2 (just 0.2 m, when the min for this category is 0.64 m). Also, I'm curious under what circumstances CICE would generate an ice thickness of exactly 0.2 m.

Would it be possible to see similar diagnostics at several points of the preceding time step? The ITD failure seems to be a symptom of an earlier bug that puts ice in the wrong category.

dabail10 commented 3 years ago

I was definitely surprised. I agree that this seems to be a problem. I was thinking that aicen/vicen could evolve in a way that somehow the hicen in two adjacent categories could become the same. I'll take a look earlier in the log files for this point.

Dave

proteanplanet commented 3 years ago

@dabail10. Is this a forward model, or are they assimilating?

dabail10 commented 3 years ago

I do not believe this run has data assimilation. However, @DeniseWorthen can answer that for sure.

dabail10 commented 3 years ago

I have confirmed there is no data assimilation, but I believe this is an initialization bug. Here is the print from the first timestep:

     271: hin_max      1  0.6445072      1.391433
     271: aicen_init   1  0.3843437E-01  0.1427963
     271: vicen_init   1  0.7686874E-02  0.3224640E-01
     271: hicen_init   1  0.2000000      0.2258210
     271: dhicen       1  0.1100163E-02  0.1114989E-02
     271: denom        1  0.2582099E-01
     271: slope        1  0.5741749E-03

So, it looks like the category values are problematic from the start. Note that I used a CFS (old NOAA forecast system) initial state from the NOAA Sea Ice Simulator (SIS1) model and did some manipulation to get it to work in CICE. It appears that the category information was not translated correctly. However, this does still point to the very remote possibility of a zero denominator. So, I think it is not worth re-designing this code for this purpose. I think we should just stick with my proposed fix making sure the denominator is at least puny. Thoughts?

proteanplanet commented 3 years ago

I definitely do want to address the issue, even though it's not the cause here. Both in RASM and E3SM there have been many occasions of undiagnosed crashes in the thermodynamics code that have been attributed to other instabilities, but never proved. As long as this remains an open issue, it is hard to rule it out in those other circumstances.

whlipscomb commented 3 years ago

I suggest adding a couple of logical tests at the start of ITD remapping, or perhaps earlier in the code:

(1) Make sure that for categories 1 to ncat, we have hicen_init (i.e., ice thickness before thermodynamic growth/melt) within the category bounds defined by hin_max. This would catch the initialization error described here.

(2) After thermodynamic melt/growth, make sure that for n = 1 to ncat-1, we have hicen(n) < hicen(n+1). This would catch potential CFL violations associated with a large number of categories close together, and a too-long time step.

I think it's sensible to change the ITD code to prevent a zero denominator, since I wouldn't categorically rule out the possibility that hicen_init(n) is very close to hicen_init(n+1). But it's also important to flag ITDs that violate the assumptions of the algorithm and might indicate problems elsewhere in the code.
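A rough sketch of check (1), assuming hin_max is dimensioned (0:ncat) with hin_max(0) = 0, and with the abort handling left schematic:

     ! sketch of check (1): verify each initial thickness lies within its category bounds
     do n = 1, ncat
        if (aicen_init(n) > puny) then
           if (hicen_init(n) < hin_max(n-1) - puny .or. &
               hicen_init(n) > hin_max(n)   + puny) then
              ! ice is in the wrong thickness category: abort (or rebin) here
              write(nu_diag,*) 'bad hicen_init: n, hicen_init =', n, hicen_init(n)
           endif
        endif
     enddo

Check (2) is essentially the ordering test sketched earlier in this thread.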

proteanplanet commented 3 years ago

@whlipscomb, @dabail10, @eclare108213: Having looked through the code, and given this further thought, I think the best approach may be to add a simple catch at the place where this problem was detected. The bigger CFL-type issue appears to be catered for by reverting to rebinning. Actually fixing this is a science project. Hopefully, the draft PR https://github.com/CICE-Consortium/Icepack/pull/337 suffices at this stage.