dabail10 closed this issue 3 years ago.
Some more context here. This can only happen when hicen_init(n+1) ~= hicen_init(n) ~= hin_max(n), so we are running into a 0 / 0 situation. My proposed fix would be as follows:

```
denom = max(puny, hicen_init(n+1) - hicen_init(n))
slope = (dhicen(n+1) - dhicen(n)) / denom
```
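A minimal Python sketch of the failure mode and the proposed guard (illustrative only; `puny` is Icepack's "effectively zero" threshold, and the argument values are taken from the diagnostics later in this thread):

```python
puny = 1.0e-11  # Icepack's "effectively zero" threshold

def slope_guarded(dhicen_n, dhicen_np1, hicen_init_n, hicen_init_np1):
    """Slope of dhicen across a category boundary, guarded against 0/0."""
    denom = max(puny, hicen_init_np1 - hicen_init_n)
    return (dhicen_np1 - dhicen_n) / denom

# Degenerate case reported in this issue: hicen_init identical in categories
# n and n+1, so the unguarded denominator is exactly zero.
s = slope_guarded(0.1070497e-02, 0.1149972e-02, 0.2, 0.2)  # finite, no divide by zero
```

Without the `max(puny, ...)` guard, the last call would divide by exactly zero.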
thoughts? @eclare108213 @proteanplanet
I agree this problem should be mitigated, unlikely as it is to occur. However, the solution suggested above is accuracy-degrading, so an alternative needs to be considered.
Can you suggest an alternative that is not accuracy-degrading? We use puny as the criterion for something being "zero" everywhere else.
This is related to https://github.com/CICE-Consortium/CICE/issues/502.
I'm okay with using puny, since this is a very rare situation, but I agree with @proteanplanet that there probably is a more elegant way to fix it by changing the structure of the logic.
I was up late last night working on a solution, and I'm close to posting it here.
I have a page of math where I was trying to figure out the limit behaviour as we approach 0 / 0. I even did a calculation in NCL with some typical numbers. For example in the case:
```
hin_max     = fspan(1,5,5)
hbnew1      = hin_max * 0.0
hbnew2      = hin_max * 0.0
dhice1      = (/0.0,0.1,0.2,0.3,0.4/)
dhice2      = dhice1
hicen_init1 = (/0.5,1.5,2.5,3.5,4.5/)
hicen_init2 = (/0.5,1.9999999,2.0000001,3.5,4.5/)
```
With hicen_init1, the slope for category two is (0.2-0.1)/(2.5-1.5) or 0.1, so hbnew is 2 + 0.1 + 0.1x0.5 or 2.15. For category three the slope is (0.3-0.2)/(3.5-2.5) or 0.1 and hbnew is 3 + 0.2 + 0.1x0.5 or 3.25.
With hicen_init2, we get approximately 2.15 and 3.2666667 for hbnew, i.e. 2 + 0.1 + 0.1x(0.0000001/0.0000002) and 3 + 0.2 + 0.1x(0.9999999/1.4999999). The answer depends on the ratio in which the two thicknesses approach the boundary, not on how close together they are, so the 0 / 0 limit is not well defined.
So, that was why I ended up just using puny. I just couldn't figure out what the formula in the limit of 0 / 0 would be. Hopefully Andrew has better luck.
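For reference, the NCL experiment above can be redone as a short Python sketch, assuming the boundary formula hbnew(n) = hin_max(n) + dhicen(n) + slope x (hin_max(n) - hicen_init(n)) that the arithmetic above implies:

```python
# Illustrative Python version of the NCL experiment (assumed boundary formula).
hin_max = [1.0, 2.0, 3.0, 4.0, 5.0]   # fspan(1,5,5)
dhice   = [0.0, 0.1, 0.2, 0.3, 0.4]

def hbnew(hicen_init, n):
    """New position of category boundary n (0-based lower-category index)."""
    slope = (dhice[n+1] - dhice[n]) / (hicen_init[n+1] - hicen_init[n])
    return hin_max[n] + dhice[n] + slope * (hin_max[n] - hicen_init[n])

smooth     = [0.5, 1.5, 2.5, 3.5, 4.5]               # hicen_init1
degenerate = [0.5, 1.9999999, 2.0000001, 3.5, 4.5]   # hicen_init2

print(hbnew(smooth, 1), hbnew(smooth, 2))            # ≈ 2.15, 3.25
print(hbnew(degenerate, 1), hbnew(degenerate, 2))    # ≈ 2.15, ≈ 3.2666667
```

Note that the third-category boundary shifts from 3.25 to about 3.2667 even though hicen_init changed only in the seventh decimal place, which is the sensitivity being discussed.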
The formatting in your previous comment is a little messed up, @dabail10 but nevertheless, this may be part of the reason (or just a symptom) of our not being able to get the results to converge with increasing numbers of thickness categories. This is a harder problem to solve -- I worked on it a few years back (that's what the asymptotic itd option was for), and didn't figure it out. I'll alert @whlipscomb who might have a suggestion.
My mind is a bit messed up too ... :) I changed the asterisk symbols to x and I believe it looks better.
Part of the problem is that the calculation is being done as a mean thickness. My solution does not make the conversion and therefore deals purely in v and a, rather than h.
I believe the stability criterion is:

```
aicen(n+1) * aicen(n) * ( vicen_init(n+1)*aicen_init(n) - vicen_init(n)*aicen_init(n+1) ) > puny
```
The term inside the first conditional can be rewritten as:

```
a1(n) = aicen_init(n+1) * aicen_init(n) * ( vicen(n+1)*aicen(n) - vicen(n)*aicen(n+1) )
a2(n) = aicen(n+1) * aicen(n) * ( vicen_init(n+1)*aicen_init(n) - vicen_init(n)*aicen_init(n+1) )

hbnew(n) = hicen(n) + (a1(n)/a2(n)) * (hin_max(n) - hicen_init(n))
```
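A quick numerical check (Python sketch with made-up values; variable names mirror the Fortran) that this a1/a2 form agrees with the original slope-based boundary formula whenever all areas are nonzero:

```python
def hbnew_slope(hin_max_n, a_i, v_i, a, v, n):
    """Original form: hbnew = hin_max + dhicen + slope*(hin_max - hicen_init)."""
    h_i = [vv / aa for vv, aa in zip(v_i, a_i)]   # hicen_init = vicen_init/aicen_init
    h   = [vv / aa for vv, aa in zip(v, a)]       # hicen = vicen/aicen
    dh  = [x - y for x, y in zip(h, h_i)]         # dhicen
    slope = (dh[n+1] - dh[n]) / (h_i[n+1] - h_i[n])
    return hin_max_n + dh[n] + slope * (hin_max_n - h_i[n])

def hbnew_av(hin_max_n, a_i, v_i, a, v, n):
    """Reformulated version working purely in areas and volumes."""
    h_i = [vv / aa for vv, aa in zip(v_i, a_i)]
    h   = [vv / aa for vv, aa in zip(v, a)]
    a1 = a_i[n+1] * a_i[n] * (v[n+1] * a[n] - v[n] * a[n+1])
    a2 = a[n+1] * a[n] * (v_i[n+1] * a_i[n] - v_i[n] * a_i[n+1])
    return h[n] + (a1 / a2) * (hin_max_n - h_i[n])

# Made-up values: hicen_init = [1.5, 2.5], hicen = [1.6, 2.65], boundary at 2.0
a_i, v_i = [0.2, 0.3], [0.3, 0.75]
a,   v   = [0.25, 0.28], [0.4, 0.742]
b1 = hbnew_slope(2.0, a_i, v_i, a, v, 0)
b2 = hbnew_av(2.0, a_i, v_i, a, v, 0)   # the two forms agree
```

The equivalence follows because a1/a2 = (hicen(n+1) - hicen(n)) / (hicen_init(n+1) - hicen_init(n)) = 1 + slope, which is also why it is the (a1/a2 - 1) term that controls stability.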
Therefore there is a timestep criterion that is currently not being accounted for, since stability depends both on the initial areas and the subsequent areas. I would also not be surprised if this contributes to non-convergence with increasing ncat, since

```
aicen(n+1)*aicen(n)*aicen_init(n)
```

and

```
aicen(n+1)*aicen(n)*aicen_init(n+1)
```

decrease with increasing ncat.
This is the basis, I believe, for reformulating the numerics.
I have to mull this over, but I'm not sure it solves the problem. What happens as a2(n) approaches zero? What should hbnew(n) be when a2(n) < puny?
> I have to mull this over, but I'm not sure it solves the problem. What happens as a2(n) approaches zero? What should hbnew(n) be when a2(n) < puny?
I'm thinking on that at the moment. However, the most disturbing part is the timestep dependence that we have been neglecting. I suspect there is a CFL-like criterion that has been missed.
Yes, there is a CFL-like criterion here. To make up some numbers: Say we have category boundaries at 1 m , 2 m, 3 m, etc., and initially the ice thickness lies at the midpoint of each category: 0.5 m, 1.5 m, 2.5 m, etc. Now suppose we take a long time step, say 6 months from September to March, and the new thicknesses in categories 1 and 2 are 1.9 m and 1.8 m, respectively, so that now the category 1 ice is thicker than the category 2 ice. Then the algorithm will no longer work. This is like taking an excessive time step and having trajectories cross in a 1D advection problem.
Maybe the easiest way to state the stability criterion is that for each n up to ncat-1, we have hicen(n+1) - hicen(n) > dhcat_min, where hicen(n) = hicen_init(n) + dhice(n), and dhcat_min is a CFL-related threshold that we choose. You might be able to get away with dhcat_min = puny, but it would be safer to choose something like cfl_frac * (hin_max(n+1) - hin_max(n)), where cfl_frac is a parameter in the range (0, 1).
To be safe, without unnecessarily limiting the time step, you might choose cfl_frac of ~0.1 to 0.5. Say cfl_frac = 0.1. Then if the category boundaries are spaced 1 m apart, the new values of hicen in adjacent categories must differ by at least 0.1 m.
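As a sketch, the check could look like the following (Python, illustrative only; the names cfl_frac and dhcat_min follow the comment above, and the numbers in the usage lines are the ones from the 6-month example):

```python
def itd_cfl_ok(hicen_init, dhice, hin_max, cfl_frac=0.1):
    """True if hicen(n+1) - hicen(n) > dhcat_min for every adjacent pair.

    hin_max holds the upper thickness bound of each category, so
    hin_max[n+1] - hin_max[n] is the width used to set the margin."""
    hicen = [h + d for h, d in zip(hicen_init, dhice)]
    for n in range(len(hicen) - 1):
        dhcat_min = cfl_frac * (hin_max[n+1] - hin_max[n])
        if hicen[n+1] - hicen[n] <= dhcat_min:
            return False
    return True

# The 6-month example above: category-1 ice (0.5 m -> 1.9 m) overtakes
# category-2 ice (1.5 m -> 1.8 m), so the check fails.
ok  = itd_cfl_ok([0.5, 1.5], [0.0, 0.0], [1.0, 2.0])   # True
bad = itd_cfl_ok([0.5, 1.5], [1.4, 0.3], [1.0, 2.0])   # False
```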
When this stability criterion fails, I don't think there is a good solution other than to reduce the time step.
I should have put such a check in the code a long time ago. Sorry about that.
Thanks @whlipscomb. That also means that the non-convergence with increasing ncat comes into play: the larger ncat, the smaller the required timestep. I appreciate this is stating the obvious, but it doesn't hurt to think out loud on some occasions.
@dabail10, @eclare108213, @whlipscomb: Based on this discussion, I'm going to rewrite the chunk of code at the head of this issue, and post it here this week for consideration and possible testing.
Sounds good. I'm entering this discussion late, but I'm curious as to why the code doesn't converge with increasing ncat. I would expect that it would converge, if the category boundaries are defined sensibly and the algorithms are working as intended, without large changes in behavior associated with small changes in ice thickness.
That's a big 'if', though. Do you suspect that the ITD redistribution algorithm is behaving badly (and in a non-converging way) because there are CFL violations in thickness space when ncat is large? And maybe those CFL violations have gone undetected until now because the code neither crashed nor diagnosed the violations?
This started because the NOAA UFS test suite turned up a case with divide by zero where hicen_init(n+1) - hicen_init(n) is going to zero. This can only happen when hicen_init(n+1) ~= hicen_init(n) ~= hin_max(n); that is, both are very close to the category boundary.
I was thinking that if there are cases where hicen_init(n+1) ~= hicen_init(n) for a particular test, then there might also be cases where hicen_init(n+1) < hicen_init(n). This would be an algorithmic error (i.e., a bug), because the denominator would be negative. Such an error might be missed in testing because there's no logical trap for it. So we would need some additional logic.
I'm not sure I understand what you're saying about both values being close to hin_max(n), because hin_max(n) doesn't appear in the denominator.
> I'm entering this discussion late, but I'm curious as to why the code doesn't converge with increasing ncat. I would expect that it would converge, if the category boundaries are defined sensibly and the algorithms are working as intended, without large changes in behavior associated with small changes in ice thickness.
Only if the thermodynamic timestep is decreased sufficiently to match increasing ncat - something that I believe was not included in the convergence tests.
No, but the full expression has (hin_max(n) - hicen_init(n)) * slope where slope = (dhice(n+1) - dhice(n)) / (hicen_init(n+1) - hicen_init(n)). Since this is hicen before the changes from dhice, I don't think you would have the situation where hicen_init(n+1) < hicen_init(n). I guess I was assuming that both hicen_init(n+1) and hicen_init(n) would be converging together and hence very close to hin_max(n). Perhaps I am missing some cases.
@dabail10 According to my algebra, the coefficient (hin_max(n) - hicen_init(n)) does not affect stability. It is the (a1(n)/a2(n) - 1) term that matters here. I'll post what I believe is the solution in the coming days once I've finished exploring it in Mathematica.
Sorry, @dabail10, I've been thinking about this in the wrong way and probably adding confusion. There is a potential CFL issue when there is a long time step with category boundaries close together. But as you say, that's not the problem here, because the denominator contains hicen_init, not hicen after the changes.
Now I'm trying to imagine how hicen(n) and hicen(n+1) could approach hin_max from opposite directions, due to ice thickening in category n and thinning in category n+1. In the case where the divzero occurs, is it possible to write out diagnostics (e.g., aicen, vicen and/or hicen for each n) in the time steps before the divzero?
@proteanplanet, I'll be interested to see your solution.
I will ask the NOAA folks to help me with that. This has only come up in one of their test cases so far. So, it doesn't happen very often.
It would be useful also to see snow depth and dhice for each category. I'm thinking that dhice might be less smooth than I was imagining. Ideally, all ice and snow properties in two adjacent categories would approach each other as ncat becomes large. But there's nothing in the code to guarantee that snow depths and vertical temperature profiles in two neighboring categories are similar. So dh/dt (i.e., the thermodynamic growth/melt rate) might not be a continuous function of h in the limit of large ncat. You could have three adjacent categories, say, where the sign of dh/dt changes from + to - to +. This could be bad for the ITD numerics.
I could go on speculating, but probably better to wait and see some diagnostics.
I will speculate that I suspect this issue is the cause of a number of crashes elsewhere in the sea ice code that have been difficult to diagnose in RASM and E3SM. That the denominator rarely equals zero has probably masked the bigger issue.
> Now I'm trying to imagine how hicen(n) and hicen(n+1) could approach hin_max from opposite directions, due to ice thickening in category n and thinning in category n+1.
Could this be possible if cat n is growing and cat n+1 doesn't have any ice yet, and then suddenly it does due to redistribution from cat n? Seems like that ought to happen all the time, not rarely.
Regarding lack of convergence as ncat increases, @whlipscomb this was something that I worked on years ago and couldn't figure out - you were still at LANL and providing suggestions to me then. At the time, I considered the problem to be coming from the ridging scheme rather than the general ITD redistribution/remapping. If you (or anyone) wants to explore this issue further, I can get my (hand-written!) notes from my office to share.
@eclare108213, this is looking to be an interesting and dangerous problem. (Dangerous in the sense that I could spend a month on it when I have lots of other things I need to work on). But I'll add a few more thoughts:
In general, we don't want hicen(n+1) < hicen(n) after the thermodynamics and before the ITD remapping. This could happen if there is ice growing in category n and melting in category n+1, and the time step is too long. I suggest adding logic to abort the code when this happens. I'm interested in whether that would prevent this particular issue (with hicen approaching hin_max from both sides).
If dh/dt is not a continuous function of h (e.g., because of small, quasi-random differences in snow depth between adjacent categories), then reducing dt might not fix the problem. If both category n and category n+1 are piling up ice near hin_max(n), then no matter how short the time step, the trajectories could eventually cross.
So if we reduce the time step and still get hicen(n) =~ hicen(n+1) =~ hin_max(n), a sensible thing might be to combine the two categories into one. In other words, conservatively combine their aicen, vicen, vsnon, etc. Compute the new hicen, which will be just greater or just less than hin_max(n). Put the ice in category n or n+1 depending on which side of the boundary it falls, and let the other category be empty (and then slowly refill).
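A sketch of that merge (hypothetical helper, not in Icepack; names assumed) to show the conservation bookkeeping:

```python
def merge_categories(aicen, vicen, vsnon, n, hin_max_n):
    """Conservatively combine categories n and n+1 around boundary hin_max_n.

    Area, ice volume, and snow volume are summed; the merged ice goes to
    whichever category its mean thickness v/a selects, and the other
    category is emptied (to be refilled slowly by growth/ridging)."""
    a = aicen[n] + aicen[n+1]          # total area is conserved
    v = vicen[n] + vicen[n+1]          # total ice volume is conserved
    s = vsnon[n] + vsnon[n+1]          # total snow volume is conserved
    dest = n if v / a <= hin_max_n else n + 1   # which side of the boundary
    src = n + 1 if dest == n else n
    aicen[dest], vicen[dest], vsnon[dest] = a, v, s
    aicen[src] = vicen[src] = vsnon[src] = 0.0

# Two categories colliding at the 1.0 m boundary (made-up numbers):
aicen, vicen, vsnon = [0.1, 0.1], [0.0999, 0.1001], [0.01, 0.02]
merge_categories(aicen, vicen, vsnon, 0, 1.0)
```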
I still don't like the idea that hicen could approach hin_max(n) from both sides. Other processes, such as growth and melting in other nearby categories, and ridging of ice into each category, should prevent this from happening. So we should try to figure out the details of how two categories can collapse on each other, and see if we can prevent it.
I agree with @whlipscomb that diagnostics from the NOAA case would be helpful. Since it's the only confirmed case we have at the moment, it would be nice to see how suggested code changes from @proteanplanet affect it, or even if implementing some tests that would cause aborts, as suggested by @whlipscomb, would have caught this earlier.
Thanks to @DeniseWorthen for adding some print statements. This is very eye-opening:
```
                       n              n+1
271: hin_max     1   0.6445072      1.391433
271: aicen_init  1   0.3913089E-02  0.1934461E-01
271: vicen_init  1   0.7826178E-03  0.3868922E-02
271: hicen_init  1   0.2000000      0.2000000
271: dhicen      1   0.1070497E-02  0.1149972E-02
271: denom       1   0.000000
```
So, we have a case where vicen_init / aicen_init is precisely the same for categories n and n+1, in this case 1 and 2. They are not converging to hin_max as I had speculated. So, hopefully this helps.
@dabail10, Thanks, that's not what I was expecting. The obvious question is how we wrongly get thin ice in category 2 (just 0.2 m, when the min for this category is 0.64 m). Also, I'm curious under what circumstances CICE would generate an ice thickness of exactly 0.2 m.
Would it be possible to see similar diagnostics at several points of the preceding time step? The ITD failure seems to be a symptom of an earlier bug that puts ice in the wrong category.
I was definitely surprised. I agree that this seems to be a problem. I was thinking that aicen/vicen could evolve in a way that somehow the hicen in two adjacent categories could become the same. I'll take a look in the log files earlier for this point.
Dave
@dabail10. Is this a forward model, or are they assimilating?
I do not believe this run has data assimilation. However, @DeniseWorthen can answer that for sure.
I have confirmed there is no data assimilation, but I believe this is an initialization bug. Here is the print from the first timestep:
```
                       n              n+1
271: hin_max     1   0.6445072      1.391433
271: aicen_init  1   0.3843437E-01  0.1427963
271: vicen_init  1   0.7686874E-02  0.3224640E-01
271: hicen_init  1   0.2000000      0.2258210
271: dhicen      1   0.1100163E-02  0.1114989E-02
271: denom       1   0.2582099E-01
271: slope       1   0.5741749E-03
```
So, it looks like the category values are problematic from the start. Note that I used a CFS (old NOAA forecast system) initial state from the NOAA Sea Ice Simulator (SIS1) model and did some manipulation to get it to work in CICE. It appears that the category information was not translated correctly. However, this does still point to the very remote possibility of a zero denominator. So, I think it is not worth re-designing this code for this purpose. I think we should just stick with my proposed fix making sure the denominator is at least puny. Thoughts?
I definitely do want to address the issue, even though it's not the cause here. Both in RASM and E3SM, there have been many occasions of undiagnosed crashes in the thermodynamics code that were attributed to other instabilities but never proved. Knowing this remains an issue makes it hard to rule out in those circumstances.
I suggest adding a couple of logical tests at the start of ITD remapping, or perhaps earlier in the code:
(1) Make sure that for categories 1 to ncat, we have hicen_init (i.e., ice thickness before thermodynamic growth/melt) within the category bounds defined by hin_max. This would catch the initialization error described here.
(2) After thermodynamic melt/growth, make sure that for n = 1 to ncat-1, we have hicen(n) < hicen(n+1). This would catch potential CFL violations associated with a large number of categories close together, and a too-long time step.
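A sketch of both checks together (Python, illustrative only; names assumed; here hin_max has ncat+1 entries with hin_max[0] = 0 as the lower bound of the thinnest category):

```python
def check_itd(hicen_init, hicen, hin_max):
    """Abort-style checks before ITD remapping (illustrative sketch)."""
    ncat = len(hicen_init)
    # (1) Pre-thermodynamics thickness must lie within its category bounds.
    #     This would have caught the initialization error described above.
    for n in range(ncat):
        if not (hin_max[n] <= hicen_init[n] <= hin_max[n+1]):
            raise RuntimeError(f"category {n}: hicen_init outside its bounds")
    # (2) Post-thermodynamics thicknesses must remain strictly increasing,
    #     catching CFL-type violations from a too-long time step.
    for n in range(ncat - 1):
        if hicen[n] >= hicen[n+1]:
            raise RuntimeError(f"categories {n},{n+1}: thickness ordering violated")

# With the category bounds from the NOAA diagnostics, a healthy state passes,
# while hicen_init = 0.2 m in the 0.64-1.39 m category would abort.
bounds = [0.0, 0.6445072, 1.391433]
check_itd([0.3, 1.0], [0.31, 1.01], bounds)   # passes silently
```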
I think it's sensible to change the ITD code to prevent a zero denominator, since I wouldn't categorically rule out the possibility that hicen_init(n) is very close to hicen_init(n+1). But it's also important to flag ITDs that violate the assumptions of the algorithm and might indicate problems elsewhere in the code.
@whlipscomb, @dabail10, @eclare108213: Having looked through the code, and given this further thought, I think the best approach may be to add a simple catch at the place where this problem was detected. The bigger CFL-type issue appears to be catered for by reverting to rebinning. Actually fixing this is a science project. Hopefully, the draft PR https://github.com/CICE-Consortium/Icepack/pull/337 suffices at this stage.
There is a slight chance in icepack_therm_itd.F90 that this code will result in a divide by zero in the slope calculation when hicen_init(n+1) ~= hicen_init(n). This does not arise very often, but we need to add a check before dividing.