CliMA / ClimateMachine.jl

Climate Machine: an Earth System Model that automatically learns from data
https://clima.github.io/ClimateMachine.jl/latest/
Other
452 stars 78 forks source link

`saturation_adjustment` convergence issues #263

Closed simonbyrne closed 3 years ago

simonbyrne commented 5 years ago

This is causing performance problems, particularly on GPU.

As suggested by @charleskawczynski, we should identify the extremal bounds for all these parameters and confirm the number of iterations required by brute force.

What are the largest possible values we are likely to see for these inputs?

trontrytel commented 5 years ago

This is quite fun. My estimates would be:

0 < ρ < 1.6 kg/m3 0 < q_tot < 0.05 kg/kg -6.5e4 < e_int < 4.5e4 m2/s2

For density to be high it has to be very cold and at low elevation. Taking Verkhoyansk as an example we get the -67.6 C temperature at elevation of only 127 m. That gives an estimated density ~1.52 (see here)

I couldn't find such an obvious record candidate for q_tot. Somewhere near the equator where it's, well, hot and humid. Assuming a brutal case of RH=95 %, p=1000 hPa and T=40 C gives q_tot = 0.05.

Internal energy (eq. 17 in our design docs) in both extreme cases should be dominated by the temperature difference term. In both extreme cases I'm assuming q_t = 0. The coldest temperature on record comes from the Vostok Station in Antarctica and is -89.2C. The hottest measurement comes from California Death Valley Furnace Creek and is 56.7 C. Plugging those two values in we get -64000 m2/s2 and 41000m2/s2. I rounded them up a little to be safe.

ρ should never be equal 0. The actual limit depends on the maximum height we want our model to reach. Following the standard atmosphere profile, at Stratopause (~47km) the air density is 2e-3 kg/m3.

Those 3 parameters are not independent of each other. That could be a problem. It might be better to define at least one of them as a function of the other two?

szy21 commented 5 years ago

I couldn't find such an obvious record candidate for q_tot. Somewhere near the equator where it's, well, hot and humid. Assuming a brutal case of RH=95 %, p=1000 hPa and T=40 C gives q_tot = 0.05.

I would suggest using a higher upper bound for q_tot, if we would like to conduct climate change simulations (e.g., SST+4K, 4xCO2, or even more extreme cases).

trontrytel commented 5 years ago

Whats your experience from your climate change runs? How far we need to go?

szy21 commented 5 years ago

Whats your experience from your climate change runs? How far we need to go?

It really depends on how far we want to push the model (i.e, whether we want to do simulations like SST+10K or 16xCO2). I can check and provide some numbers.

tapios commented 5 years ago

We can get good initial values for the iterations by using the temperature from the previous timestep and the linearized saturation adjustment solution. (Requires linearizing, but is straightforward.)

charleskawczynski commented 5 years ago

Ok, preliminary results on these ranges:

  for q_tot in range(0, stop=10^(-1), step=10^(-2))
  for ρ in range(1e-3, stop=1.6, step=10^(-1))
  for e_int in range(-1e5, stop=1e5, step=10^(3))

Yield:

23629 Success (converged saturation adjustment with 10 iterations) 3055 Fail (non-converged saturation adjustment) 88.55% success rate

24613 Success (converged saturation adjustment with 100 iterations) 2071 Fail (non-converged saturation adjustment) 92.23% success rate

I haven't tried plotting this yet but, by inspection, the failed cases don't appear to fall into any pattern-like cases.

charleskawczynski commented 5 years ago

This is causing performance problems, particularly on GPU.

Are we sure that this is a performance issue related to saturation_adjustment? Right now, maxiters=10, which doesn't seem too large.

jkozdon commented 5 years ago

Are we sure that this is a performance issue related to saturation_adjustment?

It's unclear what the issue is. Something seems to be going awry in the moist physics stuff on the GPU. @lcw and I are looking at this now...

charleskawczynski commented 5 years ago

I made some type-restrictions in #341 and #345. It might be nice to re-test this after its merged, since MoistThermo may have had some type-unstable behavior.

tapios commented 5 years ago

Is convergence still an issue? I had on my to-do list (a) to get rigorous and tighter bounds for the nonlinear solve, (b) work out a good linearized version. Would that still be helpful?

charleskawczynski commented 5 years ago

I'm not sure if there's still a performance issue, @lcw @jkozdon any updates? But maybe we should leave the issue until the ranges of inputs mentioned above are further analyzed. I think the tighter bounds and linearized version may be helpful (especially if it can improve the success rate of convergence in the range we're interested in).

charleskawczynski commented 4 years ago

I've been reviewing this task a bit this morning, and I believe the best way forward is to change variables from internal energy to temperature to define our bounds. We already impose non-linear behavior on temperature for the unsaturated case / initial guess:

T_1 = max(FT(T_min), air_temperature(e_int, PhasePartition(q_tot))) # Assume all vapor

If we know that we don't want temperatures to be lower than T_min, we might as well use this, and not have two conditions which sometimes overlap. We can determine internal energy from internal_energy and assume all ice:

e_int = internal_energy(T, PhasePartition(q_tot,FT(0),q_tot))

Doing this seems to improve the convergence region, which I'm testing in #551, from 93.2% to 98.7% for

  q_tot_range = range(0, stop=10^(-1), length=10);
  ρ_ref_range = range(1e-3, stop=1.6, length=10);
  T_range     = range(T_min, stop=T_max, length=10);

And we still hit

min(e_int_range...) = -88368.04185088024
max(e_int_range...) = 588637.2702907145
tapios commented 4 years ago

@charleskawczynski Yes, this makes sense. Are these convergence tests with Newton iterations or secant method? If the former, I suspect convergence issues occur near the freezing point, where temperatures may oscillate slightly (same may be true for secant method). We could improve this by taking the discontinuity in phase partitioning at the freezing temperature into account. However, just limiting the number of iterations will probably be fine for practical purposes.

tapios commented 4 years ago

Maybe a better way to summarize error statistics is to compute the temperature error \Delta T after saturation adjustment and summarize rms(\Delta T), max(|\Delta T|) etc., and plots \Delta T vs T. If the errors are near freezing, the temperature errors may be small, just not quite small enough to satisfy the convergence criteria.

charleskawczynski commented 4 years ago

Are these convergence tests with Newton iterations or secant method?

Secant method. I've also added the same tests with Newton method using analytic gradients.

I also tried the auto-diff, but it doesn't seem to like one of the functions that takes PhasePartition:

internal_energy_sat(::ForwardDiff.Dual{ForwardDiff.Tag{getfield(CLIMA.MoistThermodynamics, Symbol("##33#34")){Float64,Float64,Float64},Float64},Float64,1}, ::Float64, ::Float64)

cc @simonbyrne

I've generated some plots in this branch, which you can generate using include("test/Common/MoistThermodynamics/thermo_state_iter_sensitivity.jl"), to compare the convergence between secant and Newton method.

Overview

#------------------------------- Pass/fail rate for PhaseEquil Newton Method
min(e_int_range...) = -88368.04185088024
max(e_int_range...) = 588637.2702907145
length(SOL_max_iter) = 1000 # number of thermodynamic states initialized
sum(getproperty.(SOL_max_iter, :converged)) = 945
sum(getproperty.(SOL_max_iter, :converged)) / length(SOL_max_iter) = 0.945
1 - sum(getproperty.(SOL_max_iter, :converged)) / length(SOL_max_iter) = 0.05500000000000005
#-------------------------------
#------------------------------- Pass/fail rate for PhaseEquil Secant Method
min(e_int_range...) = -88368.04185088024
max(e_int_range...) = 588637.2702907145
length(SOL_max_iter) = 1000 # number of thermodynamic states initialized
sum(getproperty.(SOL_max_iter, :converged)) = 990
sum(getproperty.(SOL_max_iter, :converged)) / length(SOL_max_iter) = 0.99
1 - sum(getproperty.(SOL_max_iter, :converged)) / length(SOL_max_iter) = 0.010000000000000009
#-------------------------------

Error history of converged cases

Newton method converged_err_histories_PhaseEquil_NewtonMethod

Secant method converged_err_histories_PhaseEquil

Error history of non-converged cases

Newton method non_converged_err_histories_PhaseEquil_NewtonMethod Secant method non_converged_err_histories_PhaseEquil

Comments

It seems that Secant method requires fewer iterations to converge, but the convergence in Newtons method seems more smooth for more of the non-converging cases. These results used 40 iterations

These are the input arguments to the thermo constructor that resulted in non-converged cases (args: e_int, q_tot, ρ) for Secant method (since there are fewer):

PhaseEquil{Float64}(33675.86219242131 , 0.05555555555555555, 0.001              )
PhaseEquil{Float64}(30978.859892118286, 0.06666666666666667, 0.001              )
PhaseEquil{Float64}(28281.857591815293, 0.07777777777777778, 0.001              )
PhaseEquil{Float64}(30978.859892118286, 0.06666666666666667, 0.35633333333333334)
PhaseEquil{Float64}(28281.857591815293, 0.07777777777777778, 0.35633333333333334)
PhaseEquil{Float64}(25584.85529151227 , 0.08888888888888889, 0.35633333333333334)
PhaseEquil{Float64}(22887.852991209307, 0.1                , 0.35633333333333334)
PhaseEquil{Float64}(106201.67235229546, 0.07777777777777778, 0.17866666666666667)
PhaseEquil{Float64}(104955.43533600273, 0.08888888888888889, 0.17866666666666667)
PhaseEquil{Float64}(106201.67235229546, 0.07777777777777778, 0.35633333333333334)
trontrytel commented 4 years ago

Having such low values of air density combined with such high values of q_tot is very unlikely(?). Maybe we should constrain better the range of initial values we want to ensure convergence for? We could test for values drawn around some assumed generic profiles with height of density, moisture and energy. This way for low densities we would have only low values of q_tot?

charleskawczynski commented 4 years ago

Having such low values of air density combined with such high values of q_tot is very unlikely(?). Maybe we should constrain better the range of initial values we want to ensure convergence for? We could test for values drawn around some assumed generic profiles with height of density, moisture and energy. This way for low densities we would have only low values of q_tot?

I like this idea as it would test the thermodynamic states for more common states, however, the manifold over which we want to guarantee convergence is not trivial. We could write a function that defines this manifold, and then offer it to users to determine if a particular set of arguments is guaranteed to result in a converged thermo state or not (this could also be used as a gate-keeper to initializing thermo states). That said, it would be much more simple and mathematically satisfying to document

Saturation adjustment is guaranteed to converge in
... < e_int < ...
... < ρ < ...
... < q_tot < ...

cc @tapios

tapios commented 4 years ago

Whenever you have more than ~5 iterations with either method, something is wrong. Here's what I'd do to construct a better test space:

charleskawczynski commented 4 years ago

This procedure has been implemented in #581

charleskawczynski commented 4 years ago

We need to extend the thermo-state input test values for the PhaseEquil constructor, which do not converge in 3 iterations when running #627 with 3 iterations. To find more than just 1 set of values, run #627 with 3 iterations and, when it fails, print the values and catch with 5 iterations, which should converge fine.

tapios commented 4 years ago

We need to extend the thermo-state input test values for the PhaseEquil constructor, which do not converge in 3 iterations when running #627 with 3 iterations. To find more than just 1 set of values, run #627 with 3 iterations and, when it fails, print the values and catch with 5 iterations, which should converge fine.

More importantly, we need to find a way to improve the error handling. I suspect the convergence issues arise near numerical instabilities. It would be better to have a way for saturation adjustment to continue to run when maxiter is reached without having achieved the desired convergence criterion but give a warning. For performance reasons, I'd like saturation adjustment to work even with only one iteration, while giving a warning when a given convergence criterion is not achieved.

charleskawczynski commented 4 years ago

More importantly, we need to find a way to improve the error handling.

Agreed.

For performance reasons, I'd like saturation adjustment to work even with only one iteration, while giving a warning when a given convergence criterion is not achieved.

My hope would be that we throw an error when convergence is not achieved, but never need to. All this means is that we need to improve the performance and robustness of the saturation adjustment routine, and I think we can do this. That said, collecting inputs that are fail requires a warning, not an error. So, we really should figure out how to fix #393, or improve the CPU implementation.

charleskawczynski commented 4 years ago

I was able to collect all of the breaking inputs by directly @cuprintlning them in the DYCOMS driver. I opened #642, which adds easily update-able data to the moist thermo test suite. Once #642 is merged, we can systematically experiment with improving saturation_adjustment's convergence as new "bad inputs" come in.

In addition, I opened #104 and, if fixed, we'll be able to print warnings when saturation adjustment fails to converge in a desired number of iterations. Together, we should be able to continuously print new bad inputs, append to the dataset, and continuously iron out convergence problems.

smarras79 commented 4 years ago

such a cool plot!

On Thu, Feb 27, 2020 at 5:26 PM Charles Kawczynski notifications@github.com wrote:

[image: Screen Shot 2019-06-25 at 9 32 32 AM] https://user-images.githubusercontent.com/1880641/75492666-1cefb400-596d-11ea-8eb8-6404b4a250ac.png

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/climate-machine/CLIMA/issues/263?email_source=notifications&email_token=ALBNJHKRPTTNUGFJQJX5QULRFA4ZLA5CNFSM4HVZOVYKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOENGGN5I#issuecomment-592209653, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALBNJHOKDT556NWWO5HM2GLRFA4ZLANCNFSM4HVZOVYA .

-- Please, forgive the typos , I am writing from my phone!

szy21 commented 4 years ago

Cool plot! Are these data points from DYCOMS? Why would we have e_int around -50000?

charleskawczynski commented 4 years ago

cool plot!

Thanks! I think the points are from the ranges originally suggested. Based on Tapio's suggested procedure, I think we can re-make this plot with translucent iso-contours showing the manifold of values are being tested in saturation adjustment, and still include which points are non-converging.

tapios commented 4 years ago

Looks great. Could you plot temperature in place of e_int? It'll be easier to interpret, and it'll be clearer where the freezing point is.

@szy21: Internal energy is defined relative to the triple point. So you get negative values below the triple point.

charleskawczynski commented 4 years ago

Looks great. Could you plot temperature in place of e_int?

Yes, I can do this on the next iteration.

simonbyrne commented 4 years ago

Looking at the plots of internal_energy_sat vs T, it is clear there are a couple of special features:

  1. If q_vap_saturation < q_t, then it is linear, otherwise it is a monotonic convex function:

    Screen Shot 2020-04-28 at 4 40 03 PM
  2. If this change in regimes occurs when T > T_freeze, then there is also a discontinuity at T_freeze:

    Screen Shot 2020-04-28 at 4 39 57 PM

1017 deals with 2, but it would be good to have a more general approach. An interesting point is that the form of saturation_vapor_pressure means that solving the convex part appears to reduce to evaluating a Lambert W function

tapios commented 4 years ago

Yes, the linear case in 1) is caught by the initial estimate before the nonlinear saturation adjustment is called. (It checks after the linear temperature calculation whether q_t > q_vap_saturation is satisfied.)

Whatever issues we have with saturation adjustment arise near the phase transition at freezing, where the gradient is singular.

charleskawczynski commented 3 years ago

We're now monitoring, via our documentation page, the convergence and non-convergence for a comprehensive input space.

tapios commented 3 years ago

@charleskawczynski Can we close this issue? It has a lot of obsolete and irrelevant points which at this point may be more confusing than helpful.

charleskawczynski commented 3 years ago

Yeah, I think that's probably fine. We have other related but more narrowly defined issues (e.g., #1831).