Deltares / Ribasim

Water resources modeling
https://ribasim.org/
MIT License

Look into BackTracking effect on new test model #1705

Closed · visr closed this issue 2 months ago

visr commented 3 months ago

#1697 increased the performance considerably and passed the tests with only slightly higher tolerances.

Surprisingly @Fati-Mon observed this behavior on a Basin running dry:

[two plots of the Basin running dry]

We need to look into what is going on here.

Huite commented 3 months ago

I guess it's possible that storage still ends up just barely negative somehow and one of the du terms blows up when it sees a negative storage?

visr commented 3 months ago

We reject any timesteps that lead to negative storage (https://github.com/Deltares/Ribasim/issues/1406). Or perhaps that rejection comes too late for this, if the blow-up is triggered during the nonlinear iterations?
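For context, a minimal sketch of how that rejection can be wired up with the generic `isoutofdomain` solver option (toy decay problem, not the actual Ribasim setup):

```julia
using OrdinaryDiffEq

# Toy problem: a single exponentially decaying storage.
f!(du, u, p, t) = (du .= -u)
prob = ODEProblem(f!, [1.0], (0.0, 10.0))

# Any step whose proposed solution contains a negative storage is rejected
# and retried with a smaller dt.
negative_storage(u, p, t) = any(s -> s < 0, u)
sol = solve(prob, QNDF(); isoutofdomain = negative_storage)
```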

Huite commented 3 months ago

Hmm good point. It shouldn't be too late: if it converges with a negative storage, it should still be rejected. Both the callback and isoutofdomain should be triggered. So in this case I guess there's really no negative storage involved.

Maybe some term that isn't smooth enough around S = 0?

Does any other water balance term jump?

visr commented 3 months ago

Yeah perhaps we are stepping completely over a reduction factor here. The inflow is 29 m³/s and the outflow 62 m³/s. The outflow should go down to the inflow due to the reduction factors, but all I see is a brief minor dip in the outflow over the Outlet (50 m³/s), and no reduced flow to the UserDemand (12 m³/s).

The complete water balance at the start of the anomaly:

| time | node_id | storage (m³) | level (m) | inflow_rate (m³/s) | outflow_rate (m³/s) | storage_rate (m³/s) | balance_error |
|---|---|---|---|---|---|---|---|
| 2023-11-22 | 3 | 5.3E+06 | 6.1E+01 | 2.9E+01 | 6.2E+01 | -3.3E+01 | 8.2E-14 |
| 2023-11-23 | 3 | 2.4E+06 | 4.1E+01 | 2.9E+01 | 6.2E+01 | -2.7E+01 | -5.9E+00 |
| 2023-11-24 | 3 | 9.5E+04 | 7.9E+00 | 2.9E+01 | 6.2E+01 | 1.1E+04 | -1.1E+04 |
| 2023-11-25 | 3 | 9.4E+08 | 8.2E+02 | 2.9E+01 | 6.2E+01 | 3.8E+04 | -3.8E+04 |
| 2023-11-26 | 3 | 4.2E+09 | 1.7E+03 | 2.9E+01 | 6.2E+01 | 6.5E+04 | -6.5E+04 |

The model looks like this:

[schematic of the test model]

And can be downloaded here:

crystal.zip

The results with v2024.10.0 look much better. When inflow > outflow, the storage goes up from 0; when outflow > inflow, the storage goes down to 0. When the storage is 0, the outflow never goes above the inflow, so they are the same.

[plots: results with v2024.10.0]

visr commented 3 months ago

A few more observations. QNDF, TRBDF2 and KenCarp4 all have similar issues. If you lower the tolerances you get an instability error. ImplicitEuler does seem to produce the correct results with backtracking. On the HWS model I couldn't directly eyeball large convergence issues with QNDF, though there are no empty Basins there. HWS with backtracking runs in 47s with QNDF and in 2m19s with ImplicitEuler.

Perhaps we should make backtracking opt-in for now while we explore this issue. Perhaps we can look at what is happening with backtracking in more detail, based on https://github.com/SciML/OrdinaryDiffEq.jl/pull/1792.
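A sketch of what opt-in could look like, assuming the solver is built with NLNewton and that the `relax` keyword from the linked PR accepts a LineSearches object (the flag name is hypothetical):

```julia
using OrdinaryDiffEq, LineSearches

# Hypothetical config flag: enable BackTracking relaxation only when requested.
function make_algorithm(use_backtracking::Bool)
    nlsolve = use_backtracking ? NLNewton(relax = BackTracking()) : NLNewton()
    return QNDF(nlsolve = nlsolve)
end
```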

This made me realize #1413 needs a v1.0 label, because erroring out on these issues helps to protect users.

SouthEndMusic commented 3 months ago

@visr you can also try different linesearch algorithms for relaxation (https://github.com/JuliaNLSolvers/LineSearches.jl) but some more insight would be very helpful.

Also, we've probably discussed this before, but what about incorporating the water balance error in isoutofdomain?

visr commented 3 months ago

Yeah I can do that next week. I think the water balance error is important to address (#1413). Not sure if isoutofdomain should be used for this (at least the name sounds wrong), though perhaps one of https://docs.sciml.ai/DiffEqCallbacks/stable/step_control/ makes sense.
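As a sketch, even a plain DiscreteCallback could abort the run when the per-step water balance error gets too large (`water_balance_error` is a hypothetical helper, not existing Ribasim code):

```julia
using OrdinaryDiffEq  # provides DiscreteCallback and terminate!

# Hypothetical helper: |inflow - outflow - storage_rate| over the last accepted step.
water_balance_error(integrator) = 0.0  # placeholder

too_large(u, t, integrator) = water_balance_error(integrator) > 1e-3
abort!(integrator) = (@error "water balance error too large"; terminate!(integrator))

balance_check = DiscreteCallback(too_large, abort!)
# solve(prob, QNDF(); callback = balance_check)
```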

Huite commented 3 months ago

Just to make sure I'm reading properly: the upper plots show timeseries that are totally wrong? It flies off course immediately, right?

And secondly: the same solver doesn't run into this problem when there's no relaxation?

In my understanding, the relaxation only limits the update of the Newton-Raphson step. The check for convergence should be the same -- so I don't understand how it finds such a radically different solution.

SouthEndMusic commented 3 months ago

@Huite you say limits the step and I thought so too, but I haven't seen anything in the linesearch code that indicates that the step cannot be bigger than the original Newton step

Huite commented 3 months ago

Doing some homework on the solvers:

The DiffEq docs mention:

For asymptotically large systems of ODEs (N>1000?) where f is very costly, and the complex eigenvalues are minimal (low oscillations), in that case QNDF or FBDF will be the most efficient. QNDF and FBDF will also do surprisingly well if the solution is smooth. However, this method can handle less stiffness than other methods and its Newton iterations may fail at low accuracy situations. Other choices to consider in this regime are CVODE_BDF and lsoda.

Curious whether CVODE_BDF or lsoda fail in a similar way.
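For reference, a sketch of how those alternatives could be tried, assuming the model is available as a plain in-place ODEProblem (a toy problem stands in here):

```julia
using OrdinaryDiffEq, Sundials, LSODA

f!(du, u, p, t) = (du .= -u)
prob = ODEProblem(f!, [1.0], (0.0, 10.0))

sol_cvode = solve(prob, CVODE_BDF())  # BDF from SUNDIALS
sol_lsoda = solve(prob, lsoda())      # classic LSODA
```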

Huite commented 3 months ago

@Huite you say limits the step and I thought so too, but I haven't seen anything in the linesearch code that indicates that the step cannot be bigger than the original Newton step

I think it's this, in: https://github.com/JuliaNLSolvers/LineSearches.jl/blob/3259cd240144b96a5a3a309ea96dfb19181058b2/src/backtracking.jl#L22

α is the size of the update, default size is 1:

α_0::Tα = real(T)(1)

α = 1 is a standard Newton step.

Then it's updated:

α_0 = min(α_0, min(alphamax, ls.maxstep / norm(s, Inf)))

So α_0 shouldn't exceed 1.0 (unless you fiddle with it?).

Then in the actual function, it's doing interpolation via either a quadratic or a cubic and finding the zero for it. I think these lines mean it's limited:


α_1, α_2 = αinitial, αinitial

# interpolation
# ...

α_tmp = NaNMath.min(α_tmp, α_2*ρ_hi) # avoid too small reductions
α_2 = NaNMath.max(α_tmp, α_2*ρ_lo) # avoid too big reductions

Defined above:

    ρ_hi::TF = 0.5
    ρ_lo::TF = 0.1

So after the first iteration 0.1 <= α_2 <= 0.5.

Afterwards, it should only get smaller. (Unless something's changing defaults.)
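A toy illustration of those two lines (not the library code itself): whatever value the interpolation proposes, the accepted α is clamped into [ρ_lo·α, ρ_hi·α] of the previous value, so it can only shrink:

```julia
function backtrack_demo(proposals; ρ_hi = 0.5, ρ_lo = 0.1)
    α = 1.0                              # full Newton step
    for α_proposed in proposals
        # equivalent to the min/max pair quoted above
        α = clamp(α_proposed, ρ_lo * α, ρ_hi * α)
        println("α = ", α)
    end
    return α
end

backtrack_demo((0.9, 0.001, 0.3))        # prints 0.5, then 0.05, then 0.025
```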

Huite commented 3 months ago

To get a slightly better feeling for these methods: https://www.stochasticlifestyle.com/differences-between-methods-for-solving-stiff-odes/

SouthEndMusic commented 3 months ago

TL;DR: I still think we should give this a try.

The outflow should go down to the inflow due to the reduction factors

@visr that is not how it is implemented now, but that is what is suggested here. Currently the outflows of a basin don't depend on its inflows (only on the storage), but to me it makes sense that they would. Note that this adds new non-zeros to the Jacobian, which is much easier to experiment with now that https://github.com/Deltares/Ribasim/pull/1606 is in place (at least we can test with dense Jacobians).

Additionally, with the current implementation it is much more finicky to find a steady-state solution for depleted basins with in- and outflow (the solver has to find exactly the storage where the reduction factor makes the outflow match the inflow; in hindsight, that's probably what you meant, Martijn) than with the aforementioned suggestion.
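One possible reading of that suggestion, as a sketch with hypothetical names (not the current Ribasim implementation): blend the storage-limited outflow towards the inflow as the basin empties, so a depleted basin has an obvious steady state at outflow = inflow:

```julia
# ϕ is a storage-based reduction factor in [0, 1]; all names are hypothetical.
function capped_outflow(outflow_demand, inflow, storage; threshold = 10.0)
    ϕ = clamp(storage / threshold, 0.0, 1.0)
    return ϕ * outflow_demand + (1 - ϕ) * min(outflow_demand, inflow)
end
```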

I played around a bit with writing a custom line search algorithm, from which I learned that the solver is happy to work with negative storages as intermediate results in the non-linear solver.

I also learned (with the help of claude.ai) that QNDF does a predictor step based on previous results, which I can imagine can be very far off after an abrupt change in dynamics due to the reduction factor.

Huite commented 3 months ago

Currently the outflows of a basin don't depend on its inflows (only on the storage), but to me it makes sense that they would. Note that this adds new non-zeros to the Jacobian, which is much easier to experiment with now that https://github.com/Deltares/Ribasim/pull/1606 is in place (at least we can test with dense Jacobians).

But the storage does depend on the inflows, and the outflows depend on the storage. So in the (linearized) system of equations the outflows clearly do depend on the inflows.

The solver is happy to work with negative storages as intermediate results in the non-linear solver

So the result of a Newton iteration might contain negative storages?

Does this mean it's calling the rhs function (water_balance!) with negative storages? How many of our du/dt terms are robust for negative storage? Now that I think of it, it might be smart to have unit tests in place for each du/dt term so that we know they don't accidentally explode. I reckon every potential outflow should return 0 for negative storage (?).

it is much more finicky to find a steady-state solution for depleted basins with in- and outflow (the solver has to find exactly the storage where the reduction factor makes the outflow match the inflow; in hindsight, that's probably what you meant, Martijn)

I don't follow why it is more finicky. Isn't a basin always above depletion if there's still outflow? The outflow is distributed across different terms, each very small. The linearized problem provides enough information that the linear solver can find such a point, although it may result in negative storage. If the next rhs evaluation results in setting all outflows to 0, the next iteration should result in a small positive storage again.

Apparently the solver will dip into negative storage, re-formulate, and then converge with (wrong) positive storages -- otherwise the solution would be rejected. Shouldn't that go away if the outflow becomes zero with negative storage, then?

evetion commented 3 months ago

Does this mean it's calling the rhs function (water_balance!) with negative storages? How many of our du/dt terms are robust for negative storage? Now that I think of it, it might be smart to have unit tests in place for each du/dt term so that we know they don't accidentally explode. I reckon every potential outflow should return 0 for negative storage (?).

Might be relevant, from the docstring of the PositiveDomain callback:

Please note, that the system should be defined also outside the positive domain, since even with these approaches, negative variables might occur during the calculations. Moreover, one should follow Shampine's et al. advice and set the derivative $x_i'$ of a negative component $x_i$ to $\max\{0, f_i(x, t)\}$, where $t$ denotes the current time point with state vector $x$ and $f_i$ is the $i$-th component of function $f$ in an ODE system $x' = f(x, t)$.
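Translated to our setting, that rule would be a thin wrapper around the rhs. A minimal sketch, assuming all states are storages (the real water_balance! signature and state layout may differ):

```julia
function clamped_water_balance!(du, u, p, t)
    water_balance!(du, u, p, t)      # existing rhs evaluation
    for i in eachindex(u)
        if u[i] < 0
            du[i] = max(0.0, du[i])  # x_i' = max(0, f_i(x, t)) for negative components
        end
    end
    return nothing
end
```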

Huite commented 3 months ago

Good point, I remember reading this. The Shampine paper is this one:

https://www.sciencedirect.com/science/article/abs/pii/S0096300304009683

From the PDF: [screenshot of the formula from the paper]

My first suggestion now would be to add some unit tests to enforce du=0 for S <= 0, and potentially add a sigmoid to sidestep resulting discontinuities.
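A minimal sketch of such a test, with a hypothetical smooth reduction factor and outflow term standing in for the real ones:

```julia
using Test

# Cubic, sigmoid-like reduction factor: 0 for s <= 0, 1 for s >= threshold.
reduction_factor(s, threshold) = (x = clamp(s / threshold, 0.0, 1.0); x^2 * (3 - 2x))

# Hypothetical outflow term: demand scaled by the reduction factor.
outflow(s, demand; threshold = 10.0) = demand * reduction_factor(s, threshold)

@testset "outflow terms vanish for non-positive storage" begin
    for s in (-1.0e6, -1.0, 0.0)
        @test outflow(s, 62.0) == 0.0
    end
end
```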

visr commented 3 months ago

Ha, we had implemented that advice but removed it when we started using isoutofdomain, see https://github.com/Deltares/Ribasim/pull/639/files. I'll try adding that back and see the effect.

EDIT: For the test model I see negative storage, but not with a negative du, so it had no effect.

Huite commented 3 months ago

Uh, I think setting the total du to zero might not be the best solution. I'm not familiar enough with the fancier solvers which track several previous iterations, but if you're using ImplicitEuler, it bases du only on the last iteration. In that case, if u < 0, then du for those states will be set to zero as well, and every iteration afterwards will also set du = 0. Only in the next time step will it reformulate and become dynamic again.

Instead, I think you should allow positive individual terms (inflows, e.g. precipitation), but truncate individual negative terms. That way, if you end up below zero it can still go back into the positive.

EDIT: pretty sure this is what MODFLOW 6 does for the NPF (internodal flows) Newton formulation too.
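A sketch of that distinction (hypothetical helper names): only the negative (outflow) terms are truncated for a dry basin, so positive terms such as precipitation can still refill it:

```julia
function basin_storage_rate(storage, inflow_terms, outflow_terms)
    q_in = sum(inflow_terms)                          # positive terms are always allowed
    q_out = storage <= 0 ? 0.0 : sum(outflow_terms)   # negative terms truncated when dry
    return q_in - q_out
end
```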

Huite commented 3 months ago

Other suggestions:

  1. Just try a scalar relaxation value (see the sketch after this list).

  2. Evaluate the residual and repeatedly apply a scalar value (both SWAP and MODFLOW 6 use this approach):

SWAP: [screenshot from the SWAP documentation]

MODFLOW: [screenshot from the MODFLOW 6 documentation]
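A sketch of suggestion 1: a fixed scalar relaxation factor applied to every Newton update (all helper names hypothetical, not the SWAP/MODFLOW schemes themselves):

```julia
function relaxed_newton!(u, residual!, solve_linear; ω = 0.7, maxiter = 50, tol = 1e-8)
    r = similar(u)
    for _ in 1:maxiter
        residual!(r, u)
        maximum(abs, r) < tol && return u
        Δu = solve_linear(u, r)   # Newton direction from the linearized system
        @. u += ω * Δu            # damped (relaxed) update
    end
    return u
end
```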

visr commented 2 months ago

Fixed by #1761.