Show convergence bottleneck nodes to users

visr commented 1 year ago

Related to #419, if a run doesn't converge or is slower than expected, we need good tools to investigate the issue.

One thing that will help is to look at the solver stats:

julia> using Ribasim
[ Info: Precompiling Ribasim [aac5e3d9-0b8f-4d4f-8241-b1a7a9632635]

julia> model = Ribasim.run("nl.toml")
┌ Warning: Newton steps could not converge and algorithm is not adaptive. Use a lower dt.
└ @ SciMLBase C:\Users\visser_mn\.julia\packages\SciMLBase\wvDeR\src\integrator_interface.jl:612
Model(ts: 2, t: 2020-01-01T00:00:00)

julia> model.integrator.sol.stats
DiffEqBase.Stats
Number of function 1 evaluations:                  8803
Number of function 2 evaluations:                  0
Number of W matrix evaluations:                    1
Number of linear solves:                           10
Number of Jacobians created:                       1
Number of nonlinear solver iterations:             10
Number of nonlinear solver convergence failures:   1
Number of rootfind condition calls:                0
Number of accepted steps:                          0
Number of rejected steps:                          0

Additionally it would be nice to locate the nodes in the network causing the issues easily. One thing @Huite suggested was to calculate some stability criterium, like the Basin retention time or courant number.

The goal of this issue is to collect some tools, some of which could be useful for the documentation.

visr commented 1 year ago

We already depend on TimerOutputs but currently don't really use it, but if we hook it up properly the results can be really helpful, like for https://github.com/trixi-framework/Trixi.jl.

 ────────────────────────────────────────────────────────────────────────────────────
              Trixi.jl                      Time                    Allocations      
                                   ───────────────────────   ────────────────────────
         Tot / % measured:              595ms /  84.3%           29.3MiB /  50.9%

 Section                   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────
 I/O                            3    290ms   57.8%  96.7ms   6.91MiB   46.4%  2.30MiB
   ~I/O~                        3    189ms   37.6%  63.0ms    948KiB    6.2%   316KiB
   save solution                2    101ms   20.1%  50.4ms   5.84MiB   39.2%  2.92MiB
   get element variables        2    380μs    0.1%   190μs    151KiB    1.0%  75.7KiB
   save mesh                    2    400ns    0.0%   200ns     0.00B    0.0%    0.00B
 analyze solution               2    188ms   37.4%  93.9ms   7.86MiB   52.7%  3.93MiB
 rhs!                          91   24.3ms    4.8%   267μs    137KiB    0.9%  1.51KiB
   reset ∂u/∂t                 91   21.7ms    4.3%   239μs     0.00B    0.0%    0.00B
   volume integral             91    776μs    0.2%  8.53μs   25.6KiB    0.2%     288B
   ~rhs!~                      91    459μs    0.1%  5.04μs   9.33KiB    0.1%     105B
   interface flux              91    437μs    0.1%  4.80μs   29.9KiB    0.2%     336B
   prolong2interfaces          91    366μs    0.1%  4.03μs   27.0KiB    0.2%     304B
   surface integral            91    264μs    0.1%  2.90μs   24.2KiB    0.2%     272B
   Jacobian                    91    223μs    0.0%  2.45μs   21.3KiB    0.1%     240B
   prolong2boundaries          91   10.9μs    0.0%   120ns     0.00B    0.0%    0.00B
   prolong2mortars             91   10.3μs    0.0%   113ns     0.00B    0.0%    0.00B
   mortar flux                 91   9.50μs    0.0%   104ns     0.00B    0.0%    0.00B
   source terms                91   6.50μs    0.0%  71.4ns     0.00B    0.0%    0.00B
   boundary flux               91   6.20μs    0.0%  68.1ns     0.00B    0.0%    0.00B
 calculate dt                  19   6.60μs    0.0%   347ns     0.00B    0.0%    0.00B
 ────────────────────────────────────────────────────────────────────────────────────

Huite commented 7 months ago

We have somewhat extensive experience with slowly running models in variable density SEAWAT simulations, since the solute transport component MT3D does adaptive timestepping. Without exception, large flows in small cells are a problem. The way we analyze those is by analyzing flow output and computing flow velocities.

The same is probably true for Ribasim: if you have output, you can color edges by computed flows. You can also compute derived data, such as average residence times (current storage / current outflow). These can be easily plotted on a map, and will likely help identifying which nodes are problematic.

gijsber commented 7 months ago

Flow velocities are also appreciated output from an ecological perspective (as indicated by Ellis Penning)

Huite commented 7 months ago

Another thought: we're currently relying on DifferentialEquations.jl machinery to do the timestepping for us.

As far as I know, it compares a higher and lower order to estimate whether time step sizes have to be adjusted. Essentially, I expected it to compare du_lower and du_higher. If we can investigate these a bit and get access to them, we should be able to output them and interpret them spatially.

This should help us and users to identify what the problematic nodes and connections are.

SouthEndMusic commented 5 months ago

I asked about this on the Julia discourse:

https://discourse.julialang.org/t/find-out-which-state-s-cause-convergence-problems/113738

SouthEndMusic commented 5 months ago

Chris (the man, the myth, the legend) has responded. Something like this works (for certain algorithms):

(; cache, p) = model.integrator
(; basin) = p
if hasproperty(cache, :atmp)
    storage_error = abs.(cache.atmp.storage)
    perm = sortperm(storage_error, rev=true)
    println("Basins in descending order of being a convergence bottleneck:")
    for i in perm
        node_id = basin.node_id.values[i]
        error = storage_error[i]
        println("$node_id, (error = $error)")
    end
end

though this should probably be logged, and we should have a sensible threshold value for when the error is large enough to mention. This also relates to how these error values should be interpreted, which I don't know exactly how to do yet.

_Edit: they come from calculate_residuals. Maybe the aforementioned threshold can simply be reltol._

Deltares / Ribasim

Show convergence bottleneck nodes to users #452