SciML / OrdinaryDiffEq.jl

High performance ordinary differential equation (ODE) and differential-algebraic equation (DAE) solvers, including neural ordinary differential equations (neural ODEs) and scientific machine learning (SciML)
https://diffeq.sciml.ai/latest/

t approaching NaN not handled by perform_step/composite_perform_step.jl #643

Open daviehh opened 5 years ago

daviehh commented 5 years ago

[Sorry for being unable to share the exact set of ODEs that causes this problem; mathematically, it's several hundred coupled ODEs, many of which stay very close to zero and are subject to fluctuations.]

Problem encountered: possibly due to the wild fluctuations, with adaptive time steps the value of t passed to the problem function can become NaN. This is not handled by the ODE solver itself but surfaces as a Julia error (e.g. when there's an interpolation function inside the problem equation, it gives the error message that you are attempting to access the function at index NaN). This is possibly because dt can become very small, e.g. with this callback (output truncated to %1.3f):

uam(x) = maximum(map(abs,x))   # largest absolute value in the state
cb = FunctionCallingCallback((u,t,integrator)->println(integrator.t,":",uam(integrator.u)))

sample output:

0.003:1.000 0.028:1.000 0.199:0.998 ...

[okay so far]

16.824:0.840 16.904:0.835 16.976:0.874 17.027:7.875 ...

[begin to diverge]

17.072:52.823 17.115:304.060 17.155:2068.261 17.192:73776.659 17.228:1889676.893 17.262:37566434.851 17.295:590385235.761 17.328:8037746803.093 17.359:92405617354.253 17.390:964878993043.117 17.419:41835928438926.406 17.446:2472520661018154.500 17.474:112489881276322096.000 17.500:3996033157284148736.000 17.526:114081425160125120512.000 17.551:2717053292667096530944.000 17.576:62812605066626635661312.000 17.599:2672990280005583124824064.000 17.622:155542018983915024310861824.000 17.643:7817184444140220891879440384.000 17.665:328439203693578754591230525440.000 17.685:12090935523820259538630623625216.000 17.705:495844916126253879281704855666688.000 17.724:27184979927934443122093490106793984.000 17.742:1486867133374745438129052541228417024.000 17.759:73013487105283215710103760027101691904.000 17.776:3518034441013199899526820710066706972672.000 17.792:192144047767335185103262231743436242812928.000 17.808:10925567602135435428146432593002051856760832.000 17.822:596010402945870701122856748680812433053319168.000

[then, dt essentially stops advancing..]

... 18.169:34375374839930768530674540562825177631705562025210476594184705868216112233552706284908707206431723288577053474940936605766000372397030396634195598716088314921452908585378583738013087843577553807472269315131364239708509578058031244708409051569543803559970581885761205455536739695140732655904426557440.000 18.169:1619942826141944245683980724052727785353433555895585391326912784384272416806804474718168437001942604385507894863219597994177859695652188153341672613924625336603286709544583863278663020786414951684025228032853104784429752027323141277007058808118986371636329361148691572789213143440033527888552992440320.000

[ and julia throws the error message]

ERROR: LoadError: BoundsError: attempt to access 601-element scale(interpolate(OffsetArray(::Array{Complex{Float64},1}, 0:602), BSpline(Cubic(Line(Interpolations.OnGrid())))), (0.000:0.050:30.000,)) with element type Complex{Float64} at index [NaN]

where the scaled interpolation function lives inside my ODE problem function. (Order of magnitude for the last returned data: u = ...e+300 with dt = 1.4...e-05.)

It's a bit of an edge case, but maybe this can be checked inside composite_perform_step.jl so that the error is reported through the retcode of the ODE solver? Also, is there a callback one can define to force the ODE solver to stop when some function g(u) returns false, e.g. when one expects the abs of u to stay smaller than 1, but the solver at some time t gives a value larger than the threshold? Then forcibly stop the solver and set a corresponding sol.retcode. Thanks!

ChrisRackauckas commented 5 years ago

Also, is there a callback one can define to force the ODE solver to stop when some function g(u) returns false, e.g. when one expects the abs of u to stay smaller than 1, but the solver at some time t gives a value larger than the threshold? Then forcibly stop the solver and set a corresponding sol.retcode.

Basically. Look at:

http://docs.juliadiffeq.org/latest/basics/integrator.html#Stepping-Controls-1

In there, there's terminate!(integrator), which terminates the integration. You can put that in the affect! function of a callback.
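
For concreteness, a minimal runnable sketch of that pattern (the toy ODE and the threshold are placeholders, not from this issue):

using OrdinaryDiffEq

thres = 1.0
condition(u, t, integrator) = maximum(abs, u) > thres   # when to fire
affect!(integrator) = terminate!(integrator)            # what to do: stop the solve
cb = DiscreteCallback(condition, affect!)

prob = ODEProblem((u, p, t) -> 1.5 .* u, [0.5], (0.0, 10.0))   # hypothetical growing ODE
sol = solve(prob, Tsit5(), callback = cb)
println(sol.retcode)   # Terminated once the threshold is crossed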

daviehh commented 5 years ago

In there, there's terminate!(integrator), which terminates the integration. You can put that in the affect! function of a callback.

@ChrisRackauckas Thanks a lot for the suggestion! Quick question: since I'm not trying to modify u or t with the callback but just to terminate the solver, is the affect! necessary? I.e., with respect to just stopping the integrator, is there any difference (safety/performance) between

thres = 1.0
cb = FunctionCallingCallback((u,t,int) -> if maximum(map(abs,u))>thres terminate!(int) end)

and the corresponding

cb = DiscreteCallback((u,t,int) -> maximum(map(abs,u)) > thres, (int) -> terminate!(int))

daviehh commented 5 years ago

Is there some documentation for the cache output? full_cache returns empty () and get_tmp_cache returns a vector of the same size as u.

By putting something like println(t,u) in the problem function and also printing t through the function calling callback, it looks like at some time, e.g. t=10 as reported by the callback, there are many calls to the problem function at various times after 10, likely to calculate k2 through k7 as in e.g. here. Is there a way to access the k/a/c values there from the cache? It looks like in certain cases, c1*dt...c6*dt may exceed the tspan/become NaN.

ChrisRackauckas commented 5 years ago

It looks like in certain cases, c1*dt...c6*dt may exceed the tspan/become NaN.

Those values cannot exceed tspan because the c's are less than 1 and dt <= tspan[2] - t. Unless you're doing an early termination at a callback, in which case there's no way to know when you're at that boundary without going over it since it's an implicit problem.

there are many calls to the problem function at various times after 10, likely to calculate k2 through k7

integrator.k[i].

Is there some documentation for the cache output? full_cache returns empty () and get_tmp_cache returns a vector of the same size as u.

http://docs.juliadiffeq.org/latest/basics/integrator.html#Caches-1. It sounds like you're hitting a bug if full_cache is empty. What algorithm are you using? I would suggest choosing something like Rodas5().
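
A hedged sketch of poking at those caches through the integrator interface (the toy problem here is arbitrary, not the reporter's setup):

using OrdinaryDiffEq

prob = ODEProblem((du, u, p, t) -> (du .= -u), [1.0, 2.0], (0.0, 1.0))
integrator = init(prob, Rodas5(autodiff = false))
step!(integrator)

for c in full_cache(integrator)              # iterate the method's u-sized cache arrays
    println(summary(c))
end
println(length(get_tmp_cache(integrator)))   # user-facing temporary arrays
println(integrator.k)                        # stage/interpolation data for the last step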

daviehh commented 5 years ago

@ChrisRackauckas I'm using the AutoVern8(Rosenbrock23(autodiff=false)) solver. This is not a great way to debug, but by inserting println("p_step!dt:",dt,"+:",t) here, and having just println("get_dt",get_proposed_dt(integrator)) in my callback, it looks like dt goes to NaN in a call to perform_step!() that happens without invoking the callback. I.e., normally I get a "get_dt" output followed by a single "p_step!dt" output from perform_step!(). Right before the error occurs, I get a get_dt output from the callback, followed by a perform_step! output with a reasonable dt, then another output from perform_step! with dt=NaN. I originally printed c2*dt, so I thought c2 was the issue, but now it looks like dt is somehow set to NaN during the calculation. What debugging steps can you recommend? Thanks a lot for your help!

ChrisRackauckas commented 5 years ago

I wouldn't debug with a stiffness-switching algorithm. I would just do Rosenbrock23(autodiff=false) to keep it simple. That's a straightforward L-stable method.

Your issue is likely occurring here: https://github.com/JuliaDiffEq/OrdinaryDiffEq.jl/blob/ee977e89e7b09d05bb99c6ca507dd8aca4e03b97/src/perform_step/verner_rk_perform_step.jl#L586-L590. Basically, the way adaptivity works is that it steps two methods and then uses their difference to get an error estimate. The issue is likely that, since the last stages are evaluated where that value is >1, they land where your derivative doesn't have a good solution, so the last few ks are NaN, then the EEst is NaN, and then the dt based on EEst becomes NaN.
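
Loosely, a toy numeric sketch of that failure chain (not the actual OrdinaryDiffEq internals; the numbers are made up) showing how one NaN stage poisons the proposed dt:

k_stages = [1.0, 0.9, NaN, 1.1]               # hypothetical RK stage values, one undefined
err = sum(abs, k_stages) / length(k_stages)   # stand-in for the embedded error estimate
EEst = err / (1e-6 + 1e-3)                    # normalized by abstol/reltol-style tolerances
dt_new = 0.01 * (1 / EEst)^(1 / 5)            # order-based step-size update
println((EEst, dt_new))                       # (NaN, NaN): the next dt proposal is NaN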

To fix this, make your model always well defined. Use min(x,1) where it's an issue. That won't affect the integration where x<1, and then if you have the callback searching for x>=1, it'll stop the integration when x=1, so the extra part won't actually affect the solution.
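
A minimal sketch of that min + callback pattern (the scalar model and the threshold at 1 are placeholders, not the issue's actual equations):

using OrdinaryDiffEq

# sqrt(1 - u) would be NaN once u > 1, so guard the model with min(u, 1.0)
f(u, p, t) = 2.0 + sqrt(1 - min(u, 1.0))

condition(u, t, integrator) = u - 1.0          # root at the threshold u = 1
affect!(integrator) = terminate!(integrator)   # stop cleanly at the boundary
cb = ContinuousCallback(condition, affect!)

prob = ODEProblem(f, 0.0, (0.0, 10.0))
sol = solve(prob, Tsit5(), callback = cb)
println((sol.retcode, sol.u[end]))   # terminated with u[end] ≈ 1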

daviehh commented 5 years ago

@ChrisRackauckas Thanks for the suggestions! Do you think it's better to add a guard to all function calls made to the problem function, e.g.

not_in_tspan(t,tspan) = (t<tspan[1] || t>tspan[2])
if not_in_tspan(t,tspan) || any(isnan,u)
 [terminate_solver_routine]
end

or

if not_in_tspan(t,tspan) || any(isnan,vcat(u,du))
 [terminate_solver_routine]
end

for the in-place variant? Two points to consider:

using DifferentialEquations

g(t) = t>=0.0 ? -3 : nothing   # deliberately returns nothing (i.e. is undefined) for t<0

function f(du,u,p,t)
    if t<0
        println("prob not defined at $t")   # flags any call made outside tspan
    end
    du[1] = -1*u[1]+g(t)
    du[2] = -2*u[2]
end

u0 = [1.2; 0.3]
tspan = (0.0,30.0)

prob = ODEProblem(f,u0,tspan)
alg = Rodas5(autodiff=false)

sol = solve(prob,alg)

println(sol.retcode)

Note that adding a callback to deal with t<0 does not work, since not all calls to the problem function go through the callback, e.g.

function cbf(u,t,integrator)
    if t<0
        println("t-")
        terminate!(integrator)
    end
end
cb = FunctionCallingCallback(cbf)
sol = solve(prob,alg,callback=cb)

This is relevant since the problem function may involve, e.g., some interpolation function gi(t) inside, and gi() is not defined outside tspan, which causes an error not handled by the solver but reported by Julia.

The second point: with autodiff left on (alg = Rosenbrock23()), I instead get output and an error like

Dual{ForwardDiff.Tag{DiffEqDiffTools.TimeGradientWrapper{ODEFunction{true,typeof(my_function),...... }},Array{Int64,2},Int64,Int64,Float64,Float64}}},Float64}}(0.000e+00,1.000e+00)

ERROR: LoadError: MethodError: no method matching Float64(::ForwardDiff.Dual{ForwardDiff.Tag{DiffEqDiffTools.UJacobianWrapper{ODEFunction{true,typeof(my_function),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Float64,...}},Array{Int64,2},Int64,Int64,Float64,Float64}}},Float64},Float64,11})

Any pointers about what may cause that? The solver used is alg = Rosenbrock23().

ChrisRackauckas commented 5 years ago

Ooooh, I was wondering how it would end up outside of your time interval. Now it all makes sense. When it does numerical differencing with respect to t in order to get the t-gradient, then yes, it goes slightly forward and slightly backward in time. It sounds like your model cannot handle that, so you'd not want to use autodiff=false.

For autodiff to work, you'd need to make sure that your cache arrays are defined in such a way that they can handle Dual numbers. See this discussion: http://docs.juliadiffeq.org/latest/basics/faq.html#I-get-Dual-number-errors-when-I-solve-my-ODE-with-Rosenbrock-or-SDIRK-methods...?-1
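
For illustration, a minimal sketch of the cache issue the FAQ describes (the model here is made up): a preallocated Float64 buffer can't hold Dual numbers, so build temporaries from u's element type instead.

using OrdinaryDiffEq

function f!(du, u, p, t)
    tmp = similar(u)        # eltype follows u, so ForwardDiff Duals fit (allocates per call)
    @. tmp = u^2
    @. du = -tmp + sin(t)
    nothing
end

prob = ODEProblem(f!, [1.0, 0.5], (0.0, 1.0))
sol = solve(prob, Rosenbrock23())   # autodiff (the default) now goes through
println(sol.retcode)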

Now the easiest fix would be to just use a method for stiff equations which doesn't require computing a time-wise gradient. Any non-Rosenbrock method has that property; for example, KenCarp4 may be worth a try, or even CVODE_BDF should work here.
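
A short sketch of trying those alternatives (the stiff problem below is a stand-in; Sundials.jl is needed for CVODE_BDF):

using OrdinaryDiffEq
# using Sundials   # uncomment for CVODE_BDF

prob = ODEProblem((du, u, p, t) -> (du .= -1000.0 .* (u .- cos(t))), [0.0], (0.0, 1.0))
sol = solve(prob, KenCarp4())      # ESDIRK method: no time-wise gradient required
# sol = solve(prob, CVODE_BDF())   # BDF from Sundials, also avoids the t-gradient
println(sol.retcode)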