Integrators should check for derivatives containing NaN and reject

sherm1 commented 4 years ago

Our integrators use EvalTimeDerivatives() which may return with NaN values for many reasons. Continuing integration at that point can cause awful problems, such as an infinite sequence of 0-length steps. Eventually the results are all-NaN and users know something is wrong, but by then it is very difficult to find the root cause.

All the integrators should detect NaN derivatives and respond appropriately. A fixed-step integrator should report a message and halt. The message should contain as much useful info as possible -- e.g. the offending subsystem, type of derivative (qdot, vdot, zdot) and first bad element. Variable step integrators should treat the NaN as an indication that the attempted step was too big, and reduce/retry until the min step size is reached. At that point it should fail with the same message as the fixed-step integrator does.

Note that there have been a number of improvements to integrator NaN handling (see issue #12201, PRs #12217 #12263). However, none of those notice NaN derivatives and a NaN-driven infinite loop at t=0 (using RK3) came up in the system discussed in #12643.

sherm1 commented 4 years ago

cc @edrumwri @antequ

amcastro-tri commented 4 years ago

Variable step integrators should treat the NaN as an indication that the attempted step was too big, and reduce/retry until the min step size is reached.

I would've thought xdot = NaN should halt whether you are a fixed-step or error controlled integrator. I do understand that if the computed x(t+h) = NaN then the integrator could attempt a smaller step (say Newton-Raphson went horribly wrong). However xdot = f(t, x) seems to always be a bad thing? it should be a continuous function of time and state, and the integrators always work within safe bounds given the simulator already isolated events and such. no?

antequ commented 4 years ago

At one point, Evan mentioned all the integrator code and checks had been specifically written to be safe against NaNs. However, because we don't have any specific tests with NaNs in the integrator test suite or any mentions in the comments, I imagine that code is very brittle and may have been (or can be easily) modified to lose that property. I suggest adding tests in the integrator test suite as well to catch this.

sherm1 commented 4 years ago

I would've thought xdot = NaN should halt whether you are a fixed-step or error controlled integrator.

The reason this should be handled differently in variable-step integrators is that they are permitted to attempt optimistic large trial steps which can be retracted. Here's a concrete example I've seen in practice:

I decide to model a ball joint (or free body) using simpler RPY coordinates rather than a quaternion, because I know that in practice the actual trajectory solution won't ever pass through the pitch singularity (say pitch never exceeds 85 degrees).
I integrate at high accuracy with a variable-step integrator.
The trajectory is smooth as it approaches its maximum pitch angle, so the integrator tries a larger step.
The larger step does hit the pitch singularity, producing a bunch of NaNs.
The integrator retracts the trial step, takes a smaller (and thus more accurate) step which remains below 85 degrees.
System now continues on with pitch reducing.

Were the integrator to halt due to the bad trial step, the correct trajectory couldn't be calculated. Only if the NaN persists down to the minimum step size should the integrator fail.

In contrast a fixed-step integrator is always running at its minimum (and maximum) step size so should fail immediately if it encounters a NaN.

sherm1 commented 4 years ago

BTW any state-dependent error should be treated like this -- if the derivative method is written more carefully it should throw or report an error if it can't evaluate, and the integrator should try a smaller step rather than die. Treating NaNs the same way makes the system more robust, since they are a common result of executing code outside its expected domain and there is no way to stop a dumb integrator trial step from generating bad trial states.

EricCousineau-TRI commented 4 years ago

As @sherm1 cross-referenced, ran into this in #12926. Here's a min repro in Python (using 458f95cea) of what @mpetersen94 encountered:

import numpy as np
from pydrake.all import (
    DiagramBuilder, ConstantVectorSource, Integrator, Simulator)

builder = DiagramBuilder()
source = builder.AddSystem(ConstantVectorSource([np.nan]))
integrator = builder.AddSystem(Integrator(1))
builder.Connect(source.get_output_port(0), integrator.get_input_port(0))
diagram = builder.Build()
simulator = Simulator(diagram)
simulator.AdvanceTo(0.)  # loops forever
print("Done")

How hard is it to have this print out a dumb error for now? e.g. "Hey, you got a NaN. Figure it out."

RobotLocomotion / drake

Integrators should check for derivatives containing NaN and reject #12647