Open hongkai-dai opened 3 years ago
Here is the start of the relevant Slack thread (ignore the earlier parts of that thread).
cc @edrumwri for thoughts
I suspect this is due to the calculation of "next update time" for the two periodic events differing by roundoff error when they should be coincident (e.g. small steps were observed at 0.15s, .3s, .35s). Ideally we would consider these events simultaneous and avoid taking the ridiculously small steps.
Some thoughts on fixing this odd behavior:
I would be keen to see a standalone example here. The hypothesis sounds very believable, but when I ran a few 200 Hz vs 60 Hz double steps on paper, I didn't see any epsilons. Maybe I screwed up the math, though. In any case, a small test would provide something easy to iterate on.
that would be like declaring that Drake will consider events separated in time by some small tolerance to be simultaneous.
There's a subtlety I'd like to emphasize. Imagine your window is Δ. I'll define δ = 0.9999Δ. Imagine I have a sequence of events [0, δ, 2δ]. Each successive pair is "separated in time" by the same small tolerance, but the first and third events are not. When you talk about this kind of strategy, we need to recognize it's not a transitive relationship.
I think what we really mean is that as we accumulate the events, we want to track all events in the interval [t₀, t₀ + Δ), where t₀ is the earliest time of all events declared. That would guarantee we'd get events at 0 and δ, but not 2δ.
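A sketch of that windowing rule, for concreteness (DELTA here is a hypothetical simultaneity window, not an existing Drake constant):

```python
DELTA = 1e-12           # hypothetical simultaneity window
delta = 0.9999 * DELTA  # the near-miss spacing from the example above

def simultaneous_group(times):
    """Collect every candidate event time in [t0, t0 + DELTA), where t0 is
    the earliest candidate. With events [0, delta, 2*delta] this keeps 0
    and delta but excludes 2*delta, as described above."""
    t0 = min(times)
    return [t for t in times if t < t0 + DELTA]

print(simultaneous_group([0.0, delta, 2 * delta]))  # [0.0, delta]
```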
Hmmm, that's a good point. It seems unsatisfying in the (0, δ, 2δ) sequence to say that events (0,δ) are simultaneous but events (δ,2δ) are not even though both sets are exactly as close together in time.
I think we want to engineer a transitive solution. Not sure how, but there is a Principle of Least Astonishment issue to address here!
cc @rpoyner-tri since you've been in this code most recently
p0 = 1 / 200.0
p1 = 1 / 60.0
for i in range(20):
    print(i, p1 * (3 * i) == p0 * (10 * i))
(0, True)
(1, True)
(2, True)
(3, True)
(4, True)
(5, True)
(6, True)
(7, False)
(8, True)
(9, True)
(10, True)
(11, True)
(12, True)
(13, True)
(14, False)
(15, True)
(16, True)
(17, True)
(18, True)
(19, False)
A transitive solution must inevitably allow for arbitrarily long windows. What we might do is treat simultaneity as transitive up to a certain point and then throw.
Hah! I didn't go far enough. I stopped at 5.
I was the same and then I got bored and made the computer do the work for me. So, I think we can say we have a good hypothesis!
As a footnote: here are the Falses for the first 100 steps -- good luck predicting when they happen:
(7, False)
(14, False)
(19, False)
(23, False)
(28, False)
(33, False)
(37, False)
(38, False)
(46, False)
(51, False)
(56, False)
(61, False)
(66, False)
(69, False)
(71, False)
(74, False)
(76, False)
(79, False)
(87, False)
(92, False)
(97, False)
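For reference, the list above can be reproduced with a one-liner (Python 3, IEEE-754 doubles as in CPython):

```python
# Indices i in the first 100 steps where the two float schedules disagree,
# even though the exact real values 3*i/60 and 10*i/200 are identical.
p0, p1 = 1 / 200.0, 1 / 60.0
falses = [i for i in range(100) if p1 * (3 * i) != p0 * (10 * i)]
print(falses)
```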
If the "simultaneity tolerance" is small enough, we could consider full transitivity I think. For example say we pick 2ε (4e-16). If we had 100 of those in a sequence, the first and last would be separated in time by 4e-14 seconds -- that's .00004 nanoseconds!
Another quirk (I edited in above) is that the Simulator must advance time to the latest of the "simultaneous" events.
That's fine. I'd distinguish between that specific application and the fact that nothing in the code would prevent a billion such times in sequence. If we feel it's reasonable to lump them together up to a point, then we should enforce a limit if we ever go beyond that point.
Transitivity + "Simulator must advance time to the latest of the "simultaneous" events."
The interval can become arbitrarily large. Events at the beginning of the interval must not be so sensitive that they can't tolerate actually being evaluated at the time at the end of that interval.
for i in range(20):
    diff = abs(p1 * (3 * i) - p0 * (10 * i))
    print(i, diff <= 4.4e-16)
(0, True)
(1, True)
(2, True)
(3, True)
(4, True)
(5, True)
(6, True)
(7, True)
(8, True)
(9, True)
(10, True)
(11, True)
(12, True)
(13, True)
(14, True)
(15, True)
(16, True)
(17, True)
(18, True)
(19, True)
Important question: as i gets larger, when does that break? We should confirm that it still works for thousands of time steps.
The interval can become arbitrarily large.
I'm not sure how that could happen in a real application. Do you have a scenario in mind? Is there even a plausible buggy System that could generate an arbitrarily-long sequence of events separated by roundoff-noise time intervals?
Important question: as i gets larger, when does that break?
Good point -- the simultaneity rule needs to be relative, not just absolute.
Even at i=46 (t=2.3 seconds), we already cross that 4.4e-16 tolerance.
I wonder if allowing the leaf system to declare its frequency, instead of its period, would make the numerics work out better.
Here is a version more like I would actually implement it, with relative tolerance at large times:
p0 = 1 / 200.0
p1 = 1 / 60.0
atol = 4.4e-16  # absolute and relative tolerance
maxadiff = 0    # max absolute difference
maxrdiff = 0    # relative error at maxadiff
maxtime = 0     # time at maxadiff
maxi = 0        # value of i at maxadiff
for i in range(1000000):
    t1 = p1 * (3 * i)
    t2 = p0 * (10 * i)
    tol = atol * max(1, t1)
    diff = abs(t1 - t2)
    if diff > maxadiff:
        maxi = i
        maxadiff = diff
        if t1 > 0:
            maxrdiff = diff / t1
            maxtime = t1
    if diff > tol:
        print(i, t1, diff)
print("max i,time,adiff,rdiff=", maxi, maxtime, maxadiff, maxrdiff)
The output is:
max i,time,adiff,rdiff= 699054 34952.7 7.275957614183426e-12 2.0816582450521495e-16
That is, at t=35000s it considers events to be simultaneous if they occur within .007ns of one another, which is a relative difference in time of one part in 5e15.
Per a creative Nimmer idea, imagine if the visualizer ran at 64Hz instead of 60Hz. In that case there would be no roundoff error. I modified the above program to do that (replace 3,10 with 8,25). Here's the output:
max i,time,adiff,rdiff= 0 0 0 0
!
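For concreteness, here is a minimal version of that 64 Hz check (the "replace 3,10 with 8,25" modification: 200 Hz and 64 Hz coincide every 1/8 s, i.e. at tick 25 of the 200 Hz schedule and tick 8 of the 64 Hz schedule):

```python
# With a 64 Hz visualizer the float schedules agree exactly: 1/64 is a
# power of two, and the thread's run above found zero mismatches over a
# million coincidences.
p0, p1 = 1 / 200.0, 1 / 64.0
mismatches = [i for i in range(1000000) if p1 * (8 * i) != p0 * (25 * i)]
print(len(mismatches))  # 0, matching the maxadiff=0 result above
```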
I'm not sure how that could happen in a real application.
Proof by incredulity? I can't believe it can happen, so it can't?
All I know is that the concept you're proposing has a particular undesirable property. Betting that no one will run up against that property seems optimistic. I'd have to be persuaded that testing whether the assumption has been invalidated is too expensive to do in practice before just relying on bad outcomes being "improbable".
Last question: so, 60 Hz and 200 Hz have a particular pattern. Why don't we have to worry about other combinations of frequencies? Tuning for this combination hardly feels like a true solution.
My suggestion on slack of playing games with the rates was in the context of a hack to quickly mitigate the problem in one particularly critical simulation of interest.
I do think this issue is real and fundamental, and we should solve it in a robust way, for a wide range of rates.
Counterargument by sarcasm isn't that convincing either :)
Lots of bad things can conceivably happen (unreasonably large vectors being normalized for a recent example). We aren't obligated to catch them all, especially when doing so is difficult (and here we would have to invent a cutoff interval, justify it, and document it). If we don't do that, we can simply say that events that occur within, say, a fraction of a nanosecond of one another will be considered simultaneous, with justification being that we support robotics applications and not molecular ones. I don't have a strong objection to putting in some kind of bugcatcher but it doesn't seem necessary to me. We could wait until the problem occurs in practice before trying to fix it; I speculate that that will never happen.
Not sure what you mean by optimizing for 60 & 200 Hz. The proposed fix is agnostic to the frequencies. Can you elaborate?
Marginally related: https://github.com/RobotLocomotion/drake/issues/10497#issuecomment-833818637
It wasn't sarcasm. It was another form of irony: "hyperbole". :) That said, while I agree we can't put fingers in every hole in the dike, introducing this concept of false transitivity is poking a hole intentionally. It seems it should come with a bucket by design.
Optimizing for 60 Hz and 200 Hz
The flavor that is occupying my mind is adjacent to the concept of things being relatively prime. In your code example, you used a tolerance of 4.4e-16 scaled by how far away the value was from zero. It worked for frequencies 200 and 60. Could that be related to the common factors of 200 and 60? Does it get better if the common factors are larger? Smaller? Uglier? Or is the tolerance appropriate for all possible prime factors, and it's just the frequency at which the tolerance kicks in that changes? I'm asking because my intuition doesn't give me a confident answer.
I can imagine that the larger the smallest (largest?) common factor, the bigger the possible error between two otherwise "identical" values of t. But that "largeness" might already be consumed by the scale factor you have for computing relative error.
cc @edrumwri for thoughts
I'll remind you gents that I only check Drake's github notifications every few days. If you want a same-day response from me, ping me over Slack or email.
Sherm, your early analysis tracked exactly with my initial thoughts, and you covered one of my later thoughts too (relative vs. absolute tolerances). You guys have obviously spent quite a bit of time on the analysis at this point, so I don't have anything new to add except some philosophy.
I'd like to understand the Drake team's concern here: is it one of efficiency (the small step), one of hard-to-explain behavior (the sometimes-simultaneous fire), calculations that don't give an expected result (the simulation trace looks vastly different for small deltas in frequencies), or some combination?
I might think the most surprising thing here from a user perspective is that the frequencies are integral, so 200 Hz and 60 Hz might be expected to collide at integral intervals. Except that to see the least common multiple, you have to turn those into periods, at which point it's obvious that the 1/60 does not give a precise representation in binary (or base 10). At that point, "it's floating point arithmetic - duh!" So, should we expect the user to understand so little about floating point representations that we do something sophisticated to compensate? At least Sherm and I both seemed to be able to diagnose the problem immediately (and I expect that everyone else in on this discussion got it pretty quickly too).
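The "no precise representation in binary" point is easy to check from a Python prompt (`float.hex` shows the exact bits of a double):

```python
from fractions import Fraction

# 1/64 s is a power of two, so it is exact in binary floating point;
# 1/60 s is a non-terminating binary fraction and gets rounded.
print((1 / 64).hex())  # 0x1.0000000000000p-6 -- exact
print((1 / 60).hex())  # mantissa is rounded; not exactly 1/60
print(Fraction(1 / 64) == Fraction(1, 64))  # True
print(Fraction(1 / 60) == Fraction(1, 60))  # False
```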
If the primary motivation here is "solving" the principle-of-least-astonishment problem, I'm worried that the result will be a lot of code with a tricky solution that might be better addressed by documentation alone.
FYI, we recently hit this @joemasterjohn.
We do understand the problem now.
However, we were surprised by this, and in a diagram with lots of components it was a challenge for us to dig down and find which system was responsible for introducing a period slightly different from the one introduced by the plant.
I think I agree with Evan here; attempting to fix this might lead to even darker and much harder-to-test code.
However, it'd seem that in these cases the tiny steps are on the order of machine epsilon? Would it be worth at least emitting a one-time warning (per system/update?) when a machine-epsilon time step gets taken? That way the user can quickly find the systems with, say, periods 0.01 and 0.05, and then decide whether to change them or not.
I agree that making this easier to diagnose would be a nice benefit.
If the typical cause of this symptom is systems with periodic updates (rather than, e.g., witness functions), then we can compute from the Diagram itself up front whether there will be 1-ulp steps in its schedule, by taking an inventory of the set of periods and doing some math. We would offer that report as some helper function of a Diagram. It could specifically point out which system names and periods cause various problems. I like doing it statically and comprehensively across the entire update schedule, instead of waiting for runtime and just catching the first one.
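A sketch of what that static check might look like (a hypothetical helper, not an existing Drake API; it recovers the intended rational periods via `limit_denominator` and compares the exact coincidence schedule against the float one):

```python
import math
from fractions import Fraction

def report_ulp_steps(periods, horizon=10.0):
    """Hypothetical static check: for each pair of declared periods, find
    times where events should coincide exactly (in rational arithmetic)
    but the floating-point schedules differ by a tiny nonzero amount."""
    hits = []
    for i, pa in enumerate(periods):
        for pb in periods[i + 1:]:
            # Recover the rational periods the user presumably intended.
            ra = Fraction(pa).limit_denominator(10**9)
            rb = Fraction(pb).limit_denominator(10**9)
            # Exact coincidences occur at multiples of lcm(ra, rb).
            lcm = Fraction(math.lcm(ra.numerator, rb.numerator),
                           math.gcd(ra.denominator, rb.denominator))
            t = lcm
            while t <= horizon:
                na, nb = int(t / ra), int(t / rb)  # exact tick counts
                ta, tb = na * pa, nb * pb          # float schedules
                if ta != tb:
                    hits.append((pa, pb, float(t), abs(ta - tb)))
                t += lcm
    return hits
```

For the 200 Hz / 60 Hz pair over one second this reports the same mismatches found empirically above (coincidence indices 7, 14, and 19).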
I also wonder if a more helpful fix overall would be to offer an API to declare a periodic event using its frequency (i.e., 10.0 Hz or 100.0 Hz) instead of the period. I think we might be able to implement that in such a way that we avoid ulp-steps. Most of the problems here come from people using, say, 0.01 and assuming that means 100.0 Hz, when in fact it does no such thing.
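A small check of why a frequency-based API could help, assuming (my assumption, not a stated design) that the schedule would be computed as tick_count / frequency with a single correctly rounded division, rather than by multiplying or accumulating the rounded period:

```python
# Both sides compute the same real number i/20 with one correctly rounded
# IEEE division, so the results are bit-identical for every i (up to the
# point where the integer tick counts exceed 2**53).
f_controller, f_visualizer = 200.0, 60.0
assert all((10 * i) / f_controller == (3 * i) / f_visualizer
           for i in range(1000000))
print("no mismatches")
```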
I have a simulation that runs two periodic events, one at 200 Hz (the controller) and another at 60 Hz (the visualizer). These two events collide at 20 Hz.
With the RK3 integrator, I sometimes observe that the time step in the DoStep function is tiny, on the order of 1E-17. This happens when the two events collide (but not every time they collide). @sherm1 thinks this is because of roundoff error when computing the timing. To reproduce the error, add the following lines to simulator.cc
And run the anzu simulation with the following steps
The simulation will print out some message onto the screen, including the simulation time, simulation step and the triggered event.
@sherm1 @jwnimmer-tri