Open hongkai-dai opened 3 years ago
Here is the start of the relevant Slack thread (ignore the earlier parts of that thread).
cc @edrumwri for thoughts
I suspect this is due to the calculation of "next update time" for the two periodic events differing by roundoff error when they should be coincident (e.g. small steps were observed at 0.15s, .3s, .35s). Ideally we would consider these events simultaneous and avoid taking the ridiculously small steps.
Some thoughts on fixing this odd behavior:
I would be keen to see a standalone example here. The hypothesis sounds very believable, but when I ran a few 200 Hz vs 60 Hz double steps on paper, I didn't see any epsilons. Maybe I screwed up the math, though. In any case, a small test would provide something easy to iterate on.
that would be like declaring that Drake will consider events separated in time by some small tolerance to be simultaneous.
There's a subtlety I'd like to emphasize. Imagine your window is Δ. I'll define δ = 0.9999Δ. Imagine I have a sequence of events [0, δ, 2δ]. Each successive pair is "separated in time" by the same small tolerance, but the first and third events are not. When you talk about this kind of strategy, we need to recognize it's not a transitive relationship.
I think what we really mean is that as we accumulate the events, we want to track all events in the interval [t₀, t₀ + Δ), where t₀ is the earliest time of all events declared. That would guarantee we'd get events at 0 and δ, but not 2δ.
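A sketch of that windowing rule, for concreteness (DELTA here is a hypothetical simultaneity window, not an existing Drake constant):

```python
DELTA = 1e-12           # hypothetical simultaneity window
delta = 0.9999 * DELTA  # the near-miss spacing from the example above

def simultaneous_group(times):
    """Collect every candidate event time in [t0, t0 + DELTA), where t0 is
    the earliest candidate. With events [0, delta, 2*delta] this keeps 0
    and delta but excludes 2*delta, as described above."""
    t0 = min(times)
    return [t for t in times if t < t0 + DELTA]

print(simultaneous_group([0.0, delta, 2 * delta]))  # [0.0, delta]
```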
Hmmm, that's a good point. It seems unsatisfying in the (0, δ, 2δ) sequence to say that events (0,δ) are simultaneous but events (δ,2δ) are not even though both sets are exactly as close together in time.
I think we want to engineer a transitive solution. Not sure how, but there is a Principle of Least Astonishment issue to address here!
cc @rpoyner-tri since you've been in this code most recently
p0 = 1 / 200.0
p1 = 1 / 60.0
for i in range(20):
    print(i, p1 * (3 * i) == p0 * (10 * i))
(0, True)
(1, True)
(2, True)
(3, True)
(4, True)
(5, True)
(6, True)
(7, False)
(8, True)
(9, True)
(10, True)
(11, True)
(12, True)
(13, True)
(14, False)
(15, True)
(16, True)
(17, True)
(18, True)
(19, False)
A transitive solution must inevitably allow for arbitrarily long windows. What we might do is treat simultaneity as transitive up to a certain point and then throw.
Hah! I didn't go far enough. I stopped at 5.
I was the same and then I got bored and made the computer do the work for me. So, I think we can say we have a good hypothesis!
As a footnote: here are the Falses for the first 100 steps -- good luck predicting when they happen:
(7, False)
(14, False)
(19, False)
(23, False)
(28, False)
(33, False)
(37, False)
(38, False)
(46, False)
(51, False)
(56, False)
(61, False)
(66, False)
(69, False)
(71, False)
(74, False)
(76, False)
(79, False)
(87, False)
(92, False)
(97, False)
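For reference, the list above can be reproduced with a one-liner (Python 3, IEEE-754 doubles as in CPython):

```python
# Indices i in the first 100 steps where the two float schedules disagree,
# even though the exact real values 3*i/60 and 10*i/200 are identical.
p0, p1 = 1 / 200.0, 1 / 60.0
falses = [i for i in range(100) if p1 * (3 * i) != p0 * (10 * i)]
print(falses)
```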
If the "simultaneity tolerance" is small enough, we could consider full transitivity I think. For example say we pick 2ε (4e-16). If we had 100 of those in a sequence, the first and last would be separated in time by 4e-14 seconds -- that's .00004 nanoseconds!
Another quirk (I edited in above) is that the Simulator must advance time to the latest of the "simultaneous" events.
That's fine. I'd distinguish between that specific application and the fact that nothing in the code would prevent a billion such times in sequence. If we feel it's reasonable to lump them together up to a point, then we should enforce a limit if we ever go beyond that point.
Transitivity + "Simulator must advance time to the latest of the "simultaneous" events."
The interval can become arbitrarily large. Events at the beginning of the interval must not be so sensitive that they can't tolerate actually being evaluated at the time at the end of that interval.
for i in range(20):
    diff = abs(p1 * (3 * i) - p0 * (10 * i))
    print(i, diff <= 4.4e-16)
(0, True)
(1, True)
(2, True)
(3, True)
(4, True)
(5, True)
(6, True)
(7, True)
(8, True)
(9, True)
(10, True)
(11, True)
(12, True)
(13, True)
(14, True)
(15, True)
(16, True)
(17, True)
(18, True)
(19, True)
Important question: as i gets larger, when does that break? We should confirm that it still works for thousands of time steps.
The interval can become arbitrarily large.
I'm not sure how that could happen in a real application. Do you have a scenario in mind? Is there even a plausible buggy System that could generate an arbitrarily-long sequence of events separated by roundoff-noise time intervals?
Important question: as i gets larger, when does that break?
Good point -- the simultaneity rule needs to be relative, not just absolute.
Even at i=46 (t=2.3 seconds), we already cross that 4.4e-16 tolerance.
I wonder if allowing the leaf system to declare its frequency, instead of its period, would make the numerics work out better.
Here is a version more like I would actually implement it, with relative tolerance at large times:
p0 = 1 / 200.0
p1 = 1 / 60.0
atol = 4.4e-16  # absolute and relative tolerance
maxadiff = 0    # max absolute difference
maxrdiff = 0    # relative error at maxadiff
maxtime = 0     # time at maxadiff
maxi = 0        # value of i at maxadiff
for i in range(1000000):
    t1 = p1 * (3 * i)
    t2 = p0 * (10 * i)
    tol = atol * max(1, t1)
    diff = abs(t1 - t2)
    if diff > maxadiff:
        maxi = i
        maxadiff = diff
        if t1 > 0:
            maxrdiff = diff / t1
            maxtime = t1
    if diff > tol:
        print(i, t1, diff)
print("max i,time,adiff,rdiff=", maxi, maxtime, maxadiff, maxrdiff)
The output is:
max i,time,adiff,rdiff= 699054 34952.7 7.275957614183426e-12 2.0816582450521495e-16
That is, at t=35000s it considers events to be simultaneous if they occur within .007ns of one another, which is a relative difference in time of one part in 5e15.
Per a creative Nimmer idea, imagine if the visualizer ran at 64Hz instead of 60Hz. In that case there would be no roundoff error. I modified the above program to do that (replace 3,10 with 8,25). Here's the output:
max i,time,adiff,rdiff= 0 0 0 0
!
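For concreteness, here is a minimal version of that 64 Hz check (the "replace 3,10 with 8,25" modification: 200 Hz and 64 Hz coincide every 1/8 s, i.e. at tick 25 of the 200 Hz schedule and tick 8 of the 64 Hz schedule):

```python
# With a 64 Hz visualizer the float schedules agree exactly: 1/64 is a
# power of two, and the thread's run above found zero mismatches over a
# million coincidences.
p0, p1 = 1 / 200.0, 1 / 64.0
mismatches = [i for i in range(1000000) if p1 * (8 * i) != p0 * (25 * i)]
print(len(mismatches))  # 0, matching the maxadiff=0 result above
```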
I'm not sure how that could happen in a real application.
Proof by incredulity? I can't believe it can happen, so it can't?
All I know is that the concept you're proposing has a particular undesirable property. Betting that no one will run up against that property seems optimistic. I'd have to be persuaded that testing whether the assumption has been invalidated is too expensive to do in practice before just relying on bad outcomes being "improbable".
Last question: so, 60 Hz and 200 Hz have a particular pattern. Why don't we have to worry about other combinations of frequencies? Tuning for this combination hardly feels like a true solution.
My suggestion on slack of playing games with the rates was in the context of a hack to quickly mitigate the problem in one particularly critical simulation of interest.
I do think this issue is real and fundamental, and we should solve it in a robust way, for a wide range of rates.
Counterargument by sarcasm isn't that convincing either :)
Lots of bad things can conceivably happen (unreasonably large vectors being normalized for a recent example). We aren't obligated to catch them all, especially when doing so is difficult (and here we would have to invent a cutoff interval, justify it, and document it). If we don't do that, we can simply say that events that occur within, say, a fraction of a nanosecond of one another will be considered simultaneous, with justification being that we support robotics applications and not molecular ones. I don't have a strong objection to putting in some kind of bugcatcher but it doesn't seem necessary to me. We could wait until the problem occurs in practice before trying to fix it; I speculate that that will never happen.
Not sure what you mean by optimizing for 60 & 200 Hz. The proposed fix is agnostic to the frequencies. Can you elaborate?
Marginally related: https://github.com/RobotLocomotion/drake/issues/10497#issuecomment-833818637
It wasn't sarcasm. It was another form of irony: "hyperbole". :) That said, while I agree we can't put fingers in every hole in the dike, introducing this concept of false transitivity is poking a hole intentionally. It seems it should come with a bucket by design.
Optimizing for 60 Hz and 200 Hz
The flavor that is occupying my mind is adjacent to the concept of things being relatively prime. In your code example, you used a tolerance of 4.4e-16 scaled by how far away the value was from zero. It worked for frequencies 200 and 60. Could that be related to the common factors of 200 and 60? Does it get better if the common factors are larger? Smaller? Uglier? Or is the tolerance appropriate for all possible prime factors, and it's just the frequency at which the tolerance kicks in that changes? I'm asking because my intuition doesn't give me a confident answer.
I can imagine that the larger the smallest (largest?) common factor, the bigger the possible error between two otherwise "identical" values of t. But that "largeness" might already be consumed by the scale factor you have for computing relative error.
cc @edrumwri for thoughts
I'll remind you gents that I only check Drake's github notifications every few days. If you want a same-day response from me, ping me over Slack or email.
Sherm, your early analysis tracked exactly with my initial thoughts, and you covered one of my later thoughts too (relative vs. absolute tolerances). You guys have obviously spent quite a bit of time on the analysis at this point, so I don't have anything new to add except some philosophy.
I'd like to understand the Drake team's concern here: is it one of efficiency (the small step), one of hard-to-explain behavior (the sometimes-simultaneous fire), calculations that don't give an expected result (the simulation trace looks vastly different for small deltas in frequencies), or some combination?
I might think the most surprising thing here from a user perspective is that the frequencies are integral, so 200 Hz and 60 Hz might be expected to collide at integral intervals. Except that to see the least common multiple, you have to turn those into periods, at which point it's obvious that the 1/60 does not give a precise representation in binary (or base 10). At that point, "it's floating point arithmetic - duh!" So, should we expect the user to understand so little about floating point representations that we do something sophisticated to compensate? At least Sherm and I both seemed to be able to diagnose the problem immediately (and I expect that everyone else in on this discussion got it pretty quickly too).
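The "no precise representation in binary" point is easy to check from a Python prompt (`float.hex` shows the exact bits of a double):

```python
from fractions import Fraction

# 1/64 s is a power of two, so it is exact in binary floating point;
# 1/60 s is a non-terminating binary fraction and gets rounded.
print((1 / 64).hex())  # 0x1.0000000000000p-6 -- exact
print((1 / 60).hex())  # mantissa is rounded; not exactly 1/60
print(Fraction(1 / 64) == Fraction(1, 64))  # True
print(Fraction(1 / 60) == Fraction(1, 60))  # False
```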
If the primary motivation here is "solving" the principle-of-least-astonishment problem, I'm worried that the result will be a lot of code with a tricky solution that might be better addressed by documentation alone.
FYI, we recently hit this @joemasterjohn.
We do understand the problem now.
However, we were surprised by this, and in a diagram with lots of components it was a challenge for us to dig down and find which system was responsible for introducing a period slightly different from the one introduced by the plant.
I think I agree with Evan here; attempting to fix this might lead to even darker and much harder-to-test code.
However, it'd seem that in these cases the tiny steps are on the order of machine epsilon? Would it be worth at least emitting a one-time warning (per system/update?) when a machine-epsilon time step gets taken? That way the user can quickly find the systems with, say, periods 0.01 and 0.05, and then decide whether to change them or not.
I agree that making this easier to diagnose would be a nice benefit.
If the typical cause of this symptom is systems with periodic updates (rather than, e.g., witness functions), then we can compute from the Diagram itself up front whether there will be 1-ulp steps in its schedule, by taking an inventory of the set of periods and doing some math. We would offer that report as some helper function of a Diagram. It could specifically point out which system names and periods cause various problems. I like doing it statically and comprehensively across the entire update schedule, instead of waiting for runtime and just catching the first one.
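A sketch of what that static check might look like (a hypothetical helper, not an existing Drake API; it recovers the intended rational periods via `limit_denominator` and compares the exact coincidence schedule against the float one):

```python
import math
from fractions import Fraction

def report_ulp_steps(periods, horizon=10.0):
    """Hypothetical static check: for each pair of declared periods, find
    times where events should coincide exactly (in rational arithmetic)
    but the floating-point schedules differ by a tiny nonzero amount."""
    hits = []
    for i, pa in enumerate(periods):
        for pb in periods[i + 1:]:
            # Recover the rational periods the user presumably intended.
            ra = Fraction(pa).limit_denominator(10**9)
            rb = Fraction(pb).limit_denominator(10**9)
            # Exact coincidences occur at multiples of lcm(ra, rb).
            lcm = Fraction(math.lcm(ra.numerator, rb.numerator),
                           math.gcd(ra.denominator, rb.denominator))
            t = lcm
            while t <= horizon:
                na, nb = int(t / ra), int(t / rb)  # exact tick counts
                ta, tb = na * pa, nb * pb          # float schedules
                if ta != tb:
                    hits.append((pa, pb, float(t), abs(ta - tb)))
                t += lcm
    return hits
```

For the 200 Hz / 60 Hz pair over one second this reports the same mismatches found empirically above (coincidence indices 7, 14, and 19).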
I also wonder if a more helpful fix overall would be to offer an API to declare a periodic event using its frequency (i.e., 10.0 Hz or 100.0 Hz) instead of the period. I think we might be able to implement that in such a way that we avoid ulp-steps. Most of the problems here come from people using, say, 0.01 and assuming that means 100.0 Hz, when in fact it does no such thing.
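A small check of why a frequency-based API could help, assuming (my assumption, not a stated design) that the schedule would be computed as tick_count / frequency with a single correctly rounded division, rather than by multiplying or accumulating the rounded period:

```python
# Both sides compute the same real number i/20 with one correctly rounded
# IEEE division, so the results are bit-identical for every i (up to the
# point where the integer tick counts exceed 2**53).
f_controller, f_visualizer = 200.0, 60.0
assert all((10 * i) / f_controller == (3 * i) / f_visualizer
           for i in range(1000000))
print("no mismatches")
```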
I have a simulation that runs two periodic events, one at 200 Hz (the controller) and another at 60 Hz (the visualizer). These two events collide at 20 Hz.
With the RK3 integrator, I sometimes observe that the time step in the DoStep function is tiny, on the order of 1E-17. This happens when the two events collide (but not every time they collide). @sherm1 thinks this is because of roundoff error when computing the timing. To reproduce the error, add the following lines to simulator.cc
And run the anzu simulation with the following steps
The simulation will print out some message onto the screen, including the simulation time, simulation step and the triggered event.
@sherm1 @jwnimmer-tri