RobotLocomotion / drake

Model-based design and verification for robotics.
https://drake.mit.edu
Other
3.17k stars 1.24k forks source link

Reduce the noise in meshcat visualizer realtime rate reports #21534

Open SeanCurtis-TRI opened 3 weeks ago

SeanCurtis-TRI commented 3 weeks ago

Adds a new API Meshcat related to realtime rate. Rather than having clients compute the realtime rate themselves (and immediately broadcasting it to all clients), clients simply report to Meshcat that sim time has advanced.

Meshcat uses the report of time advancement to throttle the report of real time rate (and computes the real time rate itself based on the amount the simulation has advanced since the last broadcast message).

Also changed the display of the realtime rate -- rather than streaming values as fast as it can, it only changes when it receives a message doing a better job of giving the user feedback on how frequently messages are coming through.

Uses the new API to enable meldis RTR simply.

Resolves #21512 Resolves #17661


This change is Reviewable

SeanCurtis-TRI commented 3 weeks ago

Note to reviewer: You can see the effect of this by running hardware_sim, e.g.,

bazel run //examples/hardware_sim:demo_cc

It launches its own Meshcat server so you can view it directly (with the default parameter).

If you launch meldis, you can confirm the same effect in meldis.

bazel run //tools:meldis

Meldis also offers a simple path towards playing with the publish period. Create a file named, for example, rate.yaml. Its contents are simply:

realtime_rate_period: 0.01

Now run meldis as:

bazel run //tools:meldis -- --meshcat-params=./rate.yaml

and you can see the real time rate graph update faster and more noisily. Make it a large number and you'll see it slowly fill up.

jwnimmer-tri commented 3 weeks ago

It's very clear to me that:

(1) Having MeshcatVisualizer compute RTR is absolutely wrong and should be burned with fire. (2) Instead, Meshcat should offer a non-const time-update function that accepts double current_sim_time (however we spell it).

The only undecided question is what math or smoothing we do inside meshcat.cc in response to the new call to (2).

My hope for exploring the new math would be to try out various math policies (like this one) against real world simulations. I can do that by implementing the "centralized the math" refactoring in a local WIP and then trying more math WIPs after that, but I thought it might accelerate both of us to get the "centralize the math" onto master so we can both play with new ideas easily, with just a couple more surgical changes. It would also get the new deprecation into the next release coming up in a couple days.

Not a big deal either way.

SeanCurtis-TRI commented 3 weeks ago

Before I delve into the proposed semantics (see below), there's some context in your mind I'd like to get a view on. How would you characterize the purpose of the stats window? Should it answer some specific questions? Should it give intuitive insight? I can imagine any number of purposes for such a chart, and each purpose would probably suggest changes to what is shown and how it is shown. So, I'd like to get a sense of how you're implicitly defining a successful chart.

And now some thoughts on details

how much wall time should the chart represent, overall, on its x-axis

My primary concern here will repeat below. Render performance. We can't update any faster than the animate loop runs. We have to make assumptions about its speed (60 fps if healthy) to achieve a wall-clock-base criterion. If our fps slows down, then the goal won't be met. We could attempt to accommodate for things on the javascript side by adding multiple vertical bars if the elapsed wall clock time has exceeded our target time. But that puts timing games on both the client and server side.

the area under the curve should equal how much sim time occurred during that window.

This is problematic for a few reasons, as I see it:

One choice is to not care about bad render performance; chalk it up as an extrinsic problem that needs to be fixed before the user can use the visualizer reliably.

SeanCurtis-TRI commented 2 weeks ago
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
> We can't update any faster than the animate loop runs. Good point, I hadn't considered this. However, is there any reason we need to have the call to `realtimeRatePanel.update` as part of the animate loop? It seems like we could call it from the message handler so that 1 message = 1 pixel, no matter what the animation is doing. In fact I tried that and it's already not awful, as best I can tell: ```patch --- a/examples/hardware_sim/example_scenarios.yaml +++ b/examples/hardware_sim/example_scenarios.yaml @@ -75,3 +75,6 @@ Demo: lcm_bus: driver_traffic wsg: !SchunkWsgDriver lcm_bus: driver_traffic + visualization: + publish_proximity: false + publish_inertia: false diff --git a/geometry/meshcat.html b/geometry/meshcat.html index a34fd2555e..2ac1a07334 100644 --- a/geometry/meshcat.html +++ b/geometry/meshcat.html @@ -27,7 +27,6 @@ // We want to show the realtime rate panel by default // it is the last element in the stats.dom.children list stats.showPanel(stats.dom.children.length - 1) - var latestRealtimeRate = 0; var viewer = new MeshCat.Viewer(document.getElementById("meshcat-pane")); viewer.animate = function() { viewer.animator.update(); @@ -83,8 +82,6 @@ function animate() { stats.begin(); - // convert realtime rate to percentage so it is easier to read - realtimeRatePanel.update(latestRealtimeRate*100, 100); viewer.animate() stats.end(); if (gamepads_supported) { @@ -98,7 +95,8 @@ function handle_message(ws_message) { let decoded = viewer.decode(ws_message); if (decoded.type == "realtime_rate") { - latestRealtimeRate = decoded.rate; + // Convert realtime rate to percentage so it is easier to read. + realtimeRatePanel.update(decoded.rate*100, 100); } else if (decoded.type == "show_realtime_rate") { stats.dom.style.display = decoded.show ? "block" : "none"; } else { ``` Now basically the x-axis is calibrated to sim time (instead of a wall timer), and the y-axis still shows RTR. So it scrolls slower as the sim slows down. Possibly this is the correct UX? > Should it give intuitive insight? Basically yes. The immediate problem on master is that the chart is so jumpy and fast as to be totally unintelligible and nearly useless. So we at least need to fix that much, but after that I'd say that yes the chart should support an intuition about performance, i.e., it's axes and coloring have some underlying meaning or scale, whatever one of those we choose. So e.g. if we look at a user's screenshot of their Meshcat with its RTR chart, we have some idea about their rough performance profile. > ... would need to be spread over multiple columns (for the integral to work out). That seems problematic at the best of times. Any kind of moving-window filter or exponential smoothing filter has the same kind of effect, and people deal with those daily just fine. Also the statement isn't true in the first place -- we could push out a 0% RTR message when the sim slowed down beyond the chart's wall-time sampling period, instead of smearing that slower sim step across two wall samples. > The vertical axis in this chart would not be realtime rate, but sim time. I didn't mean to imply changing the axis at all. The y-axis is still RTR, both printed as text and in the vertical scale. (The idea of "area under curve = seconds" had an implied pixels-per-second scale factor or whatever to make the units work.) > ... than a single message between animate loop calls ... I think my patch above avoids that trap.

Multiple thoughts:

  1. Well, it's clear to me that you hadn't intended the vertical axis be sim time and the integral of the pixels not total sim. So, my concerns vis a vis that feature can happily be ignored.
  2. The proposal to pull the rtr chart update out of animate.
    • I was wondering if it would give weirdness as you look across the various charts.
      • looking at stats.js, its clear that the FPS and memory charts already update at different rates from the ms chart. So, having our rtr chart have a different x-axis is not a problem.

I've written stuff up in a google doc. Perhaps we can use that to iterate as our design document. There were some things I wanted to communicate for which this felt to be a dissatisfying medium.

SeanCurtis-TRI commented 2 weeks ago

the next question is to choose which UX we want for the x-axis.

If we're going to have the conversation here, can we specifically list the UX options? I'd like to make sure we've got the same idea in mind. It will go a long way toward informing the probing.

SeanCurtis-TRI commented 2 weeks ago

(1) RTR, given as a percentage. Yep. Sounds good.

(1b) Maximum value > 100%? I suspect there would be little value. I can't imagine someone attempting to visualize a simulation where they haven't already throttled the simulation to realtime. Certainly, for anything like teleoperation, I can't image they would do so. If it happens to run faster than realtime rate, the textual representation would still draw arbitrarily large numbers, so even if the bars saturate, the information isn't thrown out (and I believe it also shows the range of values that have been drawn by the chart). Finally, we don't currently label the y-axis. I can imagine sliding the top to 150% because we happened to have a single fast frame would easily lead to a mistaken interpretation of all of the other values.

(2) I'm still wrestling with this. In the write up, I distinguished between the pixel period and the report period (in order to capture the behavior I understood you to be describing). The first revision of this PR doesn't have that distinction. I'd still argue there's value in that approach. I suspect that discretely sampling a discrete sampling of time leads to more cost than benefit, but I'm open to being persuaded.

the same no matter if I have 1 MeshcatVisualizer or 10 MeshcatVisualizers in the scene. Agreed. Should we also have the expectation that two clients connected to the Meshcat should have the same chart (assuming they've both flushed their message queue)? Or can there be client-to-client variance?

I think a face to face would be best.

SeanCurtis-TRI commented 2 weeks ago

the two charts should not be meaningfully distinguishable (something like a 1-vertical-pixel roundoff error is probably fine).

Well, that seems to push the majority of the responsibility onto meshcat. I really don't like Meshcat being responsible for knowing the pixel layout of the webpage. How about if Meshcat sends a message to the browser, informing of it of its "pixel period" and the webpage determines the time interval of the chart and prints it. If we do that, then it would be sufficient for Meshcat to publish at only its report period.

I'm going to try something like that....

SeanCurtis-TRI commented 2 weeks ago

is x-axis as sim time a good UX?

I think it should be preferred.

If so, then we rm about 70% of this PR, add two lines of JS, and we're all done.

Yes and no. Your answer suggests something that seems concrete in your mind but isn't in mine.

Whether the pixel width represents sim time or wall time, you still have to know how much of the quantity it represents. It would be good to know that without recourse to a config file (which may have changed) or a transient command line or an obscure line somewhere in documentation referencing a hard-coded value that is not part of the stable API. I like the idea of understanding the scale of the x-axis easily.

So, how does a user know the scale of the x-axis? We could make Meshcat promise to publish a message at one sim-time period. Each pixel has a value equal to that period. However,Meshcat::SetSimulationTime() may be called at a different period, we still have to reconcile that aliasing.

Furthermore, even if Meshcat were to promise publishing regular values at some some fixed sim time period (independent of the period over which it computes realtime rate), to understand the scale of the chart's x-axis still requires knowing the number of pixels and I hate Meshcat having that knowledge.

SeanCurtis-TRI commented 2 weeks ago
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
> We could make Meshcat promise to publish a message at one sim-time period. Yes. The C++ => JS message period (measured in units of sim time) is a property of `meshcat.cc`, not anything else, and should be consistent across a whole simulation run. > ... to understand the scale of the chart's x-axis still requires knowing the number of pixels and I hate `Meshcat` having that knowledge. Right, so in this case it's 10 lines of JS instead of 3. The RTR message from C++ to JS would contain the message rate (period) as an extra field, and the JS side multiplies that by the number of pixels in order to put a textual label on the X axis explaining the scale. That would not necessarily need to be MVP, though. > However,`Meshcat::SetSimulationTime()` may be called at a different period, we still have to reconcile that aliasing. Probably so, though I think it's a lot easier than wall clock time.

So, if I can summarize on the things i think we agree about:

  1. The y-axis is RTR.
  2. The x-axis is simulation time.
    • Each column in the chart spans a fixed amount of simulation time. That value of that span (and, therefore the duration of the time spanned by the chart) may not be obvious in the MVP, but, behaviorally, it has a value and someone who knows the code could tell you what it is.
  3. There is a single Meshcat parameter related to this feature: the report period (amount of elapsed simulation time between RTR messages). The duration of a chart column depends on this value (and the number of pixels).
  4. Two clients that have received and processed the same RTR messages should have the same number of columns.
SeanCurtis-TRI commented 2 days ago
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
I wrote SleepSystem into the hardware_sim example and played things under various conditions. I have to say, I still really like x-axis as wall time. When x-axis is sim time and the sim bogs down, I find myself starving for data in the chart. Also when the simulator timing gets lumpy (i.e., unevenly slow) the wall-time chart reflects that irregularity more obviously (it's ragged), but the sim-time chart is still smooth (since in effect, it has fewer samples per bog-down to draw, so the ragged timing gets blended out). Do you have opinions on which axis choice is better?

I don't have any opinions. I can make arguments on both sides, but I've never really had a work flow that depended on the chart, so I don't have a visceral sense of what problem either solves. Having explored the options, I'm willing to stipulate that may feeling should be granted the least amount of weight.