Simulation slows considerably as we add more RGB-D cameras

RussTedrake commented 5 years ago

Perhaps the most glaring performance issue in our current simulation pipelines is the significant degradation of performance as we add multiple RGB-D cameras (still via dev/SceneGraph).

Opening this issue now to begin the conversation. I believe sufficient evidence/benchmarks of this can be seen directly in the ManipulationStation example. @sammy-tri -- might you provide some numbers, and/or specific reproduction cases?

Possible related to #10383 (it's one of the reasons that I started thinking again about parallelization), but probably this is a more special case than that.

SeanCurtis-TRI commented 5 years ago

This is known (and regretted). I can think of a number of implementations that are not complete that directly contribute to it (although to what degree is still unknown).

I'm going to use the examples/scene_graph/dev:bouncing_ball_run_dynamics as a basis for measuring progress. PR #10389 is how I'm doing it (and it would love a quick dev-style review if anyone is up to it.) But using that as a basis reveals the following:

| images | fps | sim_time(s) | multiplier | note                                           |
|--------+-----+-------------+------------+------------------------------------------------|
| NONE   | n/a |       0.488 |          1 | Baseline                                       |
| C      | 0.1 |       0.929 |    1.90368 |                                                |
| C      |   1 |       1.166 |   2.389344 |                                                |
| C      |  10 |       1.585 |    3.24795 |                                                |
| C      | 100 |       5.488 |     11.246 |                                                |
| D      | 0.1 |        0.97 |      1.988 |                                                |
| D      |   1 |       0.915 |      1.875 |                                                |
| D      |  10 |       1.497 |      3.068 |                                                |
| D      | 100 |       5.685 |      11.65 |                                                |
| L      | 0.1 |       0.946 |      1.939 |                                                |
| L      |   1 |       1.053 |      2.158 |                                                |
| L      |  10 |        2.33 |       4.77 |                                                |
| L      | 100 |      13.271 |      27.29 | Label image are obviously most expensive       |
| CDL    | 0.1 |       1.126 |       2.31 |                                                |
| CDL    |   1 |       1.269 |        2.6 |                                                |
| CDL    |  10 |       3.367 |        6.9 |                                                |
| CDL    | 100 |      21.939 |      44.96 | The total time is *slightly* less than the sum |

That last row (where we render all three images at 100Hz) provides a multiplier of 45X. That's pretty egregious. While 100 Hz may be high, I suspect that even as it drops to 60, 30, 15, or even 10Hz, that scene complexity will more than make up for it.

RussTedrake commented 5 years ago

Is multiple outputs from a single camera really comparable to multiple separate cameras? I'm less worried (so far) about the relative cost of labels, etc. And more concerned that rendering two different views of the same scene should probably setup the e.g. vtk world only once?

SeanCurtis-TRI commented 5 years ago

Thanks for expressing that -- I'll make sure I add that variable to my evaluation.

That said, the only real overhead between separate cameras should be the update to the modelview and projection matrices. I would expect that whether the results are spread over three cameras or one camera, we'd see the same results. I'll let you know.

RussTedrake commented 5 years ago

Just to finish the thought -- the label images are actually not being requested in the pipelines we're trying to optimize first. I would not prioritize optimizing them. In fact, the most common/important use case currently is to pull on only one RGB image, and many depth images.

SeanCurtis-TRI commented 5 years ago

I suspected as much -- good to get the official word.

SeanCurtis-TRI commented 5 years ago

In my simple experiments, I'm not seeing a particular performance problem.

I put in 16 independent cameras, all rendering depth cameras. I did various experiments to gauge the cost of the various components of the rendering pipeline:

Total rendering pipeline: 100% (by definition)
VTK rendering: 36%
- This is all of VTK's internals which eventually invoke OpenGL stuff.
Overhead: 26%
- This includes dependency graph maintenance and various framework and scene graph costs that scale with the number of cameras.
Depth conversion on CPU: 23%
- The depth data is encoded in one way on the GPU and translated in the CPU.
LCM image processing: 9.4%
- THe process of converting the rendered images to LCM messages
Read back from GPU: 5.0%
- The cost of reading the image from the GPU into main memory.
Pose updates in render engine: .45%
- Evaluating the poses of all the moving objects and updating the render engine to those poses.
Camera update: 0.080%
- The cost of having multiple cameras (each with their own positions) connected to a single renderer.

Notes

Percentages are one thing -- actual performance is another. On my Puget, if I look at the number of renderings created vs the total time attributed to the full rendering impact, I get an effective 184 frames per second of data (or 5 ms per image).
- Component-wise: the VTK renderer is cranking through at something like 500 fps, so I don't think there's much we can/should do there.
If we lump the OpenGL/VTK rendering with the CPU conversion, that totals 59% of the computation. I fully believe that we can do a better job on the CPU to reduce or eliminate the CPU portion of that. That has the potential of bumping things up to 241 fps rendering throughput.
Some of the smaller portions (related to system framework) will improve when SceneGraph is made cache friendly. But that won't affect more than a quarter of the total cost leading me to predict no more than a 5-10% reduction in cost.
It is incredibly important to turn off the per-step publishing on the simulator (Simulator::set_publish_every_time_step()); this will kill performance.

Caveats

Nothing was textured.
The cameras, while possessing different extrinsic properties had the same intrinsic properties.

Both of those could (to varying degrees) increase the rendering cost. If you feel your problems are arising in a significantly different environment, I can take a look at it particularly.

sherm1 commented 5 years ago

It is incredibly important to turn off the per-step publishing on the simulator (Simulator::set_publish_every_time_step()); this will kill performance.

That's fascinating! Let's discuss f2f (cc @edrumwri). That might be due to a DoPublish() override that could be replaced by a more-precise Event+callback specification.

SeanCurtis-TRI commented 5 years ago

To quantify that particular topic, I did another test. For the toy simulation I've been running:

No cameras (publish per step = True): 4.5 s With 16 cameras (publish per step = False): 11 s With 16 cameras (publish per step = True): 78m51s

This is with the cameras publishing at 10Hz and an RK3 integrator with a maximum time step of 2 ms.

sherm1 commented 5 years ago

Changing the default for Simulator per-step publish from true to false would produce a 400X speedup here in the default case! See #10428.

tri-ltyyu commented 5 years ago

The set_publish_every_time_step flag is set to false in the Anzu sims that we ran. Do you want to try running the clutter example where we're seeing the performance issue? Courtesy of Siyuan, you can run this command after build. For comparison without camera, you can remove '--publish rgbd ./run //apps/simulations:clutter_simulation --station_name=prawn --publish_rgbd

SeanCurtis-TRI commented 5 years ago

Thanks. I'll take a look. But it will have to wait. I'm not updated to Bionic yet. That should happen next week.

tri-ltyyu commented 5 years ago

Simulation of Clutter with RGB-D camera has gone from 2x RT to 0.3x RT.

SeanCurtis-TRI commented 5 years ago

tl;dr I've played with the clutter demo and seen similar issues. I'm sufficiently convinced that things can be better that I'm going to spend some time doing direct OpenGL rendering (cutting VTK out of the loop).

Results of clutter simulation performance analysis

In my investigation, I decided to add some high-resolution profiling instrumentation to my render pipeline. Previous experiments used callgrind to assess relative cost. Now I have actual measurements to nanosecond precision.

Here's some preliminary results (on the clutter simulation scenario):

Camera Count	Sim Time (s)	Depth (s)	Color (s)	Render Time (s)	Render Overhead (s)	Render FPS	Effective FPS
0	3.16	0	0	0	0	0	0
1	7.87	1.98	1.76	3.73	0.98	161	128
2	11.7	3.63	3.12	6.75	1.82	178	140
4	19.2	6.98	5.52	12.5	3.50	193	150
8	36.3	14.9	11.1	26.0	7.15	185	145

Notes

Column explanation:
1. Camera Count - the number of cameras added to the scenario (zero cameras --> zero render-related infrastructure).
2. Sim Time - the time, in seconds, to evaluate StepTo(10); This ignores all initialization of the scenario, context creation, etc.
3. Depth - the time, in seconds, to invoke QueryObject::RenderDepthImage32F().
4. Color - the time, in seconds, to invoke QueryObject::RenderColorImage().
5. Render time - sum of Depth + Color.
6. Render Overhead - this is the overhead of having rendering -- this is system overhead (earlier indications suggest this may be cache overhead).
7. Render FPS - a measure of the rendering throughput based on the total number of images rendered and the total Render time. This represents the peak framerate if the overhead could be 100% eliminated.
8. Effective FPS - a measure of the rendering throughput including the overhead.
The values in the columns are the average of ten runs for each number of cameras.
The total simulation time for each row should always be the sum of: the render time, the render overhead, and the sim time with zero cameras (this assumes that the work for simulation that isn't rendering related is fixed w.r.t. rendering).

Observations

The effective frame rate is the dominant factor in the slowdown. We've asked the cameras to render two images each at 30 fps. So, 8 cameras are requesting 480 renderings per simulation second. At a generous 150 fps, that's more than 3:1 ratio between compute time and simulate time (in contrast to 1:3 ratio for dynamics).
I am highly suspicious of the Render FPS. I fully believe that we should be able to do a lot better than that as a baseline. In an informal discussion with the driving team, they report frame rates 4-5X higher than this. They are living much closer to OpenGL than we are with VTK. There's possibility that we're paying a cost for VTK's generality that is not justified for an otherwise trivial OpenGL renderer.
1. This same level of performance were observed in the toy bouncing ball scenario. I'm going to do a simple test to assess VTK's cost -- I'm going to implement a close-to-OpenGL renderer that can do the bouncing ball demo in drake and see what performance that gets. I suspect we should be able to get much higher frame rates.
The overhead render time is growing linearly with the number of cameras. So, at least there's no accidental explosion in infrastructure.

SeanCurtis-TRI commented 5 years ago

For all interested players -- this is your chance to declare, "No! Don't spend time writing a simple OpenGL renderer!"

jwnimmer-tri commented 5 years ago

My rough view of current priorities is that we care only about performance of the depth simulation. If the clutter benchmark has the color channels enabled, we should probably revise it to disable them.

SeanCurtis-TRI commented 5 years ago

That'll certainly reduce the cost. :) Won't quite double performance, but it'll bump it by a good ~75%.

avalenzu commented 5 years ago

I concur with @jwnimmer-tri.

avalenzu commented 5 years ago

@siyuanfeng-tri can you comment this?

tri-ltyyu commented 5 years ago

Working on OpenGL renderer to see if performance improves. We can turn off RGB in manipulation station to see if that helps.

SeanCurtis-TRI commented 5 years ago

I've got a working close-to-OpenGL renderer working sufficiently to determine the partial cost of VTK's overhead.

Results

I ran the clutter simulation in several variations (for 10 seconds of simulation). The time results are the average result of three iterations:

Variant	Camera Count	Sim Time (s)	Realtime Factor	Render Time (s)
baseline	0	3.62	2.76	0.00
Depth + color	8	37.76	0.26	26.01
Depth	8	21.35	0.47	13.28
Fast Depth	8	14.28	0.70	6.38

Column explanation:

Variant: perturbation of rendering configuration
Camera Count: number of cameras
Sim Time: the time to actually simulate 10 s of simulation (i.e., time to invoke StepTo(10)).
Render Time: the total time spent in calling QueryObject::Render*Image()

Discussion

As predicted, just turning off the color made things run faster (0.26X to 0.47X). Check.
The close-to-OpenGL renderer is faster. It's essentially twice as fast (180 fps vs 380 fps).
85% of the fast depth render time is spent reading data back from the GPU. (It's impossible to explicitly quantify this result in the VTK-based renderer.)
The rendering (including read back from GPU) accounts for less than half the cost of the rendering in this scenario. Previous forays into examining the cost of cameras suggest that this is framework/cache related.

What's next?

It seems "finalizing" the fast camera makes sense. This can take a couple of slices
- Finish of the depth camera so it fully supports everything the full api for depth only.
- Add the rgb and label images to the renderer so that it's a complete API.
We need to examine what the system overhead is; 55% of the cost of having cameras isn't rendering.
Explore means for pipelining rendering/readbacks. I don't know how much faster things can get if we get smarter about reading back the rendered images, but that's clearly the vast majority of the cost.
- Option 1 (far off distance): add parallel execution of independent cameras. Modifications to the system framework.
- Option 2 (more accessible): Add a "multi-camera" system. Essentially, this would be a single system whose calc method is parallelized to render the cameras in parallel. Calc'ing one image calc's them all (but after that they're in the cache). This would only require changes to SceneGraph (to support simultaneous rendering) and the actual MultiCamera system.

I'm lobbing this back into the shareholder's court to determine what path forward they'd like to take.

siyuanfeng-tri commented 5 years ago

This is pretty awesome! I am more concerned about the time lost to piping / system overhead / whatever it is. Even as is, I think 0.7rt is pretty tolerable for dev without the parallel execution addon.

loping in @calderpg-tri @thduynguyen as well.

SeanCurtis-TRI commented 5 years ago

I agree -- tackling the infrastructure cost is the next optimization pain point. But first, I'll need to finish off the prototype to make it available.

jwnimmer-tri commented 5 years ago

... results are the average result of three iterations.

Just a small nitpick: when measuring the effect of changes of code or configs on performance, in general we should use the best result across several trials, not the mean. The best result reflects what the code is capable of. To the extent other trials are slower, it means your computer did something wrong, not that the code got slower.

jwnimmer-tri commented 5 years ago

I would be interested to see a plot for the Depth and Fast Depth series for Sim Time and Render Time (stackbar) where num_camera is swept from 0 to 20 by 1-camera increments. Knowing how the line trends would probably help us choose which facets we care about, and a path forward. It would also help if you could publish a benchmarking branch (even if just for Depth) so others could confirm your results and/or help quantify the bottlenecks.

jwnimmer-tri commented 5 years ago

It seems "finalizing" the fast camera makes sense.

Have you de-risked this technology with a macOS unit test yet?

SeanCurtis-TRI commented 5 years ago

To address your issues:

Mean vs Min: in this case the standard deviation is so laughably small, I don't think we've lost any information here.
I've done the data for cameras: 0, 1, 2, 4, and 8. The cost grows linearly with the number of cameras.
Nope. Not yet. And that's a good call. Part of the "finalizing" would've been to add the the depth-subset of standard unit tests we have for the renderers. I can add those and then push it into CI just as a reality check.

jwnimmer-tri commented 5 years ago

(On 1, that's not surprising, but all of the Q1 "improve performance" benchmarks that people are publishing are averages, and it's just bothering me. The statistics are not meaningful, unless we are randomizing something meaningful across trials. I'd be happier if we did it right, even if in this experiment it doesn't end up being different.)

Another thing to keep in mind: 8-12 cameras with a global, sync'd shutter are not exactly the physically realistic model, though possibly a reasonable compromise for performance for now. But in reality, each camera in the scene has its own timer. Also, the time from a camera shutter exposure to the LCM image message actually being received by the controller for processing is non-zero, and I wonder if your current benchmark is accounting for that -- i.e., do you have a delay line (or equivalent) between the camera snapshot and the code that uses it (probably, the LCM publisher)? Another way to achieve better image throughput might be to have a timed event on a (single) camera system kick off the render, but don't publish the result of that render until simulation time has advanced by some N micro- or milliseconds. If in reality if we have an, e.g., 2ms delay line for shutter-to-transmit, then your current 5ms wallclock render delay with the VTK camera is cut down to just 1ms wallclock (5ms wallclock - 2ms simclock / 50% realtime vs simtime rate). That leaves the door open to having desync'd cameras as part of a simulation, is more physically meaningful, and isn't that far away from reaching realtime if we can then nudge the single-camera rendering delay down just a little bit more.

SeanCurtis-TRI commented 5 years ago

There's a lot to unpack there.

(On 1, while I'm perfectly willing to report best observed times as a measure of performance improvement (assuming consumers of such report interpret it as such), it's difficult to see that one is any "righter" than the other. They both require context to understand the number.)

I read and pondered your proposal multiple times. Ultimately, I came to the following conclusion: I believe you're suggesting we exploit the real world latency to the simulation's advantage. Start the camera render some small delta before the consumer would pull it -- this models the fact that the image one gets and the state of the world has some latency and gives us some time, computationally, to "pre-render" in parallel. Do, I have that right? In other words, this targets reducing the latency between requesting an image and receiving an image (and happens to also model a real-world phenomenon).

Everything that follows assumes I've understood you. So, if I'm heading in the wrong direction, you can ignore the rest in favor of correcting my misapprehension.

As a proposal for reducing the latency of getting an image, this seems reasonable. However, an end user should be able to model an idealized system and model zero latency.
I'm slightly dissatisfied with conflating modeling and performance. Modeling an idealized system (using the proposal) will throw out any performance improvements. That seems a shame. I recognize that optimizing an idealized system may, in fact, require a different mechanism. I'd like to think we'll have a plan for that as well.
Now, thinking aloud, this would be a camera system with a pair of periodic discrete update events, right? One that kicks off the render (storing a std::future-type of thing in its discrete state) and the second that updates its discrete image that gets reported dumped to the output port, yes? And the camera latency is a parameter of that system that is the phase shift between the two periods.

Also, tangential to my comprehension of your proposal, it seems there's also an implied preference for keeping the VTK-based renderer, is that correct? Intentional?

sherm1 commented 5 years ago

FWIW I think it is not a good idea in Drake to attempt to exploit real-world delays to reduce runtime latency. That requires that the simulated system is intentionally different from the physical system and requires an awkward mix of simulated and wallclock times. Drake should stick to modeling the real system, including the real delays, but always operating in virtual time. The "exploit real delays to reduce simulation latency" trick is great for networked games (I've done it for a virtual world I used to work on) but IMO is not a good match to a high fidelity engineering simulation.

(Reading the exchange above I'm not sure Jeremy was suggesting that.)

SeanCurtis-TRI commented 5 years ago

Let me clarify what I meant.

We're not exploiting real delays. We're exploiting the fact that we want to model things with real delays and make sure we model that in a computationally advantageous way.

jwnimmer-tri commented 5 years ago

(On 1, while I'm perfectly willing to report best observed times as a measure of performance improvement (assuming consumers of such report interpret it as such), it's difficult to see that one is any "righter" than the other. They both require context to understand the number.)

Reporting a statistic (mean) is more wrong, because the experiment is not designed to be a statistical measure. It misleads readers into thinking the variance (or percentiles, or error bars) are important, but really they are only things like "Google Chrome decided to steal your CPU to show some ads", not useful statistics.

I agree the reader has to understand the experiment anyway in the first place, but there's absolutely no reason to make it more confusing by using meaningless statistics.

If we wanted to randomize object poses and take an expectation of runtime across varied poses, then I'd want mean (and error bars), but for re-running the same identical simulation multiple times, there are no statistics.

I believe you're suggesting we exploit the real world latency to the simulation's advantage... Do, I have that right?

Yes. And to Sherm's point, we would (nominally) limit the delay line to match the actual sensor's delay (which includes firmware processing time, encoding, USB transmission, ethernet transmission, kernel wake-ups, receiver decoding, etc). I'm not sure if we know yet if that's 100us or 5ms, but we could figure it out and exploit it.

However, an end user should be able to model an idealized system and model zero latency.

Agreed. And for now, perhaps that means maybe their realtime factor struggles.

I recognize that optimizing an idealized system may, in fact, require a different mechanism. I'd like to think we'll have a plan for that as well.

Fair enough. My post was more about making sure we're asking the right questions and considering all solutions, than to limit us to only using the delay line trick.

This would be a camera system with a pair of periodic discrete update events, right? One that kicks off the render (storing a std::future-type of thing in its discrete state) and the second that updates its discrete image that gets reported dumped to the output port, yes? And the camera latency is a parameter of that system that is the phase shift between the two periods.

That would be one way. At a minimum though you only need the first event. If you store your state as two <image-future, time-the-future-should-appear-on-output> pairs (of which only one is updated each event trigger), then CalcOutput can decide which of the two images in the state is the correct one to emit. Probably using a second event for the "delay has ended" event (instead of having Calc depend on time) would better exploit the cache, though.

Also, tangential to my comprehension of your proposal, it seems there's also an implied preference for keeping the VTK-based renderer, is that correct? Intentional?

My intention is to push us to focus on the two things that matter the most in the medium term: pipelining and multicore. We can get 2x improvement by rewriting the render form VTK to GL, but we can get 10x or 100x by correctly staging and distributing the compute -- maybe even use a CPU-based raycast, etc. Staging applies equally well to VTK, GL, or a user's custom renderer in Anzu.

The idea that troubles me the most is "single System with multiple camera ports". That's a way to get a weak form of pipelining (by batching the renders to the GPU), but I'm not sure it will deliver the lasting value we want. At some point, your CPU is still at zero while we wait for the GPU. Maybe if the multi-camera System is easy though, it can be a cheap near-term band-aid.

I think adding a GL render is fine. I think it's just going to be a lot of work to get it to a point where we trust it and it works on all three OS's we support with all the kinds of GPUs that are scattered about, even for just TRI's own computers.

And at some point we will want RGB again, and for it to have lighting, etc. You'd know better than I, but I suspect that path via OpenGL will not be fun, and slightly easier in VTK? (Or at least, easier to outsource?)

jwnimmer-tri commented 5 years ago

... two things that matter the most in the medium term: pipelining and multicore.

(Well, hopefully it goes without saying, but correctness as usual is still the most important. I'm kind of taking that a as given at this point.)

SeanCurtis-TRI commented 5 years ago

I agree with all your points that matter for me to agree with. Some remaining thoughts:

Finishing off an low-level OpenGL renderer to replace VTK is not difficult. It's not a game engine. It's geometry with phong lighting. A very tractable piece of work. My gut says the VTK overhead justifies finishing it.
One of the big questions for me inherent in this issue is "how much improvement do we want in what time frame?" My work/comments here naively assumed a quick turnaround was required (combined with some real dissatisfaction in the observed frame rates). And there seem to be multiple OKRs very directly phrased as such (although mine is merely a derivative of what I believe exists out there already). If we have more time and want to explore alternatives, I'm completely copacetic. I consider you guys the customer and my time the resource -- you decide how you want to spend it. So, we need to have a concrete discussion on what affect we want to achieve by what time and what we think it will cost.

jwnimmer-tri commented 5 years ago

(2) My 2c is that with some urgency we should finish the minimum numbers requested (8-12 depth cameras at 0.5x realtime). "Turn off color" get us to 8 at 0.5x. It sounds like the GL renderer is the best way to get up to 12 at 0.5x. Once that's ready, I would say we focus on features and usability instead of performance, so more like #10675, geometry queries, removing false SceneGraph state, or whatever else is a priority.

tri-ltyyu commented 5 years ago

Based on meeting, we determined we would like to finish low level OpenGL renderer work to get the required performance.

SeanCurtis-TRI commented 5 years ago

Update: The PR didn't go in today. I hit some subtleties for making multiple instances play nicely together. However, the PR should go in early next week.

SeanCurtis-TRI commented 5 years ago

This hit a speed bump: it has proven nigh on impossible to provide mac support for the OpenGL renderer.

The path forward is:

Modify dev/SceneGraph to reflect the future API in which renderers are externally instantiated and passed to SceneGraph. (This is about 90% done and will be PRd imminently).
- @EricCousineau-TRI this should make your godot efforts much simpler.
Move the OpenGL renderer into an external repo to be exercised in domains where mac support isn't necessary.

RussTedrake commented 5 years ago

I thought the plan was to fallback to vtk on mac and keep opengl rendered in drake for linux?

SeanCurtis-TRI commented 5 years ago

True; but the OpenGL renderer can't even build under Mac. So, rather than put contortions into the build system and #ifdefs all over the code to account for mac, @jwnimmer-tri felt that this direction was going to be saner.

On the plus side, it allows me to leak the actual API that'll land in mainstream geometry so @EricCousineau-TRI can work against that instead.

jwnimmer-tri commented 5 years ago

I think that once we have unforked SG and once we have baked the optimized GL depth renderer in Anzu for a while until it's stabilized, then we could look at moving it to Drake master as an Ubuntu-only module.

EricCousineau-TRI commented 5 years ago

On the plus side, it allows me to leak the actual API that'll land in mainstream geometry so @EricCousineau-TRI can work against that instead.

Yes, please!

I think that once we have unforked SG [...]

+1 to this as well. I'm not a huge fan of conditional compilation, but we'll have to do it for ROS1 ~/ ROS2 support at some point~ if ROS2 does not serve our goals in Anzu.

tri-ltyyu commented 5 years ago

All pieces to support faster simulation in clutter are available. Letting manipulation team who needs performance to update. Closing this ticket now and manipulation team/other users can re-open if simulation speed is not as expected. @calderpg-tri and @avalenzu

RobotLocomotion / drake