gazebosim / gz-sensors

Provides numerous sensor models designed to generate realistic data from simulation environments.
https://gazebosim.org
Apache License 2.0

Multi-threaded (or even multi-GPU) rendering? #81

Open peci1 opened 3 years ago

peci1 commented 3 years ago

It seems to me (with my limited insight into OGRE/ign-sensors integration) that all sensors get rendered sequentially:

https://github.com/ignitionrobotics/ign-sensors/blob/67dbabc980102b96b2b0b3424c52c86c646a0c2e/src/Manager.cc#L105-L108

Is that right? Could that be the reason why SubT simulator runs so slowly with multiple models, all GPUs almost at rest and 4 CPUs spinning like hell (even if run on a 40-core machine)? It's easy to see that spawning a single EXPLORER_X1 robot decreases real-time-factor to about 10-15 %. The model has 4 RGBD cameras and a 3D GPU lidar. A discussion on this topic is here: https://github.com/osrf/subt/issues/680.

Is there a way to parallelize? I guess this would have to be user-configurable somehow, because you can't just put all rendering tasks on the GPU at the same time... An environment variable could let the user say their GPU is performant enough to run, e.g., 4 rendering tasks in parallel?

Or would it even be possible to extend the parallelization to multiple GPUs? Until Ogre implements EGL rendering, that would mean running multiple X servers, or using VirtualGL (it supports EGL offload since version 3.0, which is currently in beta). I can imagine the user would pass a list of X servers, e.g. DISPLAY=:0,:1,:2 and sensor manager would uniformly distribute rendering tasks (or sensors) to the GPUs.

Or is there something substantial that would prevent any kind of parallelization (e.g. some scene locks)?

iche033 commented 3 years ago

About the idea of parallelizing camera rendering: I'm not sure OGRE supports this, since there is an order of operations that must happen in sequence in OGRE to prepare the scene for rendering. If we try to make concurrent rendering calls, we usually run into problems locking hardware buffers, and OGRE crashes. One workaround would be to distribute the work to multiple OGRE instances, each with its own GL context. But then there would be the cost of synchronizing data between processes.

But before diving into this, I think it would help to profile (e.g. using the ign profiler) and see where the bottlenecks are. On the other hand, I've noticed things like the RTF dropping by half when VRAM is full. We also found issues with lights (with a large range) causing a performance hit. So there are a few other places for performance improvement too.

peci1 commented 3 years ago

I ran the simulator with EXPLORER_X2_SENSOR_CONFIG_2 robot and enabled remotery profiling for ign-gazebo and ign-sensors. Unfortunately, it seems that ign-rendering has no profiling support (no IGN_PROFILE calls).

[Four Remotery Viewer screenshots, 2021-01-20]

At least the last screenshot was taken when all of the sensors (4 RGBD cameras + GPU lidar) were subscribed and forced to generate data.

Generally, there are 3 time-consuming parts:

  1. RenderingSensor::Render (this could in theory be parallelized)
  2. PostUpdate in simulator
  3. Sensors::PostUpdate

I guess 2 and 3 may well just be waiting until the render thread notifies that it has finished. I'm a bit unsure why the start of the PostUpdate timeslots can be later than the rendering, but both PostUpdate functions definitely seem to exit at the very same time the rendering finishes.

Do you have an idea how to profile the rendering part? I saw OGRE might have support for remotery too, so I'll try looking in this direction.

mjcarroll commented 3 years ago

The rendering operation itself is actually allowed to run in parallel to the simulation continuing. The only requirement is that the rendering must be complete before the next rendering call can start.

The first iteration in ign-gazebo had rendering sensors blocking all simulation, which was introducing a huge amount of latency. This is in contrast to gazebo9/10/11, which allowed sensors to run freely in parallel with the simulation, at a potential simulation accuracy hit.

In cases where all of the cameras run at the same frame rate and are synchronized, this provides some speedup by allowing the rendering to happen in parallel, at the cost of a small amount of sensor latency. If the rendering sensors run at different rates and start times, the benefit is lost.

All that being said, rendering should be profilable, if built with the flag enabled. It may not have as many profile points, though.

mjcarroll commented 3 years ago

And for clarity, "in parallel" here means the rendering thread works in parallel with the simulation thread. Each rendering sensor that needs a frame generated will run in series in the rendering thread.

Some of the time in post-update may be read back from the GPU and serialization for ign-transport as well.

peci1 commented 3 years ago

Okay, what you write about concurrency between the simulator and the rendering sounds reasonable. However, I was interested in parallelizing the rendering itself. Isn't there really a way to prepare the scene, put X cameras in it, and tell OGRE to render them all? It would seem weird to me if it couldn't do this efficiently somehow... But maybe I'm wrong...

mjcarroll commented 3 years ago

I understand, I was just trying to provide some background information.

I saw OGRE might have support for remotery too, so I'll try looking in this direction.

It does, but it has to be built with it. If you would like to start with our current packaged OGRE version: https://github.com/ignition-forks/ogre-2.1-release

Or is there something substantial that would prevent any kind of parallelization (e.g. come scene locks?)?

I don't believe so offhand. I know OGRE makes use of the Singleton pattern in a few places, so there is a chance that there is something lower than sensors/rendering that could cause an issue.

peci1 commented 3 years ago

I finally got a Remotery-enabled OGRE installation, and here are some data. This should be one sensor update with one EXPLORER_X2_SENSOR_CONFIG_2 (4 RGBD cameras + 1 GPU lidar):

[Remotery Viewer screenshot, 2021-01-25]

A great deal of time is spent in a function called "Forward Clustered Light Collect". That would correspond to your observation that adding lights slows things down. Unfortunately, SubT is very much about a lot of lights :(

What somewhat surprised me is that the light processing happens even for the GPU lidar. That shouldn't do any light-related work, should it?

Here are some more screenshots at various scales:

[Seven more Remotery Viewer screenshots at various scales, 2021-01-25]

peci1 commented 3 years ago

If anyone else wants to experiment, here is the profiler-enabled OGRE build for Ubuntu 18.04.5 amd64: libogre.tar.gz. Just unpack it, run sudo dpkg -i *.deb, and put libRemotery.so somewhere on your library path (or run Gazebo with an altered LD_LIBRARY_PATH). Beware that installing this DEB archive will break your package manager, as it appears to be an older version than the one already installed. To fix that (and get rid of the testing library), just run sudo apt install --reinstall libogre-2.1. The library is built from https://github.com/ignition-forks/ogre-2.1-release .

The remotery server starts up on port 1500.

peci1 commented 3 years ago

I found the GL extension GL_OVR_multiview. That seems applicable, but searching for it in the OGRE forums yielded no results :(

iche033 commented 3 years ago

Forward Clustered Light Collect

Interesting: we asked it to render a pass with a depth texture only, but it still does light culling and probably other light operations. Maybe that's an optimization that can be done in OGRE. We could also look into a way to disable lighting for that particular pass. The clear passes are also taking longer than expected; maybe we can get rid of that pass.

Isn't there really a way to prepare the scene, put X cameras in it, and say OGRE to render them all?

We're doing a few things to minimize scene updates: https://github.com/ignitionrobotics/ign-gazebo/blob/ign-gazebo4/src/systems/sensors/Sensors.cc#L250. But we still need to tell OGRE to render one camera after another.

gedalia commented 5 months ago

I recently hit this issue and I'm wondering if there have been any improvements. The performance degradation in our sim after upgrading to Ignition has been pretty profound. Before I head down a multi-threaded or multi-GPU path: the way OGRE is being used looks pretty simple, but that's probably not helping here. Given the performance I was seeing, I imagine the code was doing a GPU readback right after each camera render; if that is the case, there would be pretty significant penalties for multi-camera systems, since GPU rendering is always delayed relative to call submission.

I've been experimenting with another rendering engine, and I've set it up so that each camera has its own independent framebuffer target. I'm also doing readbacks with pixel buffer objects: by running all the camera rendering requests in series and then using PBOs to read back the results, the GPU has time to work while the CPU submits draw calls.

Replacing Remotery with Tracy, which includes visualization of GPU draw-call timing, might help with future optimization of this system.

longcheng010101 commented 3 months ago

Hey, are there any improvements on this? They would be greatly appreciated.