cyberbotics / webots

Webots Robot Simulator
https://cyberbotics.com
Apache License 2.0

Huge Performance hit when using multiple cameras #2980

Open Simon-Steinmann opened 3 years ago

Simon-Steinmann commented 3 years ago

Describe the Bug

Enabling and retrieving multiple visual sensors (cameras, lidars) drastically reduces Webots' realtime factor. My GPU ran at 50% usage with 2/8 GB of memory used. This huge performance impact drastically reduces Webots' competitiveness for projects relying on multiple such sensors (SLAM, drones, autonomous vehicles), which is one of the major use cases for robot simulators these days.

Steps to Reproduce

  1. Open this world hall_lidar_camera_2.zip
  2. Open the robot window of the Robot at the top of the scene tree
  3. Enable all cameras and lidars
  4. Observe that the RTF significantly decreases.

omichel commented 3 years ago

Enabling cameras and lidars from the robot window has a performance impact which was addressed in Webots R2021b. Can you try a nightly build of Webots R2021b and let me know if you observe better performance? After enabling the cameras/lidars, you should also close the robot window for better performance.
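
For reference, the sensors can also be enabled directly from a controller, which avoids the robot window entirely. A minimal C++ sketch using the standard Webots controller API (enabling at the world's basic time step is just one choice):

```cpp
// Enable every camera and lidar from a controller instead of the robot window.
#include <webots/Camera.hpp>
#include <webots/Device.hpp>
#include <webots/Lidar.hpp>
#include <webots/Node.hpp>
#include <webots/Robot.hpp>

using namespace webots;

int main() {
  Robot robot;
  const int timeStep = (int)robot.getBasicTimeStep();

  // Walk over all devices and enable the visual sensors at the basic time step.
  for (int i = 0; i < robot.getNumberOfDevices(); ++i) {
    Device *device = robot.getDeviceByIndex(i);
    if (device->getNodeType() == Node::CAMERA)
      static_cast<Camera *>(device)->enable(timeStep);
    else if (device->getNodeType() == Node::LIDAR)
      static_cast<Lidar *>(device)->enable(timeStep);
  }

  while (robot.step(timeStep) != -1) {
    // Sensor images / point clouds now refresh every step; read them here.
  }
  return 0;
}
```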

nrotella commented 3 years ago

> Enabling cameras and lidars from the robot window has a performance impact which was addressed in Webots R2021b. Can you try a nightly build of Webots R2021b and let me know if you observe better performance? After enabling the cameras/lidars, you should also close the robot window for better performance.

Thank you for the tip. I was chatting with Simon last week about this and he was kind enough to open the issue based on the problems we faced. I did try the nightly build and it does appear to improve the performance issues mentioned above; however, for the full set of sensors I want to simulate (3 VLP16s and 10 cameras at 800x800), it struggled a bit to run in real time even with the robot window closed, and the RTF would randomly drop to 0.4. With just a few cameras (5 or fewer) it seems to work OK, so I think the nightly build did fix this particular issue. @Simon-Steinmann can you try the nightly build as well and see how the performance changes, to confirm what I'm seeing?

nrotella commented 3 years ago

My personal concern is not only with the ability to simulate these sensors at RTF >= 1.0 within Webots, but also with streaming them out to ROS2 without a big performance hit. From what I was seeing, simulating three VLP16s was fine (RTF > 1.0) and adding a single 800x800 RGB camera put me at about RTF = 1.0; however, adding more camera streams dropped the simulation far below real time. @Simon-Steinmann and I were thinking that perhaps the camera data needs to be compressed, or maybe you can think of some other improvement to make camera streaming to ROS2 more efficient?
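
To make the compression idea concrete, here is a hedged C++ sketch, not part of any existing Webots ROS2 interface: it JPEG-encodes a Webots camera frame with OpenCV and publishes it as a sensor_msgs/CompressedImage. The JPEG quality and the publisher wiring are illustrative choices:

```cpp
// JPEG-compress a BGRA Webots camera frame and publish it to ROS2.
#include <opencv2/imgcodecs.hpp>
#include <opencv2/imgproc.hpp>
#include <rclcpp/rclcpp.hpp>
#include <sensor_msgs/msg/compressed_image.hpp>
#include <webots/Camera.hpp>

void publishCompressed(webots::Camera *camera,
                       const rclcpp::Publisher<sensor_msgs::msg::CompressedImage>::SharedPtr &pub) {
  // Webots delivers BGRA pixels; wrap them without copying, then drop alpha.
  cv::Mat bgra(camera->getHeight(), camera->getWidth(), CV_8UC4,
               const_cast<unsigned char *>(camera->getImage()));
  cv::Mat bgr;
  cv::cvtColor(bgra, bgr, cv::COLOR_BGRA2BGR);

  sensor_msgs::msg::CompressedImage msg;
  msg.format = "jpeg";
  cv::imencode(".jpg", bgr, msg.data, {cv::IMWRITE_JPEG_QUALITY, 80});
  pub->publish(msg);  // typically tens of KB per frame instead of 2.56 MB raw at 800x800
}
```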

Simon-Steinmann commented 3 years ago

I did some more testing with the latest R2021b build and created this little test project. It runs the simulation for 10 s with all sensors disabled, then enables all the sensors. The performance hit is still huge. Simply open the world to test it for yourself.

Furthermore, the world requires a reset after the sensors have been enabled once; otherwise the RTF stays low even when the sensors are not enabled. This is reminiscent of the robot window issues. multicam_test.zip
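
For anyone reproducing this: RTF printouts like the ones below can be produced from the controller itself by comparing simulated time against wall-clock time. A minimal sketch (the 10 s reporting window mirrors the test project; everything else is illustrative):

```cpp
// Print the realtime factor over successive 10 s windows of simulated time.
#include <chrono>
#include <cstdio>
#include <webots/Robot.hpp>

int main() {
  webots::Robot robot;
  const int timeStep = (int)robot.getBasicTimeStep();

  double lastSimTime = robot.getTime();
  auto lastWallTime = std::chrono::steady_clock::now();

  while (robot.step(timeStep) != -1) {
    const double simTime = robot.getTime();
    if (simTime - lastSimTime >= 10.0) {
      const auto now = std::chrono::steady_clock::now();
      const double wallSeconds = std::chrono::duration<double>(now - lastWallTime).count();
      std::printf("Realtime factor: %.2f\n", (simTime - lastSimTime) / wallSeconds);
      lastSimTime = simTime;
      lastWallTime = now;
    }
  }
  return 0;
}
```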

omichel commented 3 years ago

I just tested this simulation on my R2021b version:

$ webots --stdout hall_lidar_camera_2.wbt
⋮
Realtimefactor without cameras & lidars enabled: 10.77
Realtimefactor with cameras & lidars enabled: 1.27
⋮
(reset)
⋮
Realtimefactor without cameras & lidars enabled: 1.35
Realtimefactor with cameras & lidars enabled: 1.39
⋮
$ webots --no-rendering --stdout hall_lidar_camera_2.wbt
⋮
Realtimefactor without cameras & lidars enabled: 12.19
Realtimefactor with cameras & lidars enabled: 1.56
⋮
(reset)
⋮
Realtimefactor without cameras & lidars enabled: 1.75
Realtimefactor with cameras & lidars enabled: 1.82
⋮
$ webots --sysinfo
System: Windows 10 64-bit
Processor: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
Number of cores: 4
OpenGL vendor: NVIDIA Corporation (0x10de)
OpenGL renderer: GeForce GTX 760 (192-bit)/PCIe/SSE2 (0x118e)
OpenGL version: 4.6.0 NVIDIA 431.60

These results don't look bad to me. Rendering 10 cameras and 3 lidars unavoidably has a performance impact. One oddity here is the performance after reset, which should be higher while the cameras/lidars are not yet enabled; I guess there is a bug that keeps the cameras/lidars enabled across the reset. Another oddity is the slightly better performance after the reset. This probably comes from the fact that during the first measurement not all the resources are yet allocated / deallocated / garbage collected, which impacts performance; during the second measurement everything is already running and the CPU/GPU are focused on rendering only, not memory management.

It is also important to note that all the devices are enabled with a 32 ms sampling period. Obviously, changing this value will have a direct impact on the RTF, one way or the other.
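
As a concrete example of that knob, a sensor's rendering load can be halved by doubling its sampling period when enabling it ("camera0" is a placeholder device name):

```cpp
#include <webots/Camera.hpp>
#include <webots/Robot.hpp>

// Render this camera every 64 ms instead of every 32 ms basic time step,
// halving its contribution to the rendering load.
void enableAtHalfRate(webots::Robot &robot) {
  webots::Camera *camera = robot.getCamera("camera0");  // placeholder name
  camera->enable(2 * (int)robot.getBasicTimeStep());
}
```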

Simon-Steinmann commented 3 years ago

I'm just curious and would like to understand the pipeline a bit better. From my understanding, cameras and lidar sensors are GPU-based, so I understand the increased GPU usage when enabling these sensors (24% to 50% on my setup, under Windows). However, I don't fully understand the huge decrease in simulation speed. I suspect it has something to do with the rather large amount of generated data having to be shuffled around. Could you give some insight into this process? I wonder if it could not be vastly improved by parallelizing it (no serial physics calculation is needed for this step).

Simon-Steinmann commented 3 years ago

profiling.zip

I profiled the latest dev build of Webots, running the same world but enabling all cameras immediately. You can view the profiling data with:

$ pip3 install xdot
$ python3 -m xdot

and then open the .dot file in the zip folder I uploaded. I noticed something:

[image]

Due to the many cameras, we have many render calls (makes sense so far).

[image]

This leads to the following function being called for the main view plus each camera, instead of being called once per step. The issue arises here:

[image]

The prepareRender function is called each time, going through every node in the scene and applying transformations. To my understanding, this only has to be done once per time step. If that is true, we waste tons of time (especially in complex worlds) doing this over and over again for every visual sensor.
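
To make that pattern concrete, here is pseudocode with hypothetical names (none of these are the real wren API): the scene walk currently runs once per viewport, and the suggestion is to hoist the camera-independent part out of the loop:

```cpp
#include <vector>

// Hypothetical stand-ins for wren's Scene/Viewport, only to show the call pattern.
struct Viewport {};
struct Scene {
  void prepareRender(Viewport *) { /* walks every node, updates transforms */ }
  void updateWorldTransforms() { /* hypothetical camera-independent pass */ }
  void render(Viewport *) { /* per-camera culling and draw calls */ }
};

// Current pattern: the full scene walk repeats for the main view and every camera.
void renderPerViewport(Scene &scene, std::vector<Viewport *> &viewports) {
  for (Viewport *v : viewports) {
    scene.prepareRender(v);
    scene.render(v);
  }
}

// Hoisted pattern: if the transforms are camera-independent, walk once per step.
void renderHoisted(Scene &scene, std::vector<Viewport *> &viewports) {
  scene.updateWorldTransforms();
  for (Viewport *v : viewports)
    scene.render(v);
}
```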

omichel commented 3 years ago

I am not sure prepareRender can be called only once for all cameras, as it may depend on the parameters of the camera (different viewpoints yield different frustums, culled objects, resolutions, etc.). However, it could certainly be parallelized.
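
A sketch of that parallelization, again with hypothetical names: the per-camera CPU pass could run across threads, provided each viewport builds its own render queue (the current code clears a shared queue, see below), while the GL draw calls stay on the rendering thread since an OpenGL context is bound to a single thread:

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// Hypothetical stand-ins, as in the previous sketch.
struct Viewport {};
struct Scene {
  void prepareRender(Viewport *) { /* culls into a per-viewport queue (required!) */ }
  void render(Viewport *) { /* serial GL submission */ }
};

void renderParallelPrepare(Scene &scene, std::vector<Viewport *> &viewports) {
  // Parallel CPU pass: frustum culling and queue building per camera.
  std::for_each(std::execution::par, viewports.begin(), viewports.end(),
                [&scene](Viewport *v) { scene.prepareRender(v); });
  // GL work remains on the single rendering thread.
  for (Viewport *v : viewports)
    scene.render(v);
}
```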

Simon-Steinmann commented 3 years ago

[image]

I'm just wondering if line 341 has to be executed for every viewport. What exactly is it doing? Its comment says "// Updates transforms and propagates to children." Are these the transforms of the scene in world coordinates, or relative to the viewport? If the latter is required, I would understand why it has to be called so often. https://github.com/cyberbotics/webots/blob/develop/src/wren/Scene.cpp#L341

llessieux commented 3 years ago

updateFromParent goes through the scene tree, queuing each object that needs to be rendered; the renderQueue is cleared just before that. So I think that, yes, you need it. However, I still have to find which parameters would make the queued objects differ between cameras. I guess instrumenting the code (though it is in wren... how do you debug that??) to show the queue of objects for each camera would clarify that once and for all.

I have to say that everything seems a bit slow in general for my high-spec machine. Even at full speed, some simulations are stuck at 10% CPU usage.

Ok I can dump the stuff that is queued. I will check more tomorrow.
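
One possible shape for that instrumentation, with hypothetical types (the real queue lives inside wren): dump what each viewport queued right after its prepareRender, then diff the dumps between cameras:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-ins for whatever wren queues per render.
struct Renderable { std::string name; };
struct RenderQueue { std::vector<const Renderable *> items; };

// Log the queue contents for one viewport so per-camera queues can be diffed.
void dumpQueue(const RenderQueue &queue, const char *viewportLabel) {
  std::printf("--- %zu objects queued for %s ---\n", queue.items.size(), viewportLabel);
  for (const Renderable *r : queue.items)
    std::printf("  %s\n", r->name.c_str());
}
```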

llessieux commented 3 years ago

As far as I can see, if there is no object motion between the main scene render and the camera/lidar rendering, then you can do with a single prepareRender: the result of the calls is the same for all viewports once the main scene has been rendered (I instrumented the code to check that).

However, on my machine, doing the prepareRender only once had barely any performance impact; it certainly won't change the GPU load, for example. Now there is something really weird with the performance I am seeing. In the sample, if I have all cameras and lidars off, I get 16.0x realtime speed.

As soon as I turn on a single camera, I fall to 2.85x, and it degrades from there (with prepareRender called once), compared to 2.63x when calling prepareRender every time. That is a much bigger cost than I expected, but then again we should probably look at the time in ms rather than the reciprocal RTF shown here (see the conversion sketch after the tables below).

Default code:

  No camera + no sensor: 16.0x
  1 camera: 2.63x
  2 cameras: 1.63x
  3 cameras: 1.15x
  4 cameras: 0.82x
  5 cameras: 0.70x
  6 cameras: 0.68x
  7 cameras: 0.60x
  8 cameras: 0.52x
  9 cameras: 0.47x
  All cameras: 0.4x
  All cameras + lidars: around 0.18x

PrepareRender on default scene render only:

  No camera + no sensor: 16.0x (expected, no difference from above)
  1 camera: 2.85x
  2 cameras: 1.65x
  3 cameras: 1.27x
  All cameras + lidars: around 0.18x
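
Following up on the ms-versus-reciprocal point above: converting the "Default code" numbers with the 32 ms basic time step mentioned earlier (wall time per step = 32 ms / RTF) shows that each additional camera adds a roughly constant few-to-twelve milliseconds; the first camera only looks catastrophic because the RTF scale is reciprocal. A small sketch of the conversion:

```cpp
#include <cstdio>

// Re-express the "Default code" RTF table in wall-clock ms per 32 ms step.
int main() {
  const double stepMs = 32.0;  // basic time step of the test world
  const double rtf[] = {16.0, 2.63, 1.63, 1.15, 0.82, 0.70, 0.68, 0.60, 0.52, 0.47, 0.4};
  double previous = stepMs / rtf[0];
  for (int cameras = 0; cameras <= 10; ++cameras) {
    const double wallMs = stepMs / rtf[cameras];
    std::printf("%2d cameras: %5.1f ms/step (+%4.1f ms)\n", cameras, wallMs, wallMs - previous);
    previous = wallMs;
  }
  return 0;
}
```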

Now, my CPU is a Ryzen 9 3950X, so it might be fast enough that the prepareRender doesn't really make a difference here. But only 10% of the CPU is ever used, and I'm not sure why, even with the simulation set to use 16 cores; I guess the simulation itself is not the issue here. And my GPU (RTX 3090) shouldn't really be a bottleneck either, so I suspect that the way the work is queued in OpenGL is causing the big performance impact.

I can dig into that a bit, but it has been a while since I did any OpenGL :) so I might not be able to extract too much info. My guess: if you really want the best GPU performance, you will need to rewrite the rendering engine in Vulkan, which should enable you to render ALL cameras at the same time instead of one by one, and probably benefit a lot.

llessieux commented 3 years ago

Lol, it seems that I am hitting another bug when testing: Webots thinks that I am running in a VM, and I am seeing glFlush calls after every glDrawElements.