najibghadri commented 4 years ago

Carla 0.9.8 Linux TITAN X

Dear CARLA team! First of all, I love the goal of the project and I think there is great potential. I am working on a scene-understanding project with all the stuff (stereo image processing, deep learning, etc.). I have been working on this project for one and a half months now, and I chose CARLA at the beginning because it was the only simulator I found that met my essential requirements: custom camera settings, simulated traffic, and maybe an API that lets me tinker with the world. Before I decided on this, GTA V was my other option, but CARLA seemed better suited, and honestly I still think so.

However, when it comes to rendering scenarios, it is terrible. I spawn 50 cars, 50 pedestrians, and my car with only one camera at 700x700, hit record, and the FPS is 15. Cars hit each other and drive jerkily, and I can't make a decent recording with even one camera when I will need at least four.

I wonder why this is the case. Why do games like GTA V, PUBG, or anything else run with no lag at 60 FPS, as expected, on a GPU like mine and even on weaker GPUs, while a simple car simulator runs at 10-15 FPS? I know roughly how games work, but I can't think of a reason why this happens. Is this solely because of Python?

Most of these issues went unsolved for them. Since this project was created with the goal of helping scientists of Autonomous driving, deep learning, image processing research I think the most essential thing is being able to render images with decent quality and frame rate when having GPUs like that.

Perhaps you could explain here what the core reason for this is. I am sure it is a technical reason, in which case have you considered trying to get to the root of the problem? I am sure that if these crazy games today can achieve 30-60 FPS on average machines, our beloved car simulator can run better on a strong GPU! Thanks!

germanros1987 commented 4 years ago

@najibghadri Thanks for your issue.

We understand your frustration. The faster CARLA can do sensor simulation the better. There is no doubt about that. However, here is the thing: sensor simulation is computationally very expensive.

Videogames like GTA-V are optimized to render 1 camera in real-time with very large budgets and game engines created ad-hoc for that purpose. CARLA is a general Autonomous Driving simulator, which means that we need to cover many other use cases besides sensor simulation/rendering. We need to provide a general and flexible architecture to allow the extension of the CARLA server via multiple clients, we need to account for a generic mechanism to render multiple sensors other than cameras, and all of that for any map that a user can come up with. This introduces CPU overheads. So even if you got a new NVIDIA RTX XYZ, how good is your CPU? Did you get the latest Xeon?

Down to the technical bottleneck: Unreal Engine 4. Long story short, rendering large maps over 1 GPU without proper optimization for that map is going to generate a bottleneck. Now, what are we doing to solve all these issues and make CARLA faster?

Make traffic independent of the server to alleviate computation load
Optimize assets and materials
Keep upgrading the version of UE4 until we end up with decent support for Vulkan (not as mature as you may think...)
Design a multi-server architecture to use multiple GPUs to render multiple sensors.

We take this problem seriously and we have a team of amazing engineers working on improving CARLA. If somebody from the community has a reasonable plan of attack to improve performance without losing flexibility I am happy to hear it.

najibghadri commented 4 years ago

@germanros1987 Thanks for the answer! I can imagine your challenges, and I will try to get the best out of CARLA. Wishing you the best. I am now thinking I should use a fixed simulation time-step (synchronous mode): the simulation won't run in real time, but the recorded frames should hopefully look good when replayed. I might post an update here about that.

jahaniam commented 4 years ago

@germanros1987 I have proposed a way to improve the speed using multiple gpus in this issue:

#2296

najibghadri commented 4 years ago

I managed to render a very nice scenario with 10 cameras, and I made a pull request with the modifications to manual_control.py that made it possible. The solution was a fixed simulation time-step (30 FPS), synchronous mode, and a new server tick thread, separate from the main client thread, whose ticks are synchronized with all cameras using a semaphore. Obviously, as I said previously, this wasn't real-time; in fact it ran at 2 FPS. I let it run overnight for around 2 hours, which gives me an 8-minute scenario. This is okay for me since I am at home quarantining and running it on the remote server overnight.
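The per-tick synchronization described above can be sketched with the Python standard library alone. This is a minimal illustration, not the code from the pull request: `run_synchronized`, `on_image`, and `fake_tick` are hypothetical names, and the simulator tick is replaced by threads that deliver one "image" per camera.

```python
import threading


def run_synchronized(num_cameras=10, num_ticks=5):
    """Advance a fake simulation only after every camera delivered its frame."""
    frames_ready = threading.Semaphore(0)  # counts images received for the current tick
    saved = []                             # (frame, camera_id) pairs in arrival order
    saved_lock = threading.Lock()

    def on_image(camera_id, frame):
        # Stand-in for a camera callback: "save" the image, then signal arrival.
        with saved_lock:
            saved.append((frame, camera_id))
        frames_ready.release()

    def fake_tick(frame):
        # Hypothetical replacement for the simulator tick: each camera
        # delivers its image asynchronously from its own thread.
        for camera_id in range(num_cameras):
            threading.Thread(target=on_image, args=(camera_id, frame)).start()

    for frame in range(num_ticks):
        fake_tick(frame)
        # Block until all cameras have reported for this frame.
        for _ in range(num_cameras):
            frames_ready.acquire()
    return saved
```

Because the loop acquires the semaphore once per camera before issuing the next tick, no frame can be skipped or partially saved. The timing quoted above is also consistent: 2 hours at 2 rendered frames per second is 14,400 frames, which plays back as 480 seconds (8 minutes) at 30 FPS.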

In the description I also wrote a suggestion for improving the documentation regarding this. I hope this is useful!

Very short clip of it: [GIF: ezgif com-video-to-gif (1)]

germanros1987 commented 4 years ago

How is this better than what we have in CARLA? What is the FPS that you get if you must spawn the 10 sensors following the standard procedure?

najibghadri commented 4 years ago

Same FPS. The point here is setting a fixed time-step with sync. Without that, in variable time-step, weird things happen (cars don't follow a line but zigzag and hit each other, and pedestrians teleport). It wasn't clear to me that a fixed time-step is the key, and I think it should be the default, or the documentation should explain that anyone who experiences weird behavior should use it. So spawn_npc should only be used with --sync. What I did in the file is this: instead of letting spawn_npc be the sync master, I made manual_control the sync master and had it wait on every tick until all 10 images arrive before saving them, because I experienced that sometimes images don't get saved properly or at all. I elaborated in the pull request, but it is OK if you don't find this useful. It worked for me; maybe only I experience this, and otherwise everything should work fine and images should get saved properly even if you overload it with a bunch of I/O writes.
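For reference, switching a world to synchronous mode with a fixed time-step looks roughly like this with the CARLA Python API. This is a sketch under assumptions: `enable_fixed_step_sync` is a hypothetical helper name, `world` is assumed to be a `carla.World` obtained from a connected client, and the 30 FPS default mirrors the value mentioned earlier in this thread.

```python
def enable_fixed_step_sync(world, fps=30):
    """Switch a CARLA world to synchronous mode with a fixed time-step.

    `world` is assumed to behave like carla.World (get_settings / apply_settings).
    """
    settings = world.get_settings()
    settings.synchronous_mode = True          # server waits for a client tick each step
    settings.fixed_delta_seconds = 1.0 / fps  # simulated seconds advanced per step
    world.apply_settings(settings)
    return settings
```

Once this is applied, the client acting as sync master has to call `world.tick()` for every step; with more than one script involved, only one of them should do the ticking.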

germanros1987 commented 4 years ago

It should not be the default. This is well documented in our documentation and there are good reasons for not having the sync mode as a default.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

germanros1987 commented 4 years ago

I am reopening this... just to get the satisfaction of closing it later during the 0.9.10 release. Expect big things in terms of performance improvement for 0.9.10.

hh0rva1h commented 3 years ago

@germanros1987 Hope this belongs here since it's a performance issue: We experience that Carla in synchronous mode is very slow compared to asynchronous mode here on a GTX 1080 Machine:

When CARLA is run with -quality-level=Low, the asynchronous mode reaches around 270 ticks per second here, while the synchronous mode reaches around 30. When quality is set to Epic the difference is smaller (27 vs. 80 ticks) but still considerable.

Is that to be expected? Why is synchronous mode so much slower? The script I use to determine those numbers can be found here: https://gist.github.com/hh0rva1h/7e983688c399c518634ebcc0606c9523

To reproduce start Carla and set the sync variable in the script to True to measure the tick performance in synchronous mode and set it to False to measure asynchronous mode.

germanros1987 commented 3 years ago

But the asynchronous version of your code doesn't seem correct. Where is the wait_for_tick? You are just getting garbage (whatever is already cached by the carlalib) from get_velocity. Could that be the issue? In other words, the asynchronous mode is not that quick.
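A tick-rate measurement that respects this point might look like the sketch below. It is not the script from the gist: `measure_tick_rate` is a hypothetical helper, and `world` is assumed to be a `carla.World` that is already configured in the mode being measured. In synchronous mode the client drives the server with `world.tick()`; in asynchronous mode it blocks on `world.wait_for_tick()` so each loop iteration corresponds to a real server step instead of cached data.

```python
import time


def measure_tick_rate(world, synchronous, num_ticks=100):
    """Return the ticks per second observed by this client."""
    start = time.perf_counter()
    for _ in range(num_ticks):
        if synchronous:
            world.tick()           # client advances the server by one step
        else:
            world.wait_for_tick()  # block until the server publishes a new step
    elapsed = time.perf_counter() - start
    # Guard against a zero interval when used with a stub world.
    return num_ticks / max(elapsed, 1e-9)
```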

hh0rva1h commented 3 years ago

@germanros1987 I updated the script to measure performance, see https://gist.github.com/hh0rva1h/7e983688c399c518634ebcc0606c9523

In async mode I get between 205 and 225 ticks per second (when running script for the first time it is always around 205 ticks when invoking a second time the tick rate rises to 225). In sync mode I get between 22 and 55 ticks per second (when running the script for the first time it is always around 22 ticks, when invoking a second time the tick rate rises to 55).

@marcgpuig already confirmed to me by email that he could reproduce the issue.

Shall I open a new bug report for this issue? Could this be added to the documentation here?: https://carla.readthedocs.io/en/latest/adv_synchrony_timestep/

germanros1987 commented 3 years ago

We are looking into it. Let's see if we can find a bottleneck in the synchronous mode. We will come back to you.

syveqc commented 3 years ago

I was also able to confirm this using a machine with a GTX 1060 and one with a GTX 1080. (GTX 1060 got 53 ticks in async mode, 33 in sync mode, GTX 1080 got 241 ticks in async mode, 23 in sync mode) But the interesting thing was, when I connected via the local network from the other machine respectively, there was no real difference between the two modes, I got the tick counts from async mode in both cases! So with the GTX 1080 as the server, GTX 1060 as client I got around 220-240 ticks, with GTX 1060 as server, GTX 1080 as client I got 50-60 ticks. Hope this helps for figuring out the bug!

syveqc commented 3 years ago

I just tested it on Windows (on the GTX 1080 machine) and the problem does not seem to exist there, I get 100-120 ticks consistently in both sync and async mode!

marcgpuig commented 3 years ago

Thanks @hh0rva1h @syveqc. It seems the Nagle algorithm is enabled by default in Linux but not in Windows. We've forced its deactivation with ip::tcp::no_delay and now we're experiencing 60 fps in both sync and async with tests similar to the one that @hh0rva1h provided. This change is currently being tested. If you want to try it, you can do so in the branch marcgpuig/sync_impr. All your feedback is welcome :)
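For readers unfamiliar with the option, the Python standard library exposes the same switch as `TCP_NODELAY`. The sketch below is only an analogue of the `ip::tcp::no_delay` setting mentioned above, not CARLA's code, and `make_low_latency_socket` is a hypothetical name. Disabling Nagle's algorithm makes small writes go out immediately instead of being coalesced, which matters for a request/response exchange such as one tick per round trip.

```python
import socket


def make_low_latency_socket():
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 sends small packets right away rather than buffering them.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```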

marcgpuig commented 3 years ago

The changes are already in master!

hh0rva1h commented 3 years ago

@marcgpuig Thanks, can confirm the fix, frame rates are now identical.

We have yet another performance problem: calling world.get_map().get_waypoint(pos) after every few ticks limits the frame rate yet again to around 20 fps, which comes as a surprise to us since the lane invasion sensor does not slow down the simulation at all. Shall I open another bug report for this issue?

germanros1987 commented 3 years ago

Hi @hh0rva1h,

Indeed. Please open a new issue so that we can track this down properly.

hh0rva1h commented 3 years ago

@germanros1987 Done, see https://github.com/carla-simulator/carla/issues/2992

tinmodeHuang commented 3 years ago

Hi @germanros1987, if I'm going to purchase a machine to run CARLA smoothly, can you give me some advice, such as whether to prioritize more CPU cores, a higher processor base frequency, or more GPUs? Specifically, which one should I choose between an 8-core W-3223 + 4*RTX 3090 and a 16-core W-3245 + 3*RTX 3090? It looks like the W-3223 is close in performance to the R7 2700X I am currently using.

germanros1987 commented 3 years ago

Hi,

Remember that all depends on your particular use cases. There are use cases for which you just won't get real-time today. For instance, if you try to run a simulation of a vehicle with several LIDARs, a dozen cameras, etc. No matter what hardware you get, it won't run in real-time today.

For reasonable use cases, both hardware configurations are very decent. The frequency vs core balance is a complicated question since CARLA leverages both for different purposes. More threads are going to help expedite LIDAR and RADAR simulation and also the traffic manager. A higher frequency is going to help to keep up with the GPU.

At the end of the day, it is your call. To me the W-3223 looks good enough.

tinmodeHuang commented 3 years ago

In my use case, a large number of vehicles and pedestrians were spawned in a map with the script spawn_npc.py. I could clearly notice frame lag in the spectator view, especially when the spectator was moving, and sometimes certain vehicles were shaking back and forth. The indicators for the different components of my machine looked relatively normal in the Session Manager, but it is worth noting that all cores of the R7 2700X CPU were in use and the clock frequency stayed at a high level, so I guess that the 16 GB of RAM and the GTX 1660 Ti GPU may not be the bottleneck, while the CPU is. In addition, it often takes a long time to train an RL agent.

Finally, could anyone tell me whether I should upgrade to a more powerful CPU? Thanks in advance @germanros1987

Akul2010 commented 1 year ago

I can't buy a new computer. My CPU and GPU are 16 GB, I currently have 70 GB of free SSD space, and I still can't use sensor code on low quality without lag. Please help.

carla-simulator / carla

CARLA IS SLOW #2617
