autorope / donkeycar

Open source hardware and software platform to build a small scale self driving car.
http://www.donkeycar.com
MIT License
3.16k stars 1.3k forks

Donkey Car Framework - performance or parallelization problem during training #796

Open custom-build-robots opened 3 years ago

custom-build-robots commented 3 years ago

I have been given a PC for testing on which I can install the Donkey Car framework. While training a neural network on more than 120,000 records, I noticed a performance problem. The PC has an AMD Threadripper CPU with 24 cores and two RTX 8000 graphics cards. Unfortunately, the CPU does not feed images to the graphics cards fast enough. As a result, GPU 1 runs at about 30% load and GPU 2 at 0% to 1%. I would like to understand where exactly the problem lies and how it could be addressed, because with Donkey Car framework 2.5.8 I did not have this problem on 6 Tesla V100s. Those ran under full load when training the neural network.

The following video shows the problem quite well: https://www.youtube.com/watch?v=26up9I1K3fg

I would be grateful for feedback, and for any hints on how I could improve the parallelization of image processing.

DocGarbanzo commented 3 years ago

Actually, that is surprising, because in DC 4.x we use tf.data.Dataset.from_generator(), which is supposedly the fastest way to feed data to the GPU and increase utilisation. I'm running a single GPU only and saw performance improvements when we moved to 4.x. Can you play around with the other parameters of that function? Maybe there are settings that help with multiple GPUs.
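
For readers following along, here is a minimal sketch of the kind of input pipeline being discussed: `tf.data.Dataset.from_generator()` combined with a parallel `map` and `prefetch`. It is not the Donkey Car implementation. The generator `record_generator`, the image shape, and the two-value label are hypothetical stand-ins for a real tub reader. Note that the Python generator itself still runs in a single Python process, so heavy work inside it can remain the bottleneck even with prefetching.

```python
import numpy as np
import tensorflow as tf

IMG_SHAPE = (120, 160, 3)  # assumed frame size, for illustration only


def record_generator(n_records=64):
    """Hypothetical stand-in for a tub reader: yields (image, labels) pairs."""
    rng = np.random.default_rng(0)
    for _ in range(n_records):
        image = rng.integers(0, 256, size=IMG_SHAPE, dtype=np.uint8)
        labels = rng.uniform(-1.0, 1.0, size=2).astype(np.float32)
        yield image, labels


def normalize(image, labels):
    # Runs inside the tf.data graph, so it can be parallelized by map().
    return tf.cast(image, tf.float32) / 255.0, labels


def make_dataset(batch_size=16):
    ds = tf.data.Dataset.from_generator(
        record_generator,
        output_signature=(
            tf.TensorSpec(shape=IMG_SHAPE, dtype=tf.uint8),
            tf.TensorSpec(shape=(2,), dtype=tf.float32),
        ),
    )
    # Parallelize per-record preprocessing across CPU cores.
    ds = ds.map(normalize, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(batch_size)
    # Prepare the next batches while the current one is being trained on.
    return ds.prefetch(tf.data.AUTOTUNE)
```

Moving per-image work (decoding, normalization, augmentation) out of the generator and into `map` is usually what lets additional CPU cores help.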

custom-build-robots commented 3 years ago

@DocGarbanzo okay, I will look into that. I saw the problem on different servers (2 x RTX 8000 and 2 x RTX 3090) with the latest version of the DC framework. Another problem is that the full GPU memory is always allocated. Maybe TensorFlow's experimental memory growth option could also be implemented. Maybe I will get that running as well...
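
The memory growth option mentioned above could look roughly like the sketch below. It uses `tf.config.experimental.set_memory_growth`, which has to be called before TensorFlow initializes the GPUs (that is, before any model or tensor is created on them). The helper name `enable_memory_growth` is made up for this example and is not part of the Donkey Car code.

```python
import tensorflow as tf


def enable_memory_growth():
    """Ask TensorFlow to allocate GPU memory on demand instead of all at once.

    Returns the number of GPUs the setting was applied to (0 on a CPU-only host).
    """
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # Must run before the GPU is first used, otherwise TF raises an error.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)
```

Calling this once at the very start of the training script should be enough. It changes how memory is reserved, not how fast the GPUs are fed.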

tikurahul commented 3 years ago

You need to use MirroredStrategy to use multiple GPUs with TensorFlow. It's essentially a preamble to the training script we have.

https://www.tensorflow.org/guide/distributed_training#mirroredstrategy

You will need to specify the GPU ids to use, and then TF will start distributing your workload. Here is an example.
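
As a rough illustration (not the example linked in the comment above, and not the Donkey Car training script), a `tf.distribute.MirroredStrategy` preamble might look like this. The tiny model, the random data, and the function names `build_model` and `train_mirrored` are placeholders. Passing `devices=None` lets TensorFlow use all visible GPUs, while a list such as `["/gpu:0", "/gpu:1"]` selects specific ones.

```python
import numpy as np
import tensorflow as tf

IMG_SHAPE = (120, 160, 3)  # assumed frame size, for illustration only


def build_model():
    # Placeholder network with two outputs (e.g. steering and throttle).
    inputs = tf.keras.Input(shape=IMG_SHAPE)
    x = tf.keras.layers.Conv2D(8, 5, strides=2, activation="relu")(inputs)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(2)(x)
    return tf.keras.Model(inputs, outputs)


def train_mirrored(devices=None, per_replica_batch=8, steps=2):
    strategy = tf.distribute.MirroredStrategy(devices=devices)
    # Each replica gets per_replica_batch samples out of every global batch.
    global_batch = per_replica_batch * strategy.num_replicas_in_sync

    n = global_batch * steps
    images = np.random.rand(n, *IMG_SHAPE).astype(np.float32)
    labels = np.random.uniform(-1, 1, size=(n, 2)).astype(np.float32)
    dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(global_batch)

    # Model and optimizer variables must be created inside the strategy scope.
    with strategy.scope():
        model = build_model()
        model.compile(optimizer="adam", loss="mse")

    history = model.fit(dataset, epochs=1, verbose=0)
    return strategy, history
```

The batch size given to the dataset is the global one, so it is scaled by `strategy.num_replicas_in_sync` to keep the per-GPU batch constant. On a machine without GPUs the strategy falls back to a single replica on the CPU.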

Ezward commented 1 year ago

@DocGarbanzo also implemented an improvement to caching when using augmentation, in PR https://github.com/autorope/donkeycar/pull/1050. Total training time on my RTX-2060 went from 2.5 hours to 45 minutes for 21 epochs.