custom-build-robots opened this issue 3 years ago
Actually, that is surprising, because in DC 4.x we are using tf.data.Dataset.from_generator(), which is supposedly the fastest way to feed data into the GPU and increase utilisation. I'm running a single GPU only and saw performance improvements when we moved to 4.x. Can you play around with the other parameters of that function? Maybe there are some settings that improve multi-GPU performance.
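For reference, a minimal sketch of the kind of input-pipeline tuning that could be tried (the generator and output signature here are illustrative placeholders, not the actual donkeycar pipeline; batch size and prefetch depth are the knobs to experiment with):

```python
import tensorflow as tf

# Illustrative generator standing in for the donkeycar record generator.
def record_generator():
    for _ in range(1000):
        yield tf.zeros((120, 160, 3)), tf.zeros((2,))

dataset = tf.data.Dataset.from_generator(
    record_generator,
    output_signature=(
        tf.TensorSpec(shape=(120, 160, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(2,), dtype=tf.float32),
    ),
)

# Larger batches plus prefetching let the CPU prepare the next batches
# while the GPU is still busy with the current one.
dataset = dataset.batch(128).prefetch(tf.data.AUTOTUNE)
```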
@DocGarbanzo okay, I will have a look into that topic. I saw the problem on different servers (2 x RTX 8000 and 2 x RTX 3090) with the latest version of the DC framework. Another problem is that the full GPU RAM is always allocated. Maybe TensorFlow's experimental memory growth function could also be implemented. Maybe I will get that running as well...
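For the memory issue, this is roughly what the experimental memory growth setting looks like (a small sketch only; it has to run before the first GPU operation):

```python
import tensorflow as tf

# Ask TF to allocate GPU memory on demand instead of reserving
# all of it up front. Must be set before any GPU is initialized.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
```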
You need to use MirroredStrategy to be able to use multiple GPUs with TensorFlow. It's essentially a preamble to the training script we have.
https://www.tensorflow.org/guide/distributed_training#mirroredstrategy
You will need to specify the GPU IDs to use, and then TF will start distributing your workload. Here is an example.
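A minimal sketch of the idea (the model below is just a stand-in for the actual donkeycar model; the relevant parts are the device list and the strategy scope):

```python
import tensorflow as tf

# Replicate the model on the listed GPUs; each batch is split across them.
strategy = tf.distribute.MirroredStrategy(devices=['/gpu:0', '/gpu:1'])

with strategy.scope():
    # Model creation and compilation must happen inside the scope.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(120, 160, 3)),
        tf.keras.layers.Conv2D(24, 5, strides=2, activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(2),
    ])
    model.compile(optimizer='adam', loss='mse')

# model.fit(...) is then called as usual with a tf.data dataset.
```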
@DocGarbanzo also implemented an improvement to caching when using augmentation in PR https://github.com/autorope/donkeycar/pull/1050. Total training time on my RTX 2060 went from 2.5 hours to 45 minutes for 21 epochs.
I recently received a PC for testing purposes on which I was allowed to install the Donkey Car framework. While training a neural network on more than 120,000 records, I noticed a performance problem. The PC has an AMD Threadripper CPU with 24 cores and two RTX 8000 graphics cards. Unfortunately, the graphics cards are not supplied with images fast enough by the CPU. As a result, GPU 1 runs at about 30% load and GPU 2 at 0% to 1%. I would like to understand where exactly the problem lies and how it could be addressed, because with Donkey Car framework 2.5.8 I did not have this problem with 6 Tesla V100s: they ran at full load when training the neural network (version 2.5.8).
The following video shows the problem quite well: https://www.youtube.com/watch?v=26up9I1K3fg
I would be very happy about feedback, and about any hints on how I could tackle the problem and improve the parallelization of the image processing.