AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

Training only uses 4 cores? #1774

Open kartur opened 6 years ago

kartur commented 6 years ago

While training I see that the GPU part of every iteration finishes very quickly, but the CPU part takes quite long. This leads to a severe bottleneck where the GPU always has to wait a relatively long time for the CPU (see "GPU Load" in the screenshot). I also noticed that training only seems to use about 3-4 cores.

I use Windows Server 2016 with the current sources (76706dc), CUDA 9.2 and cuDNN 7.2.1.38. AFAIK OpenMP and AVX are enabled by default when building under Windows.

So is there anything I can do to speed up the CPU part? (screenshot: darknet_training)

While training I see something like "Loaded: 6.516221 seconds", whereas other people seem to have values like 0.000XXX.

// Never mind, I see the loading phase can be improved considerably by lowering the file size of the pictures. I still don't get near 100% GPU load, but it seems to be around 70% on average now, which I think is acceptable. If you want, you can close this issue. :)

AlexeyAB commented 6 years ago
kartur commented 6 years ago

Thank you for your reply.

un-comment this line

So that's where we control the number of threads. Is there any particular reason it is 3 (or 12) times the number of GPUs? I did a little testing, and for my system the thread scaling seems rather odd. I can see that increasing args.threads improves CPU utilization drastically, but for some reason it still doesn't result in better performance. I simply used the loading times and the average GPU utilization as metrics, and no matter how many CPU threads I use, loading times are similar and GPU utilization stays around 7-10%. Even when comparing 4 to 64 threads (on a machine with 88 logical cores available), there is no improvement.

Only when using batch-converted low-resolution versions of the images do I get a significant boost in training performance. Scaled to at most 800x800, I get much better loading times, resulting in a GPU utilization of ~70%.

What OpenCV version do you use?

3.4.0

What batch and subdivisions do you use in your cfg-file?

batch=64, subdivisions=8 (see the cfg sketch below)

What is the average resolution of your training images?

high-res images: 3000x4000; low-res converted images: 600x800
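
For reference, a minimal sketch of how those two keys sit in the [net] section of the cfg-file, with the values from this thread: darknet splits each batch into batch/subdivisions mini-batches, so every forward/backward pass on the GPU holds 8 images, and weights are updated once per 64 images.

```
[net]
# 64 images per weight update
batch=64
# the batch is split into 64/8 = 8-image mini-batches,
# so each forward/backward pass on the GPU holds 8 images
subdivisions=8
```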

AlexeyAB commented 6 years ago

Also check that the bottleneck is the CPU and not the HDD/SSD. As a test, leave only 1 image in train.txt and run training with args.threads = 3 * ngpus;, wait 3-4 iterations until the image is cached in RAM, and if the GPU is then used at ~80%, the bottleneck is the HDD/SSD. In that case only the first ~1000 iterations will be slow; subsequent iterations will be fast, since all images will be cached in RAM.


But for your high-resolution images it may be necessary to use a higher number of threads: https://github.com/AlexeyAB/darknet/blob/3b59178dec5e9b105b2fba23001c768206da0276/src/detector.c#L120 even if you use a modern server CPU: https://en.wikichip.org/wiki/intel/xeon_gold/6152 (though the Tesla P100 isn't as fast as a Tesla V100 with Tensor Cores enabled).


Is there any particular reason it is 3 (or 12) times the number of GPUs?

For example, 3 is the optimal number of CPU threads for pre-processing images on the Amazon EC2 p3.2xlarge cloud instance if darknet is compiled with OpenCV (AVX & multithreading):

args.threads = 2 * ngpus; and args.threads = 4 * ngpus; work slower than args.threads = 3 * ngpus;.


For example, Amazon allocates only 4 physical cores (8 logical cores) for the p3.2xlarge instance. The OpenCV functions are presumably so fast and so highly optimized that they use all AVX2 ALUs (on all CPU ports of the physical cores), so it doesn't make sense to use Hyper-Threading (logical cores), and the optimal number of CPU threads is 4 instead of 8. Since 1 CPU thread is used to control the GPU kernel functions, we should use 3 CPU threads for image pre-processing.

Image pre-processing on the CPU and batch training on the GPU run simultaneously, in parallel. So usually 3 CPU cores pre-process 64 images faster than a Tesla V100 GPU (Tensor Cores, ~100 TFLOPS) processes 64 images during training, if we use OpenCV compiled as Release with AVX and multithreading. For any slower GPU, fewer than 3 CPU cores are enough.
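
To make the overlap concrete, here is a minimal, self-contained sketch of this producer/consumer pattern using pthreads. It is not darknet's actual code: batch_t, load_batch() and train_on_batch() are placeholders standing in for the real OpenCV pre-processing and the GPU training step.

```c
/* Sketch of overlapping CPU data loading with GPU training (not darknet code). */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

typedef struct { int id; } batch_t;   /* placeholder for a pre-processed batch */

/* Producer: pre-process the next batch on the CPU (placeholder work). */
static void *load_batch(void *arg)
{
    batch_t *b = (batch_t *)arg;
    usleep(100 * 1000);               /* stands in for decode + resize + augment */
    printf("CPU: finished pre-processing batch %d\n", b->id);
    return NULL;
}

/* Consumer: train on the batch (placeholder for the GPU forward/backward pass). */
static void train_on_batch(const batch_t *b)
{
    usleep(150 * 1000);
    printf("GPU: finished training on batch %d\n", b->id);
}

int main(void)
{
    batch_t next = { .id = 0 };
    pthread_t loader;
    pthread_create(&loader, NULL, load_batch, &next);    /* pre-load batch 0 */

    for (int i = 0; i < 5; ++i) {
        pthread_join(loader, NULL);                       /* wait for batch i */
        batch_t current = next;

        next.id = i + 1;                                  /* start loading i+1 ... */
        pthread_create(&loader, NULL, load_batch, &next);

        train_on_batch(&current);                         /* ...while training on i */
    }
    pthread_join(loader, NULL);
    return 0;
}
```

While the main thread (standing in for the GPU) trains on batch i, the loader thread is already preparing batch i+1; more loader threads only help when pre-processing a batch takes longer than one training step.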

But for some other CPUs OpenCV isn't as well optimized (OpenCV is owned by Intel, so it isn't well optimized for the competing AMD Ryzen 7 2700X with 16 logical cores, for example). For such CPUs, or if we use an older CPU with a higher number of cores, we should use a higher args.threads: https://github.com/AlexeyAB/darknet/blob/3b59178dec5e9b105b2fba23001c768206da0276/src/detector.c#L120
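
Paraphrasing the linked line in src/detector.c (an approximation of the code at that commit, not a verbatim copy), the tweak under discussion is just the multiplier on the number of data-loading threads:

```c
// inside train_detector() in src/detector.c -- approximate, not verbatim
args.threads = 3 * ngpus;       // default: 3 loader threads per GPU
//args.threads = 12 * ngpus;    // un-comment / raise this for very large images
//                              // or CPUs where OpenCV scales poorly
```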

kartur commented 6 years ago

How much memory do all of your images from the training dataset take?

The original pictures, where loading took multiple seconds, were about 5 GiB in total. The resized and compressed ones were about 180 MiB.

Did you compile OpenCV with AVX2 and Multi-threading? How did you install it?

I used the precompiled 3.4.0 for Windows. I'm not sure which options it comes with. Do you recommend compiling it myself for Windows?

Also check that the bottleneck is the CPU and not the HDD/SSD

I did try it with one large picture and GPU-Utilization was indeed better. I will check again how it behaves with all the pictures. But IIRC it didn't improve, even after 10K iterations. But I'm not sure anymore.

the Tesla P100 isn't as fast as a Tesla V100 with Tensor Cores enabled.

I know, but this is just for a side project and I have to deal with what is available to me. :) Btw. why does the makefile say the P100 doesn't support FP16? (screenshot: fp16)

args.threads = 2 * ngpus; and args.threads = 4 * ngpus; work slower than args.threads = 3 * ngpus;.

That's interesting. For me, I only tested 1, 3, 4, 16, 22, and 44 threads. Using 1 thread was very slow, but between 3, 4, 16, or 44 there is basically no difference. Well, maybe higher power consumption. :)

If I find more time I will try to benchmark a little more and post the results here. Thanks for your help!

AlexeyAB commented 6 years ago

@kartur

The original pictures, where loading took multiple seconds, were about 5 GiB in total

So you should have at least 5 GB of free CPU RAM during training.

I used the precompiled 3.4.0 for Windows.

It's ok. It uses all optimizations.

I did try it with one large picture and GPU-Utilization was indeed better. I will check again how it behaves with all the pictures. But IIRC it didn't improve, even after 10K iterations. But I'm not sure anymore.

That's interesting. For me, I only tested 1, 3, 4, 16, 22, and 44 threads. Using 1 thread was very slow, but between 3, 4, 16, or 44 there is basically no difference. Well, maybe higher power consumption. :)

It seems the bottleneck is on the HDD/SSD side. If you don't see a speedup after 10,000 iterations, then you should use resized, smaller images for training.

I know, but this is just for a side project and I have to deal with what is available to me. :) Btw. why does the makefile say the P100 doesn't support FP16?

Yes, the P100 (Pascal) supports FP16, and Maxwell does too. But the theoretical FP16 speedup is 2x for Pascal and 8x for Volta, so after the FP32->FP16->FP32 conversions there remains a 2-3x acceleration on Volta, while on Pascal it should be only a few %.

What acceleration can you get by using CUDNN_HALF=1 for training and detection on the P100?


There is a +30% acceleration (at a cost of 1-2% mAP) from INT8 quantization on Pascal GPUs for detection by using this repo: https://github.com/AlexeyAB/yolo2_light. Just add the flag -quantized at the end of the command, and add the input_calibration= param to your cfg-file, taking it from the corresponding cfg-file: https://github.com/AlexeyAB/yolo2_light/blob/3e081e82fe300ab0f097a5e26c95e4d028ac9882/bin/yolov3.cfg#L25