AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

Bad results with multi GPU training #1456

Closed by r0l1 6 years ago

r0l1 commented 6 years ago

After experimenting with the Yolov3 network for one week, I found a problem with multi GPU training. A network trained on a single GPU had much better detection results than the same network trained on multiple GPUs (3x Nvidia 1070).

My last test:

Any thoughts? Why does this happen? I tried it on different datasets, with both yolov3 and yolov3-spp.

Thanks!

AlexeyAB commented 6 years ago

It was tested with 4xGPU, so there may be some issue with 3xGPU.

r0l1 commented 6 years ago

I will post the avg-loss & mAP values tomorrow. I didn't change anything in the source code, and the Makefile was just adapted to:

GPU=1
CUDNN=1
CUDNN_HALF=0
OPENCV=0
AVX=0
OPENMP=1
LIBSO=0

I just tried it again with yolov3-spp, with width=256 and height=224 changed in the cfg-file. It works perfectly with a net trained on one GPU: the first results appear at 800 iterations, and I tested it up to 4000. With a net trained on 3 GPUs, I never get any results at all...

I should mention that the input images are usually smaller than 256 x 224 pixels... I am testing the results with OpenCV, using the same input values as the sample dnn object detector (except width & height)...

AlexeyAB commented 6 years ago

Try to train first 2000 iterations with learning_rate=0.001 and 1xGPU. Then continue training with learning_rate=0.00033 and 3xGPU.

So for the first 1000 iterations the learning rate will be calculated using the burn_in parameter. For the second 1000 iterations the learning rate will be equal to 0.001. After that, the learning rate will be learning_rate * GPUs = 0.00033 * 3 = 0.00099 ~= 0.001
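The arithmetic behind this suggestion can be sketched as follows. This is only an illustration, not darknet code: the function names are hypothetical, the multiplication by the GPU count follows the description above, and the warm-up curve assumes a polynomial ramp with `burn_in=1000` and `power=4`, which may differ from the values in your cfg-file.

```python
def burn_in_rate(iteration, base_lr=0.001, burn_in=1000, power=4.0):
    """Learning rate at a given iteration (assumed polynomial warm-up)."""
    if iteration < burn_in:
        # Ramp up from 0 towards base_lr during the burn-in phase.
        return base_lr * (iteration / burn_in) ** power
    # After burn-in the rate stays at the value from the cfg-file.
    return base_lr


def effective_rate(cfg_lr, gpus):
    """Rate applied during multi-GPU training: learning_rate * GPUs."""
    return cfg_lr * gpus


# Phase 1: one GPU, learning_rate=0.001 in the cfg-file.
for iteration in (100, 500, 1000, 2000):
    print(iteration, burn_in_rate(iteration))

# Phase 2: three GPUs, learning_rate=0.00033 in the cfg-file.
print(effective_rate(0.00033, gpus=3))
```

With these numbers the rate used in the 3xGPU phase stays close to the 0.001 used at the end of the 1xGPU phase, which is the point of lowering `learning_rate` in the cfg-file before continuing.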


If you want to compare results (mAP accuracy) with 1xGPU and 3xGPU, you should train 3x more iterations for 3xGPU (and sometimes with 3x lower learning_rate= in your cfg-file): https://github.com/AlexeyAB/darknet/issues/1165#issuecomment-414458078
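As a rough sketch of this rule of thumb, the helper below scales an iteration budget and, optionally, the learning rate for a given number of GPUs. The function name and the example budget of 4000 iterations are hypothetical; only the scaling factors come from the comment above.

```python
def scale_for_gpus(iterations_1gpu, cfg_lr, gpus, lower_lr=True):
    """Return (iterations, learning_rate) for a comparable multi-GPU run."""
    # Train `gpus` times more iterations than in the single-GPU run.
    iterations = iterations_1gpu * gpus
    # Sometimes the cfg learning rate is also divided by the GPU count.
    learning_rate = cfg_lr / gpus if lower_lr else cfg_lr
    return iterations, learning_rate


iterations, learning_rate = scale_for_gpus(4000, 0.001, gpus=3)
print(iterations, learning_rate)
```

Passing `lower_lr=False` keeps the learning rate from the cfg-file unchanged, matching the "sometimes" in the advice.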


Can you show avg-loss chart for 1 GPU and 3 GPU for the same cfg-file and same dataset?

What mAP can you get for 1000 and 2000 iterations for 3xGPUs?

> I will post the avg-loss & mAP values tomorrow.

Yes, then it will be clearer what happens.

r0l1 commented 6 years ago

> Try to train first 2000 iterations with learning_rate=0.001 and 1xGPU. Then continue training with learning_rate=0.00033 and 3xGPU.

Thank you @AlexeyAB. This was the problem. I trained several new networks over the last few hours and the problem is fixed now. This should be included in the README. If you agree, I'll prepare a pull request.

Edit: I noticed another small problem: bounding boxes are misaligned when the width & height in the cfg-file differ. I am not sure whether the problem lies in OpenCV or darknet... I'll open a new issue as soon as I find the cause...

AlexeyAB commented 6 years ago

@r0l1

> Thank you @AlexeyAB. This was the problem. I trained several new networks over the last few hours and the problem is fixed now. This should be included in the README. If you agree, I'll prepare a pull request.

Yes, I'll accept this pull request. Thank you!

r0l1 commented 6 years ago

Here you go #1466

AlexeyAB commented 6 years ago

@r0l1 Thanks. Also small clarification.

r0l1 commented 6 years ago

@AlexeyAB thank you for the hints! I'll try this soon and play with it... Currently building up a huge database and I'll start training and testing in a couple of weeks...

Off-topic: Would you be interested in giving a talk about darknet and related topics if you are ever in Munich, Germany? If so, please just e-mail me...

AlexeyAB commented 6 years ago

@r0l1 I do not plan to visit Germany in the near future, but I will keep it in mind. Thanks.