Bad results with multi GPU training

AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )

http://pjreddie.com/darknet/


Bad results with multi GPU training #1456

Closed r0l1 closed 6 years ago

r0l1 commented 6 years ago

After experimenting with the Yolov3 network for one week, I found several issues with multi GPU training. The trained network with a single GPU had way better detection results than the network trained with multiple GPUs (3x Nvidia 1070).

My last test:

yolov3-voc network
trained the first 1000 iterations with a single GPU
afterwards continued with 3 GPUs
test with 1000 iteration network: quite good results
all other networks: no results at all

Any thoughts? Why does this happen? I tried it on different datasets, also using yolov3 and yolov3-spp.

Thanks!

AlexeyAB commented 6 years ago

It was tested with 4 GPUs, so there may be an issue specific to 3 GPUs.

Can you show avg-loss chart for 1 GPU and 3 GPU for the same cfg-file and same dataset?
What mAP can you get for 1000 and 2000 iterations for 3xGPUs?
What params did you use in the Makefile?
Did you change anything in the source code?

r0l1 commented 6 years ago

I will post the avg-loss & mAP values tomorrow. I didn't change anything in the source code, and the only changes to the Makefile were:

GPU=1
CUDNN=1
CUDNN_HALF=0
OPENCV=0
AVX=0
OPENMP=1
LIBSO=0

I just tried it again with yolov3-spp and a changed width=256 & height=224. It works perfectly with a net trained on one GPU: first results at 800 iterations, tested up to 4000. If I test a net trained with 3 GPUs, I never get any results at all...

I should mention that the input images are usually smaller than 256 x 224 pixels... I am testing the results with OpenCV, using the same input values as the sample dnn object detector (except width & height)...

AlexeyAB commented 6 years ago

Try to train first 2000 iterations with learning_rate=0.001 and 1xGPU. Then continue training with learning_rate=0.00033 and 3xGPU.

So, for the first 1000 iterations the learning rate is calculated using the burn_in parameter. For the second 1000 iterations the learning rate equals 0.001. After that, with 3 GPUs, the effective learning rate is learning_rate * GPUs = 0.00033 * 3 = 0.00099 ~= 0.001
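The arithmetic above can be sketched in Python. This is only an illustration, not Darknet code: the function names are hypothetical, the multiplication by the GPU count follows the description in this comment, and the burn-in ramp is assumed to be a polynomial with exponent 4 (the exponent is exposed as a parameter because it is an assumption here).

```python
def effective_learning_rate(cfg_rate: float, num_gpus: int) -> float:
    """Rate actually applied when training on num_gpus GPUs.

    Assumption (from the comment above): the cfg value is multiplied
    by the number of GPUs.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return cfg_rate * num_gpus


def cfg_rate_for_target(target_rate: float, num_gpus: int) -> float:
    """cfg value to write so that the effective rate equals target_rate."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return target_rate / num_gpus


def burn_in_rate(cfg_rate: float, iteration: int,
                 burn_in: int = 1000, power: float = 4.0) -> float:
    """Rate during warm-up.

    Assumption: a polynomial ramp, cfg_rate * (iteration / burn_in) ** power,
    which reaches cfg_rate once iteration >= burn_in.
    """
    if iteration < burn_in:
        return cfg_rate * (iteration / burn_in) ** power
    return cfg_rate


# Phase 2 of the advice: learning_rate=0.00033 on 3 GPUs.
phase2 = effective_learning_rate(0.00033, 3)   # close to 0.001
# Going the other way: which cfg value gives an effective 0.001 on 3 GPUs?
suggested = cfg_rate_for_target(0.001, 3)      # close to 0.00033
```

Under these assumptions, dividing the single-GPU rate by the GPU count keeps the effective rate the same after the switch to multi-GPU training.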

If you want to compare results (mAP accuracy) between 1xGPU and 3xGPU, you should train for 3x more iterations with 3xGPU (and sometimes with a 3x lower learning_rate= in your cfg-file): https://github.com/AlexeyAB/darknet/issues/1165#issuecomment-414458078
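As a rough sketch of that rule of thumb, the helper below scales a single-GPU setup to a multi-GPU one. The names `TrainingPlan` and `comparable_multi_gpu_plan` are hypothetical, and the "sometimes lower the learning rate" part is modelled as an optional flag because the advice above does not say when it applies.

```python
from dataclasses import dataclass


@dataclass
class TrainingPlan:
    iterations: int            # total training iterations to run
    cfg_learning_rate: float   # value to put in learning_rate= in the cfg-file


def comparable_multi_gpu_plan(single_gpu_iterations: int,
                              single_gpu_rate: float,
                              num_gpus: int,
                              lower_rate: bool = True) -> TrainingPlan:
    """Scale a single-GPU plan as suggested above.

    Iterations are multiplied by the GPU count; the cfg learning rate is
    divided by the GPU count only when lower_rate is True.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    rate = single_gpu_rate / num_gpus if lower_rate else single_gpu_rate
    return TrainingPlan(iterations=single_gpu_iterations * num_gpus,
                        cfg_learning_rate=rate)


# Example: a 4000-iteration single-GPU run compared against 3 GPUs.
plan = comparable_multi_gpu_plan(4000, 0.001, 3)
```

Here `plan.iterations` is 12000 and `plan.cfg_learning_rate` is roughly 0.00033, matching the values discussed in this thread.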

Can you show avg-loss chart for 1 GPU and 3 GPU for the same cfg-file and same dataset?

What mAP can you get for 1000 and 2000 iterations for 3xGPUs?

> I will post the avg-loss & mAP values tomorrow.

Yes, then it will be clearer what happens.

r0l1 commented 6 years ago

> Try to train first 2000 iterations with learning_rate=0.001 and 1xGPU. Then continue training with learning_rate=0.00033 and 3xGPU.

Thank you @AlexeyAB . This was the problem. I trained several new networks over the last few hours and the problem is fixed now. This should be included in the README. If you agree, I'll prepare a pull request.

Edit: I noticed another small problem with bounding box misalignment when the width & height (in the cfg file) are different. I am not sure if the problem lies in OpenCV or darknet... I'll open a new issue as soon as I find the cause...

AlexeyAB commented 6 years ago

@r0l1

> Thank you @AlexeyAB . This was the problem. I trained several new networks over the last few hours and the problem is fixed now. This should be included in the README. If you agree, I'll prepare a pull request.

Yes, I'll accept this pull request. Thank you!

r0l1 commented 6 years ago

Here you go #1466

AlexeyAB commented 6 years ago

@r0l1 Thanks. Also, a small clarification:

Does reducing learning_rate help you to get good results?
Or does training for more iterations help you to get good results? Or both?

r0l1 commented 6 years ago

@AlexeyAB thank you for the hints! I'll try this soon and play with it... Currently building up a huge database and I'll start training and testing in a couple of weeks...

Off-topic: Would you be interested in giving a talk about darknet and co. if you are ever in Munich, Germany? If so, please just e-mail me...

AlexeyAB commented 6 years ago

@r0l1 I do not plan to visit Germany in the near future, but I will keep it in mind. Thanks.