Closed by r0l1 6 years ago
It was tested with 4x GPUs, so there may be some issue with 3x GPUs.
Can you show avg-loss chart for 1 GPU and 3 GPU for the same cfg-file and same dataset?
What mAP can you get for 1000 and 2000 iterations for 3xGPUs?
What params did you use in the Makefile?
Did you change anything in the source code?
I will post the avg-loss & mAP values tomorrow. I didn't change anything in the source code, and the Makefile was just adapted to:
GPU=1
CUDNN=1
CUDNN_HALF=0
OPENCV=0
AVX=0
OPENMP=1
LIBSO=0
I just tried it again with yolov3-spp with a changed width=256 & height=224. It works perfectly with a net trained on one GPU: first results at 800 iterations, tested up to 4000. If I test a net trained with 3 GPUs, I never get any results at all...
I must mention that the input images are most of the time smaller than 256 x 224 pixels... I am testing the results with OpenCV, with the same input values as the sample dnn object detector (except width & height)...
Try to train the first 2000 iterations with learning_rate=0.001 and 1x GPU.
Then continue training with learning_rate=0.00033 and 3x GPU.
So, for the first 1000 iterations the learning rate will be calculated using the burn_in parameter. For the second 1000 iterations the learning rate will be equal to 0.001. And then the learning rate will be learning_rate * GPUs = 0.00033 * 3 = 0.00099 ~= 0.001.
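The schedule described above can be sketched in Python. This is a simplified model, not the exact darknet source formula: it assumes the usual cfg defaults `burn_in=1000` and `power=4`, and applies the multi-GPU scaling described in the comment above.

```python
def effective_lr(iteration, base_lr=0.001, burn_in=1000, power=4, ngpus=1):
    """Rough model of darknet's learning-rate warm-up (burn_in) and the
    multi-GPU scaling described above; not the exact source formula."""
    if iteration < burn_in:
        # During burn_in the rate ramps up: base_lr * (iteration / burn_in)^power
        return base_lr * (iteration / burn_in) ** power
    # After burn_in, the rate is scaled by the number of GPUs
    return base_lr * ngpus

# Phase 1 (1x GPU, learning_rate=0.001):
print(effective_lr(500))    # mid burn-in, well below 0.001
print(effective_lr(1500))   # 0.001
# Phase 2 (3x GPU, learning_rate=0.00033): 0.00033 * 3 = 0.00099 ~= 0.001
print(effective_lr(3000, base_lr=0.00033, ngpus=3))  # 0.00099
```

This is why continuing with `learning_rate=0.00033` on 3 GPUs keeps the effective rate roughly where the single-GPU run left off.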
If you want to compare results (mAP accuracy) between 1x GPU and 3x GPU, you should train 3x more iterations for 3x GPU (and sometimes with a 3x lower learning_rate in your cfg-file): https://github.com/AlexeyAB/darknet/issues/1165#issuecomment-414458078
> Can you show avg-loss chart for 1 GPU and 3 GPU for the same cfg-file and same dataset?
> What mAP can you get for 1000 and 2000 iterations for 3xGPUs?
> I will post the avg-loss & mAP values tomorrow.
Yes, then it will be clearer what happens.
> Try to train first 2000 iterations with learning_rate=0.001 and 1xGPU. Then continue training with learning_rate=0.00033 and 3xGPU.
Thank you @AlexeyAB. This was the problem. I trained several new networks over the last few hours, and the problem is fixed now. This should be included in the README. If you agree, I'll prepare a pull request.
Edit: I noticed another small problem with bounding box misalignment when the width & height (in the cfg file) differ. I am not sure if the problem lies in OpenCV or darknet... I'll open a new issue as soon as I find the cause...
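For what it's worth, one common cause of such misalignment (an assumption here, not a confirmed diagnosis for this case) is a mismatch between the preprocessing the network saw (plain stretch vs. aspect-preserving letterbox) and the mapping used when drawing boxes. A minimal sketch of the two mappings for a non-square net size like 256x224:

```python
def box_from_stretch(rel_x, rel_y, img_w, img_h):
    """Network input was a plain resize: relative coords map directly."""
    return rel_x * img_w, rel_y * img_h

def box_from_letterbox(rel_x, rel_y, img_w, img_h, net_w, net_h):
    """Network input was letterboxed (aspect ratio kept, padded borders):
    undo the relative padding before scaling back to image pixels."""
    scale = min(net_w / img_w, net_h / img_h)
    pad_x = (net_w - img_w * scale) / 2 / net_w   # relative horizontal pad
    pad_y = (net_h - img_h * scale) / 2 / net_h   # relative vertical pad
    x = (rel_x - pad_x) / (1 - 2 * pad_x) * img_w
    y = (rel_y - pad_y) / (1 - 2 * pad_y) * img_h
    return x, y

# Same relative detection (0.5, 0.25), same 640x480 image, 256x224 net:
print(box_from_stretch(0.5, 0.25, 640, 480))            # (320.0, 120.0)
print(box_from_letterbox(0.5, 0.25, 640, 480, 256, 224))  # ~(320.0, 100.0)
```

If the post-processing assumes the wrong preprocessing, the y coordinates shift (120 vs. 100 here), which would look exactly like a bounding-box misalignment that grows with the aspect-ratio mismatch.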
@r0l1
> Thank you @AlexeyAB . This was the problem. I trained several new networks over the last few hours and the problem is fixed now. This should be included into the README. If you agree, I'll prepare a pull request.
Yes, I'll accept this pull request. Thank you!
Here you go #1466
@r0l1 Thanks. Also a small clarification: did the lower learning_rate help you to get good results?
@AlexeyAB thank you for the hints! I'll try this soon and play with it... Currently I'm building up a huge database, and I'll start training and testing in a couple of weeks...
Off-topic: Would you be interested in giving a talk about darknet and related projects if you are ever in Munich, Germany? Please just e-mail me...
@r0l1 In the near future I do not plan to visit Germany, but I will keep it in mind. Thanks.
After experimenting with the Yolov3 network for one week, I found several issues with multi-GPU training. The network trained with a single GPU had far better detection results than the network trained with multiple GPUs (3x Nvidia 1070).
My last test:
Any thoughts on why this happens? I tried it on different datasets, also using yolov3 and yolov3-spp.
Thanks!