AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Reproduce author's yolov3 training results #2199

Open MickaMickaMicka opened 5 years ago

MickaMickaMicka commented 5 years ago

Sorry, this is not really an issue about this darknet fork, but from what I've seen, you are quite interested in helping the DNN community here, even with questions like this.

I am having a hard time training a yolo v3 network, both when training from scratch and when fine-tuning the pretrained yolov3.

Because of my 1080 Ti GPU I can only train with a batch size of 64, and I tried subdivisions of 32 and 16 (but 16 gave a CUDA error for larger image resize dimensions).

Could this be the reason I get a much worse mAP than the original yolo author, or am I probably doing something else very wrong? What is the best advice to improve training results on the COCO dataset? Just getting more GPUs to train with bigger batch sizes?

Finally, I would like to add new classes to the yolo network (and/or replace other classes) without getting much worse results for the pretrained classes.

dreambit commented 5 years ago

As far as I know, batch size and subdivision only affect performance, not accuracy; they are explained here: https://github.com/pjreddie/darknet/issues/224#issuecomment-335771840. Try to set batch size and subdivision to 64 and increase the network size (width, height). Also keep this in mind: https://github.com/AlexeyAB/darknet#when-should-i-stop-training
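For reference, these values are set in the [net] section at the top of the cfg file; a minimal sketch (the width/height here are illustrative, not a recommendation):

```ini
[net]
# one weight update uses `batch` images in total;
# mini_batch = batch / subdivisions images are pushed through the GPU at once
batch=64
subdivisions=64
# network input resolution; each must be a multiple of 32
width=608
height=608
```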

MickaMickaMicka commented 5 years ago

Thank you @dreambit. According to some comments, the subdivision does have an influence on the training result (accuracy). I'll try to do some research by switching off "random" (to achieve repeatability) and experimenting with different subdivisions when I can find some time.

Do you know whether in darknet, RAM will be saved if layers are frozen, since frozen layer's activations do not need to be stored during the forward step, because they are not used during backward propagation?

AlexeyAB commented 5 years ago

@MickaMickaMicka

Do you know whether in darknet, RAM will be saved if layers are frozen, since frozen layer's activations do not need to be stored during the forward step, because they are not used during backward propagation?

The same amount of CPU & GPU RAM will be allocated in any case.

Maybe later I will add such an optimization, so that layers before stopbackward=1 will not allocate RAM for the backward and update steps.

dreambit commented 5 years ago

@MickaMickaMicka, I think you are right: https://github.com/pjreddie/darknet/issues/224#issuecomment-408721048

By using a smaller subdivision, the mini-batch size for computing gradient increases. Hence, the computed gradient based on a larger mini-batch size gives a better optimization. I guess using a smaller mini-batch size will result in a local optimum and thus decrease accuracy.

I'm interested in @AlexeyAB's opinion about that: does mini-batch size play an important role for a big dataset with > 20k images?

Is it good to have a smaller mini-batch size (batch size, subdivision = 64) in favour of a larger network size if we are limited in GPU memory? I used to increase the network size and set batch size and subdivision to 64, but now I am not sure...
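The batch/subdivision relationship being discussed can be sketched as follows (a simplified illustration; Darknet does this internally):

```python
def mini_batch_size(batch: int, subdivisions: int) -> int:
    """Number of images pushed through the GPU at once."""
    return batch // subdivisions

# One weight update always consumes `batch` images in total;
# `subdivisions` only splits that update into more, smaller GPU passes.
print(mini_batch_size(64, 16))  # -> 4
print(mini_batch_size(64, 64))  # -> 1
```

So a larger subdivisions value lowers peak GPU memory at the cost of a smaller mini-batch, which matters for batch normalization.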

And one more question about multi-GPU training: if I have 4 GPUs with 6 GB memory on each, does it mean that I can increase the network size and have a larger mini-batch without running out of memory? Thanks

AlexeyAB commented 5 years ago

@dreambit

The accuracy is affected by: network size, batch size, mini_batch size.

In general, the larger each of these is, the better the accuracy.

About mini_batch size and batch-normalization layer: https://papers.nips.cc/paper/6790-batch-renormalization-towards-reducing-minibatch-dependence-in-batch-normalized-models.pdf

For small minibatches, the estimates of the mean and variance become less accurate. These inaccuracies are compounded with depth, and reduce the quality of resulting models


In most cases, the network size (width, height in cfg) is more important for accuracy than mini_batch size, especially for small objects and if mini_batch >= 4.

And one more question about multi-GPU training: if I have 4 GPUs with 6 GB memory on each, does it mean that I can increase the network size and have a larger mini-batch without running out of memory? Thanks

No, mini_batch will be the same, because cuDNN can't parallelize a single forward/backward pass across GPUs; mini_batch = batch/subdivisions. But when you use 4x GPUs, the effective batch size will be 4x larger, so it can affect the final accuracy. In general, the larger the batch, the better, especially for batch normalization.

So if you use 1 GPU, then you can try to use batch=256 subdivisions=256 or batch=256 subdivisions=64

Because yolov3.weights was trained by using 4 x GPUs with batch=64 subdivisions=16.
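The arithmetic above can be sketched like this (a simplified model of the multi-GPU behaviour described: mini_batch stays per-GPU, while the effective batch scales with GPU count):

```python
def multi_gpu_batch(batch: int, subdivisions: int, n_gpus: int):
    """Return (effective batch per weight update, per-pass mini_batch)."""
    mini_batch = batch // subdivisions  # unchanged by the number of GPUs
    return batch * n_gpus, mini_batch

# The yolov3.weights setup mentioned above: 4 GPUs, batch=64, subdivisions=16
print(multi_gpu_batch(64, 16, 4))   # -> (256, 4)
# Which is why batch=256 subdivisions=64 on a single GPU approximates it
print(multi_gpu_batch(256, 64, 1))  # -> (256, 4)
```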

dreambit commented 5 years ago

@AlexeyAB, thank you so much for this great explanation.

I am using default yolov3.cfg, with batch=64 subdivisions=32 width=736 height=736

and calculated anchors for our dataset.

I trained it on p2.xlarge (NVIDIA K80, 12 GiB memory) for 22 hours and stopped at iteration 2100; avg loss was 2.15 (I know that it must be less than 1 and that there should be ~4k iterations in total); darknet.exe detector map shows (mAP) = 0.884273, average IoU = 74.99 %

I am going to detect small and large objects. I don't know how small an object should be to be called small, but the anchors are: anchors = 33, 41, 58, 51, 44, 89, 82, 74, 67,124, 111,105, 102,189, 161,143, 190,231

I will continue training the model for 1000 (maybe 2000) more iterations, saving the weights every 100 iterations, and compare mAP to find the best one.

I'm considering two options from AWS: 1) p2.xlarge with a K80 12 GB on board 2) p3.2xlarge with a Tesla V100 16 GB

What is your recommendation about network size, batch, subdivisions for p2.xlarge and p3.2xlarge with more gpu memory on board?

How much faster will p3.2xlarge be compared to p2.xlarge with CUDNN_HALF=1? p3 is three times as expensive as p2 ))

Input image size for training and inference is 720x526.

Thanks in advance 👍

AlexeyAB commented 5 years ago

@dreambit

Did you compile Darknet with GPU=1 CUDNN=1 OPENCV=1 and run with -dont_show flag?

How much faster will p3.2xlarge be compared to p2.xlarge with CUDNN_HALF=1?

It depends on the mini_batch size. I think ~3x faster with CUDNN_HALF=1 OPENCV=1 and ~2x faster with CUDNN_HALF=0. But CUDNN_HALF=1 will be activated only after 3000 iterations (burn_in*3).

For p3.2xlarge you should compile Darknet with OPENCV=1 to avoid bottleneck on CPU (data augmentation).

anchors = 33, 41, 58, 51, 44, 89, 82, 74, 67,124, 111,105, 102,189, 161,143, 190,231

Did you calculate anchors for -width 736 -height 736 ? This is close to the optimal network size.

Input image size for training and inference is 720x526.

In this case you could use width=704 height=512 in the cfg-file and for anchors. You can use a non-square network size in this repository. And it isn't necessary to use a network size larger than the image size.
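Picking those dimensions follows from the divisibility rule discussed later in the thread (the network size must be a multiple of 32): rounding 720x526 down gives 704x512, rounding up gives 736x544. A sketch:

```python
def snap_to_stride(x: int, stride: int = 32, up: bool = False) -> int:
    """Round a dimension to a multiple of the network stride (32 for yolov3)."""
    q = (x + stride - 1) // stride if up else x // stride
    return q * stride

# 720x526 input images:
print(snap_to_stride(720), snap_to_stride(526))                    # -> 704 512
print(snap_to_stride(720, up=True), snap_to_stride(526, up=True))  # -> 736 544
```

Whichever size is chosen, the anchors should then be recalculated for that width/height (this repository's `calc_anchors` mode does this for a given -width and -height).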


What is your recommendation about network size, batch, subdivisions for p2.xlarge and p3.2xlarge with more gpu memory on board?

In your case, I recommend using width=704 height=512 in the cfg and for anchors. Use random=1 batch=64 and subdivisions as low as possible (try setting width=960 height=704 random=0 and subdivisions=4,8,10,16,24,32,48,64), then keep the lowest value that doesn't cause an Out of memory error and train with width=704 height=512 random=1

dreambit commented 5 years ago

Did you compile Darknet with GPU=1 CUDNN=1 OPENCV=1 and run with -dont_show flag?

Unfortunately I forgot to set OPENCV=1; in short, I did it on my laptop but not in the docker image. Does it increase performance on the K80, or does it only make sense on Volta GPUs?

But CUDNN_HALF=1 will activated only after 3000 iterations (burn_in*3)

Since I have classes = 1, total iterations needed ~ 4000, the Tesla V100 will not be used at 100 percent, only for the last 1000 :(

Did you calculate anchors for -width 736 -height 736 ? This is close to the optimal network size.

yes

You can use a non-square network size in this repository.

But still must be divisible by 32, right?

In this case you could use width=704 height=512 in cfg-file and for anchors.

Maybe better 736x544? If the network size is width=704 height=512 and the image size is 720x526, the image will be resized to fit into the network, won't it?

In your case, I recommend using width=704 height=512 in the cfg and for anchors. Use random=1 batch=64 and subdivisions as low as possible (try setting width=960 height=704 random=0 and subdivisions=4,8,10,16,24,32,48,64), then keep the lowest value that doesn't cause an Out of memory error and train with width=704 height=512 random=1

So if you use 1 GPU, then you can try to use batch=256 subdivisions=256 or batch=256 subdivisions=64

What about batch=256 and subdivisions=4,8,10,16,24,32,48,64? There is one GPU.

What if I run training on my laptop with GPU=0, width=960 height=704 random=0 and subdivisions=4,8,10,16,24,32,48,64; maybe the total RAM used ~ the GPU memory used with GPU=1? In order to calculate the required memory.

@AlexeyAB thanks for your help 👍

AlexeyAB commented 5 years ago

@dreambit

Does it increase performance on the K80, or does it only make sense on Volta GPUs?

I think it isn't necessary for K80.

Since I have classes = 1, total iterations needed ~ 4000, the Tesla V100 will not be used at 100 percent, only for the last 1000 :(

Yes.

But still must be divisible by 32, right?

Yes.

Maybe better 736x544? If the network size is width=704 height=512 and image size is 720x526, the image will be resized to fit into the network, won't it?

Sometimes a little bit higher network resolution gives better accuracy, but sometimes a little bit lower network resolution gives better accuracy.

What about batch=256 and subdivisions=4,8,10,16,24,32,48,64? There is one GPU.

Yes, you can try to use batch=256.

What if I run training on my laptop with GPU=0, width=960 height=704 random=0 and subdivisions=4,8,10,16,24,32,48,64; maybe the total RAM used ~ the GPU memory used with GPU=1? In order to calculate the required memory.

This is not a very accurate way. You can try it, but run width=960 height=704 random=0 on the GPU first, and only then run width=704 height=512 random=1

dreambit commented 5 years ago

@AlexeyAB, thanks, and one last question about subdivisions: you said to use one of the subdivisions (4,8,10,16,24,32,48,64), but with batch size 64 and subdivision 10, 64 / 10 != integer? Is it okay? Oo :)

AlexeyAB commented 5 years ago

@dreambit

64 / 10 != integer? Is it okay? Oo :)

It is OK. Actually batch=60 subdivisions=10 and mini_batch=6 will be used; it will be adjusted automatically.


Also you can set batch=70 subdivisions=10 in cfg.
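The automatic adjustment described above can be sketched as follows (an illustration inferred from the values given, not the repository's exact code):

```python
def adjust_batch(batch: int, subdivisions: int):
    """Truncate batch to a multiple of subdivisions, as described above."""
    mini_batch = batch // subdivisions
    return mini_batch * subdivisions, mini_batch  # (effective batch, mini_batch)

print(adjust_batch(64, 10))  # -> (60, 6), matching the answer above
print(adjust_batch(70, 10))  # -> (70, 7), divides evenly so it is used as-is
```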