AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Questions Retraining YOLO #617

Open pnambiar opened 6 years ago

pnambiar commented 6 years ago

I have followed the procedure for 'How to train (to detect your custom objects):' and I'm able to get very good detection results. But I am having trouble understanding how the 'retraining' process works. I am assuming darknet19_448.conv.23 is the YOLO model trained on ImageNet. When we train to detect custom objects, are we training all the layers or just the last layers? If so, what are the learning rates, and where can I get this information?

AlexeyAB commented 6 years ago
  1. We train all the layers, but the initial weights are loaded from darknet53.conv.74. For YOLOv3 we use darknet53.conv.74 instead of darknet19_448.conv.23: https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects (an example training command is shown after this list)

  2. Learning rates - speed of training. https://www.quora.com/What-is-the-learning-rate-in-neural-networks
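
For reference, the basic training command from that README looks roughly like this (data/obj.data and yolo-obj.cfg follow the README's example setup - substitute your own files):

    darknet.exe detector train data/obj.data yolo-obj.cfg darknet53.conv.74

On Linux, use ./darknet instead of darknet.exe.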

pnambiar commented 6 years ago

Thanks very much for the clarification. My question: are the learning rates the same for all the layers? Are all the layers trained uniformly, then? Is it possible to fine-tune just the final layer by keeping the learning rate of the earlier layers at 0? Does that give better results?

judwhite commented 6 years ago

@AlexeyAB Are there times when you should start with a new weights file?

Is it okay to keep training from an existing weights file after changing the classes and anchors?

I had a weights file that was working well, then went from 38 to 50 classes. I updated all the relevant fields (classes, filters, and anchors in yolo.cfg, and classes in obj.data). When I re-ran training, the loss decreased at first, but after a while it was all nan's. Any help appreciated, thanks.
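
For reference, in a YOLOv3-style cfg the class-dependent fields that need updating look roughly like this (illustrative values only; filters = (classes + 5) * 3 in each [convolutional] layer directly before a [yolo] layer, and the same edit is needed for every [yolo] block):

    # before: classes=38, filters=(38+5)*3=129
    # after:  classes=50, filters=(50+5)*3=165
    [convolutional]
    filters=165

    [yolo]
    classes=50
    # anchors=... also updated here when recalculated for the new dataset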

AlexeyAB commented 6 years ago

@judwhite The pre-trained weights depend only slightly on width and height. They also depend on the filters in the first 74 layers if the darknet53.conv.74 file is used.

In your case with nans:

AlexeyAB commented 6 years ago

@pnambiar

The learning rate is the same for all the layers, and all the layers are trained uniformly.

https://github.com/AlexeyAB/darknet#how-to-improve-object-detection

To speed up training (with some decrease in detection accuracy), do fine-tuning instead of transfer learning: set the parameter stopbackward=1 in one of the penultimate convolutional layers before the 1st [yolo]-layer, for example here: https://github.com/AlexeyAB/darknet/blob/0039fd26786ab5f71d5af725fc18b3f521e7acfd/cfg/yolov3.cfg#L598
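
A minimal sketch of what that looks like in the cfg (the surrounding layer parameters are illustrative, not copied from yolov3.cfg):

    [convolutional]
    batch_normalize=1
    size=3
    stride=1
    pad=1
    filters=1024
    activation=leaky
    # stop gradient propagation here: layers before this one keep their
    # pre-trained weights, and only the layers after it are updated
    stopbackward=1

This trades some final accuracy for faster training, since most of the network stays frozen.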

judwhite commented 6 years ago

@AlexeyAB When I start over (modifying classes, filters, and anchors only - height/width are the same) it works, but if I use an old weights file created with fewer classes/different anchors, I run into the nan issue after some iterations.

I double checked the config files. I created a template to update these files automatically (yolo.cfg, obj.data, train.txt) so I don't have copy/paste issues. 😄

What info would you need if I'm able to reproduce the nan issue I'm seeing? I assume the old and new yolo cfg, obj.data, train.txt, obj.names, the previous weights file, and the new training set. Anything else? Is there any debugging I can do on my side?

Please let me know if you'd prefer a new Issue, I thought this one was okay because it's about retraining YOLO. Thanks.

AlexeyAB commented 6 years ago

@judwhite You can compress and drag-and-drop into your message all the files required for training, if it is not very big ) After how many iterations do all the "nan" occur? Do you get all nan - is even the avg loss nan?


When I start over (modifying classes, filters, and anchors only - height/width are the same) it works, but if I use an old weights file created with fewer classes/different anchors, I run into the nan issue after some iterations.

Do you use whole old weights such as yolo-obj_10000.weights or partially prepared pre-trained weights such as darknet53.conv.74, which can be created using these commands? https://github.com/AlexeyAB/darknet/blob/4d9a2bdac688f9c949b304dde8188a40efce1b49/build/darknet/x64/partial.cmd#L9-L30

judwhite commented 6 years ago

@AlexeyAB It's 300MB and the weights file is about the same. I have hosting, so it's no big deal; I can put the training set/weights file up there.

After how many iterations do all the "nan" occur?

Not sure. I started it with an old 3000-iteration weights file (trained with batch=64, subdivisions=32), and when I came back it was in the 5000s spitting out nans.

Do you get all nan - is even the avg loss nan?

Yes

Do you use whole old weights such as yolo-obj_10000.weights or partially prepared pre-trained weights

Neither, I started from random to get to 3000, then used my 3000 once I had more labeled data.

AlexeyAB commented 6 years ago

@judwhite

Neither, I started from random to get to 3000, then used my 3000 once I had more labeled data.

Do you mean that?

  1. You ran training with darknet detector train obj.data yolo-obj.cfg and trained for 3000 iterations
  2. Then stopped training and changed classes and anchors in yolo-obj.cfg?
  3. Then ran darknet detector train obj.data yolo-obj.cfg backup/yolo-obj_3000.weights

This isn't the correct way. To use pre-trained weights you should keep only those layers that are the same in both models, using the partial command, for example:

  1. darknet.exe partial cfg/yolo-obj.cfg backup/yolo-obj_3000.weights yolo-obj.conv.23 23
  2. And then train darknet detector train obj.data yolo-obj.cfg yolo-obj.conv.23
judwhite commented 6 years ago

@AlexeyAB Yes that's exactly what I was doing, thank you.

What's the significance of yolo-obj.conv.23 and 23? I'm using YOLOv3.

AlexeyAB commented 6 years ago

@judwhite 23 is the number of layers that you want to keep in the yolo-obj.conv.23 file. For YOLOv3 you should use 105 - i.e. 105 layers [0 - 104] - because convolutional layer 105 depends on the number of classes (its filters param depends on the number of classes).

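For YOLOv3 the same two-step procedure would then presumably become (keeping the file names used earlier in this thread; yolo-obj.conv.105 is just an example output name):

    darknet.exe partial cfg/yolo-obj.cfg backup/yolo-obj_3000.weights yolo-obj.conv.105 105
    darknet.exe detector train obj.data yolo-obj.cfg yolo-obj.conv.105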

MyVanitar commented 6 years ago

@AlexeyAB

Hi Alex,

If you could provide a short history of each commit and some description of it, it would help us very much to understand what you do and to learn something from you.

AlexeyAB commented 6 years ago

@VanitarNordic Hi,

There are 355 commits ahead of the original repo; I don't remember all of them :) There is some change log here: https://github.com/AlexeyAB/darknet/issues/529#issuecomment-377204382

MyVanitar commented 6 years ago

@AlexeyAB

No, I don't mean previous commits; I mean starting from now :-)

AlexeyAB commented 6 years ago

@VanitarNordic I will think about it for major fixes; maybe I will create a new Issue with an Enhancement label for each major fix.

Usually these are minor fixes that are described in the commit message, so it would be too expensive to describe them somewhere else.

judwhite commented 6 years ago

@AlexeyAB I came across an issue today where I started getting nan's during training. I stepped back, and the 1700-2400 weights files produced nan from the very first iterations. I restarted from 1600 and everything was okay, and it has now progressed past 2400 again with everything still okay. No change to training data or configuration files. Let me know if you'd like to look at any of the weights files or other data.

AlexeyAB commented 6 years ago

@judwhite Yes, this happens rarely.