pnambiar opened 6 years ago
We are training all the layers, but initially the weights are loaded from darknet53.conv.74. For YOLOv3 we use darknet53.conv.74 instead of darknet19_448.conv.23.
https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects
Learning rates - speed of training. https://www.quora.com/What-is-the-learning-rate-in-neural-networks
Thanks very much for the clarification. My questions: Are the learning rates the same for all the layers? Are all the layers being trained uniformly, then? Is it possible to fine-tune just the final layer by setting the learning rate of the earlier layers to 0? Does that give better results?
@AlexeyAB Are there times when you should start with a new weights file?
Is it okay to update an existing setup? I had a weights file that was working well, then went from 38 to 50 classes. I updated all the relevant fields (classes, filters, and anchors in yolo.cfg, and classes in obj.data). When I re-ran it, the loss decreased at first, but after a while it was all NaNs. Any help appreciated, thanks.
@judwhite
Pre-trained weights depend only slightly on width and height, and they depend on the filters in the first 74 layers if the file darknet53.conv.74 is used.
In your case with NaNs:
@pnambiar
The learning rate is the same for all the layers, and all layers are trained uniformly.
https://github.com/AlexeyAB/darknet#how-to-improve-object-detection
To speed up training (with some decrease in detection accuracy), do fine-tuning instead of transfer learning: set the param stopbackward=1 in one of the penultimate convolutional layers, before the 1st [yolo] layer, for example here: https://github.com/AlexeyAB/darknet/blob/0039fd26786ab5f71d5af725fc18b3f521e7acfd/cfg/yolov3.cfg#L598
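As a sketch, the cfg change could look like the fragment below. The layer parameters shown are illustrative; use the values already present at that position in your own cfg and only add the stopbackward line:

```ini
# example [convolutional] section near the end of the backbone,
# before the first [yolo] layer (parameter values are illustrative)
[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=1024
activation=leaky
stopbackward=1   ; gradients stop here, so all earlier layers are frozen (fine-tuning)
```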
@AlexeyAB When I start over (modifying classes, filters, and anchors only; height/width are the same) it works, but if I use an old weights file created with fewer classes/different anchors, I run into the NaN issue after some iterations.
I double checked the config files. I created a template to update these files automatically (yolo, obj.data, train.txt) so I don't have copy/paste issues. 😄
What info would you need if I'm able to reproduce the nan issue I'm seeing? I assume the old and new yolo cfg, obj.data, train.txt, obj.names, the previous weights file, and the new training set. Anything else? Is there any debugging I can do on my side?
Please let me know if you'd prefer a new Issue, I thought this one was okay because it's about retraining YOLO. Thanks.
@judwhite You can compress and drag-and-drop all the files required for training into your message, if it is not very big :) After how many iterations do all the NaNs occur? Do you get all NaNs; is even the avg loss NaN?
When I start over (modifying classes, filters, anchors only - height/width are the same) it works, but if I use an old weights file created with less classes/different anchors I run into the nan issue after some iterations.
Do you use the whole old weights file, such as yolo-obj_10000.weights, or partially prepared pre-trained weights, such as darknet53.conv.74, as can be created using these commands? https://github.com/AlexeyAB/darknet/blob/4d9a2bdac688f9c949b304dde8188a40efce1b49/build/darknet/x64/partial.cmd#L9-L30
@AlexeyAB It's 300MB, and the weights file is about the same. I have hosting, so it's no big deal; I can put the training set and weights file up there.
After how many iterations do all the "nan" occur?
Not sure. I started it with an old 3000-iteration weights file (using batch=64, subdivisions=32), and when I came back it was in the 5000s spitting out NaNs.
Do you get all nan, even avg loss is nan?
Yes
Do you use whole old weights such as yolo-obj_10000.weights or partially prepared pre-trained weights
Neither; I started from random weights to get to 3000, then used my 3000-iteration weights once I had more labeled data.
@judwhite
Neither, I started from random to get to 3000, then used my 3000 once I had more labeled data.
Do you mean this?
darknet detector train obj.data yolo-obj.cfg
then, after training 3000 iterations:
darknet detector train obj.data yolo-obj.cfg backup/yolo-obj_3000.weights
This isn't the correct way. To use pre-trained weights, you should keep only those layers that are the same in both models, using the partial command, for example:
darknet.exe partial cfg/yolo-obj.cfg backup/yolo-obj_3000.weights yolo-obj.conv.23 23
darknet detector train obj.data yolo-obj.cfg yolo-obj.conv.23
@AlexeyAB Yes that's exactly what I was doing, thank you.
What's the significance of yolo-obj.conv.23 and 23? I'm using YOLOv3.
@judwhite 23 is the number of layers that you want to keep in the yolo-obj.conv.23 file.
For YOLOv3 you should use 105, i.e. 105 layers [0 - 104], because convolutional layer 105 depends on the number of classes (its filters param depends on the number of classes):
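The dependence on the class count comes from the filters rule for the convolutional layers that feed each [yolo] layer. A minimal sketch, assuming the standard YOLOv3 convention of 3 anchor masks per detection scale:

```python
# YOLOv3: per anchor mask, each [yolo] layer predicts 4 box coords + 1 objectness
# score + one confidence per class, so the conv layer just before it needs
# filters = (classes + 5) * masks. This is why those final layers cannot be
# reused unchanged when the class count changes.
def yolo_v3_filters(classes, masks=3):
    return (classes + 5) * masks

print(yolo_v3_filters(38))  # -> 129 (old 38-class model)
print(yolo_v3_filters(50))  # -> 165 (new 50-class model)
```

With 38 vs. 50 classes the filter counts differ (129 vs. 165), so the partial command above strips those incompatible layers and keeps only the shared backbone.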
@AlexeyAB
Hi Alex,
if you could provide a short history of each commit with some description, it would help us very much to understand what you do, and we could learn something from you.
@VanitarNordic Hi,
There are 355 commits ahead of the original repo; I don't remember all of them :) There is some change log here: https://github.com/AlexeyAB/darknet/issues/529#issuecomment-377204382
@AlexeyAB
No, I don't mean previous commits; I mean starting from now :-)
@VanitarNordic I will think about it for major fixes; maybe I will create a new Issue with the Enhancement label for each major fix.
Usually there are minor fixes that are described in the commit message, so it would be too expensive to describe them somewhere else.
@AlexeyAB I came across an issue today where I started getting NaNs during training. I stepped back: resuming from the 1700-2400 checkpoints produced NaNs from the first iterations, but resuming from 1600 was okay, and it has now progressed past 2400 again with everything still fine. No change to the training data or configuration files. Let me know if you'd like to look at any of the weights files or other data.
@judwhite Yes, this happens rarely.
I have followed the procedure in 'How to train (to detect your custom objects):' and I'm able to get very good detection results. But I am having trouble understanding how the 'retraining' process works. I am assuming darknet19_448.conv.23 is the YOLO model trained on ImageNet. When we are training to detect custom objects, are we training all the layers or just the last layers? If so, what are the learning rates, and where can I get this information?