experiencor / keras-yolo2

Easy training on custom datasets. Various backends (MobileNet and SqueezeNet) supported. A YOLO demo that detects raccoons, running entirely in the browser, is accessible at https://git.io/vF7vI (not on Windows).
MIT License

Question on pretrained weights #126

Closed · thorstenwagner closed this issue 6 years ago

thorstenwagner commented 6 years ago

Hi @experiencor ,

I have a question regarding the pretrained weights loaded in the backend. The one layer doing the classification is randomly initialized, and during training the magnitude of its updates should be large. The pretrained weights of the backend layers (e.g. Full Yolo) are not frozen, so I'm wondering why the backend weights are not destroyed by the large updates in the classification layer. Moreover, I have to rename the first layer in the backend because I'm using single-channel images, which means there are no pretrained weights available for that first layer.
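
For reference, this is roughly what I mean by renaming (a minimal, self-contained sketch; the tiny two-layer backend and the name `conv_1_gray` are just placeholders, not the real Full Yolo graph):

```python
from keras.layers import Conv2D, Input
from keras.models import Model

def build(channels, first_name):
    inp = Input(shape=(416, 416, channels))
    x = Conv2D(32, (3, 3), padding='same', name=first_name, use_bias=False)(inp)
    x = Conv2D(64, (3, 3), padding='same', name='conv_2', use_bias=False)(x)
    return Model(inp, x)

rgb  = build(3, 'conv_1')        # stand-in for the pretrained 3-channel backend
gray = build(1, 'conv_1_gray')   # single-channel variant with the first layer renamed

rgb.save_weights('backend.h5')
# by_name=True restores 'conv_2' (names match) and silently skips the renamed
# first layer, which therefore stays randomly initialized
gray.load_weights('backend.h5', by_name=True)
```

If the names matched, `by_name=True` would instead choke on the shape mismatch of the first kernel (1 vs. 3 input channels), so the rename both skips the layer and avoids that error.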

Considering all this, I am surprised that the net works very well on unseen data even with a very small amount of training data (~15 images, 4096×4096, ~100 objects per image, only one class).

Maybe you can share some insights.

Best, Thorsten

experiencor commented 6 years ago

@thorstenwagner You are very right about the weights of the last layer (the detection or classification layer). If they are too large, they will destroy the pretrained backend and often result in a NaN loss. That's why in lines 77 and 78 I scale down their values by the size of the layer's input.
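
Paraphrased, the idea of those two lines looks like this (a minimal sketch, not the verbatim repo code; the toy detection head below stands in for the real graph):

```python
import numpy as np
from keras.layers import Conv2D, Input
from keras.models import Model

# Toy stand-in for the YOLO graph: only the final 1x1 detection conv matters here
grid_h, grid_w, nb_box, nb_class = 13, 13, 5, 1
inputs  = Input(shape=(grid_h, grid_w, 1024))
outputs = Conv2D(nb_box * (4 + 1 + nb_class), (1, 1), name='detection_layer')(inputs)
model   = Model(inputs, outputs)

# Re-draw the detection layer's weights and divide by the spatial size of its
# input feature map, so the initial predictions are tiny and the first
# gradients don't blow up the loss and wreck the pretrained backend
layer = model.get_layer('detection_layer')
kernel, bias = layer.get_weights()
layer.set_weights([
    np.random.normal(size=kernel.shape) / (grid_h * grid_w),
    np.random.normal(size=bias.shape)   / (grid_h * grid_w),
])
```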

Regarding the weights of the first, modified layer: the first layer of a CNN tends to form some kind of edge detector regardless of the output signal (even a random one). So if you run the training long enough, it will become a bunch of edge detectors anyway. This is mentioned in one of the introductory lectures of Convolutional Neural Networks for Visual Recognition at Stanford (CS231n). I don't know why that is the case, though.

And I think that your dataset is not that small. In YOLO, we divide an image into grid cells and aim to fit 5 or so anchors in each cell to the object examples. So if we count the object examples as learning examples, the network has about 1.5K examples to learn from across all the images.
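
A quick back-of-envelope count with the numbers from this thread (assuming the default 32x downsampling of YOLOv2 and 5 anchors per cell):

```python
# Numbers from this thread; 32x downsampling and 5 anchors are YOLOv2 defaults
image_size, downsample, anchors_per_cell = 4096, 32, 5

grid = image_size // downsample                 # 128 cells per side
anchor_slots = grid * grid * anchors_per_cell   # 81920 candidate boxes per image

num_images, objects_per_image = 15, 100
positives = num_images * objects_per_image      # 1500, i.e. the ~1.5K examples above
print(grid, anchor_slots, positives)            # 128 81920 1500
```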

thorstenwagner commented 6 years ago

Thank you very much for your clarifications!