Closed MyVanitar closed 7 years ago
Unfortunately, changing the width and height to 544 led to a -nan problem, 0 recall, and very high error after a few hundred iterations. I could not solve it so far despite many experiments, either by changing the batch size or the subdivisions; none worked.
@VanitarNordic
Detection works quite well at different resolutions such as 544x544 or 832x832; the paper guarantees it: https://arxiv.org/pdf/1612.08242.pdf
Table 3: Detection frameworks on PASCAL VOC 2007. YOLOv2 is faster and more accurate than prior detection methods. It can also run at different resolutions for an easy tradeoff between speed and accuracy. Each YOLOv2 entry is actually the same trained model with the same weights, just evaluated at a different size.
Training problems - it is strange that you got this:
Unfortunately, changing the width and height to 544 led to a -nan problem, 0 recall, and very high error after a few hundred iterations.
Because there are some guarantees: https://arxiv.org/pdf/1612.08242.pdf
However, since our model only uses convolutional and pooling layers it can be resized on the fly. We want YOLOv2 to be robust to running on images of different sizes so we train this into the model. Instead of fixing the input image size we change the network every few iterations. Every 10 batches our network randomly chooses a new image dimension size. Since our model downsamples by a factor of 32, we pull from the following multiples of 32: {320, 352, ..., 608}. Thus the smallest option is 320 × 320 and the largest is 608 × 608. We resize the network to that dimension and continue training.
This regime forces the network to learn to predict well across a variety of input dimensions. This means the same network can predict detections at different resolutions.
And even if we train the model only at 416x416, the network is all the same resized to different resolutions every 10 iterations (batches) if we set random=1 in the .cfg file:
parser.c: l.random = option_find_int_quiet(options, "random", 0);
detector.c:
if(l.random && count++%10 == 0) {
https://github.com/AlexeyAB/darknet/blob/76dbdae388a6c269cbf46d28e53fee8ce4ace94d/src/detector.c#L81
dim = random value in [320, 608]: https://github.com/AlexeyAB/darknet/blob/76dbdae388a6c269cbf46d28e53fee8ce4ace94d/src/detector.c#L83
resize_network(nets + i, dim, dim);
https://github.com/AlexeyAB/darknet/blob/76dbdae388a6c269cbf46d28e53fee8ce4ace94d/src/detector.c#L96
resize_network() - here is the code that resizes all layers: https://github.com/AlexeyAB/darknet/blob/76dbdae388a6c269cbf46d28e53fee8ce4ace94d/src/network.c#L322
and later each image is resized to the random network size: sized = resize_image(cropped, w, h); https://github.com/AlexeyAB/darknet/blob/76dbdae388a6c269cbf46d28e53fee8ce4ace94d/src/data.c#L533
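Putting these pieces together, here is a minimal Python sketch (not darknet code; just an illustration of the same logic) of the random=1 multi-scale schedule: every 10 batches a new square network size is drawn from the multiples of 32 in [320, 608].

```python
import random

def multi_scale_dims(n_batches, period=10, seed=0):
    """Sketch of darknet's multi-scale schedule (random=1): every
    `period` batches pick a new square network size, a random
    multiple of 32 in [320, 608], and keep it until the next draw."""
    rng = random.Random(seed)
    dim = 416  # initial network size from the .cfg
    schedule = []
    for batch in range(n_batches):
        if batch % period == 0:
            dim = (rng.randrange(10) + 10) * 32  # 320, 352, ..., 608
        schedule.append(dim)  # resize_network(net, dim, dim) would happen here
    return schedule
```

In the C source the equivalent draw is roughly `dim = (rand() % 10 + 10) * 32;` (the exact form may differ between darknet versions).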
Try to use 416x416 but with random=1 for training: https://github.com/AlexeyAB/darknet/blob/76dbdae388a6c269cbf46d28e53fee8ce4ace94d/cfg/yolo-voc.cfg#L244
You mean 544x544? Have you tried this resolution yourself? I use the same dataset for my experiments. My results are OK at 416x416 with the same dataset.
Besides, after 2000 iterations (for one class), the detection results will not show up if I set the threshold above 0.7. For training images, if I set 0.9 as the threshold the detection shows up, but it disappears if I set 0.99.
It is difficult to say when it is a good time to stop training, because there is no mAP value to check whether it stays stable after some time.
Besides, I believe we do training from scratch, not fine-tuning, because the fine-tuning was most likely for classification and not detection, and also because in fine-tuning the result usually converges faster, to a higher mAP in much fewer iterations. See here, from the YOLO website:
Training YOLO on VOC
You can train YOLO from scratch if you want to play with different training regimes, hyper-parameters, or datasets. Here's how to get it working on the Pascal VOC dataset.
Then it talks about the conv.23 weights and so on ...
@VanitarNordic
You mean 544x544? Have you tried this resolution yourself? I use the same dataset for my experiments. My results are OK at 416x416 with the same dataset.
No, I mean 416x416 with automatic dynamic resolution changing (random=1). I haven't tried to train at 544x544.
If you want to use 544x544 with random=0, then you can try to add this code:
resize_network(nets + i, nets[i].w, nets[i].h);
between lines 40 and 41 in detector.c: https://github.com/AlexeyAB/darknet/blob/76dbdae388a6c269cbf46d28e53fee8ce4ace94d/src/detector.c#L40
Maybe it will solve your problem with training the 544x544 model.
So you mean that if I set random=1, I will get better results? I just wanted to train at 544x544 because of the higher mAP.
I have also made some updates to the reply above. Please have a look.
@VanitarNordic
Yes, random=1 gives better results but requires ~2-3x more iterations.
Besides, I believe we do training from scratch, not fine-tuning, because the fine-tuning was most likely for classification and not detection, and also because in fine-tuning the result usually converges faster, to a higher mAP in much fewer iterations. See here, from the YOLO website:
Training YOLO on VOC You can train YOLO from scratch if you want to play with different training regimes, hyper-parameters, or datasets. Here's how to get it working on the Pascal VOC dataset.
Then it talks about the conv.23 weights and so on ...
The website does not have to be extremely accurate in its terminology, but the article has to be. In both cases, classification & detection, we do fine-tuning: https://arxiv.org/pdf/1612.08242.pdf
For YOLOv2 we first fine tune the classification network at the full 448 × 448 resolution for 10 epochs on ImageNet. This gives the network time to adjust its filters to work better on higher resolution input. We then fine tune the resulting network on detection. This high resolution classification network gives us an increase of almost 4% mAP
In addition, there are well-established scientific terms and their definitions: https://github.com/AlexeyAB/darknet/issues/6#issuecomment-279486661
Thanks :-)
Well, I tried your suggestion and I get an out-of-memory error on a 6 GB GPU. Let me change the batch size or subdivisions and see what happens.
Let me also ask you this: if we select 416x416 or 544x544, does it mean that our training images should be bigger than this size?
What about the case where the trained network does not detect objects at higher thresholds, although it has been trained well enough?
About solving out of memory error on 6G GPU: https://github.com/AlexeyAB/darknet/issues/12#issuecomment-274343460
Yes, low-resolution lossy-compressed (jpg) images can greatly degrade the learning.
Yesss, so my assumption was right: a good training image should not be smaller than the network's defined size. That might be the reason for the problem above with 544x544. Who knows. I will check, come back, and share the experience.
I have one more question about good annotation. Please hold on until I post it. The answer is very important for me at the moment.
In many training images the target object is partially hidden behind other objects. What is the best annotation technique there: cover the hidden object areas inside the rectangle, or cover just the visible areas? Please have a look at the pictures below:
Shape A
Shape B
Considering apple detection, which annotation is better to include the apple that is partially hidden?
Case-2 is closer to the truth.
In short, it is better to use something that is an average between these two examples: as many parts of the object as possible should get into the bbox, and as few parts of other objects as possible. And there should be only a few such examples, where one object strongly overlaps another object.
In general, there is no perfect solution; the trade-off depends on the priority.
Wow, thanks.
What about when we want to detect the object in tricky situations? I mean when it is covered by other objects (such as an apple inside a hand).
Is it good for the training images to contain free and clear examples of the target object, or should they also contain some tricky training images? In other words, if we train the model with samples where the target object is fully visible and free, will the model be able to detect that object in tricky scenes afterwards?
During our correspondence, I trained the model with random=1 and the network lost its convergence and the errors are very big. What's your assumption?
Yes, if we train the model with samples where the target object is fully visible and clear, the model will be able to detect overlapped objects.
About random=1: do you use the correct image dataset? How many classes are in the .cfg? Try to train for classes*5*2000 iterations.
So can we consider it a rule of thumb that our training images should (as much as we can) always contain free, non-overlapping, and visible examples of the target object? Then the trained model will have no problem finding that target in overlapping situations? (I thought before that it must also be trained with overlapping images to be able to detect them, so I was wrong, wow.) Please confirm.
From the beginning classes=1; no, the dataset is identical, I just changed to random=1 and started training. Actually, I stopped training when I saw these numbers; should I still continue?
Yes. Not overlapped, but with different scales, lighting, and rotations.
random=1 always decreases mAP/recall on the training dataset but increases mAP/recall on the validation dataset, and it requires more iterations. It increases mAP on the validation dataset from 75.4 to 76.8.
Also, you can try this fix instead of random=1: https://github.com/AlexeyAB/darknet/issues/30#issuecomment-280774337
You mentioned that training images should not contain overlapping objects. What about validation images? Should they be free and visible, or should they contain some overlapping? (I assume that because they are for testing they should contain overlapping, but please confirm the truth; maybe I am wrong again.)
I see another parameter named thresh=.6 above random; does changing it make sense?
Yes, validation images can include overlapped objects, but it is not necessary.
I don't know how it can be used to increase precision. I think if you increase this thresh = .6 then you will get:
In detail:
It is used in forward_region_layer() to calculate delta_region_class() only when best_iou > l.thresh: https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/src/region_layer.c#L235
in a similar manner as in validate_detector_recall(), which compares if(best_iou > iou_thresh): https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/src/detector.c#L427
This specifies when the object is considered found: when IoU > 0.6.
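To make the best_iou > thresh comparison concrete, here is a small Python sketch of IoU (intersection over union), using corner-coordinate boxes for simplicity (darknet itself works with center x,y,w,h boxes):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # overlap width/height, clamped at zero when the boxes do not intersect
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# a predicted box counts as "found" only if IoU with ground truth > 0.6
found = iou((0, 0, 10, 10), (2, 0, 12, 10)) > 0.6
```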
Thank you very much for the detailed explanation. Do you mean I should copy some images from the training dataset to the validation dataset? I.e., should the images inside the validation dataset also exist inside the training dataset?
Let me describe my situation with an example. I was getting several false positives in camera detection, so I included those scenes with the false-positive objects as images in the training set and trained from the beginning, but there was no progress in reducing the false detections; the model was still detecting the same false positives.
You know, say we want to detect an apple, for instance. If there are some images where the apple is in the hand of a person, the model will think the hand is also part of the apple and will detect the hand, or even the face, as an apple too. If I put an apple on a board, it will think the board is also part of it. This can be reduced significantly by increasing the threshold, BUT that also affects the detection of the apple itself: it makes the apple harder to detect from far away, or makes the detection blink.
I trained the model to the best error for both training and validation, but this phenomenon was not solved. Do you know the trick?
I mean, if you have 10000 images in the dataset, then:
If the training dataset contains apples on different backgrounds, then Yolo learns:
Your training dataset should contain as much background as possible.
How many images and classes?
If you have 10 000 images and 1 class, steps=100,1200,2000 - it will decrease the learning rate at these steps: https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/cfg/yolo-voc.cfg#L17
Also, if you want to detect only apples on video/images with aspect ratio 4:3, as in 640x480, which always have a square bounding box, then you should change the anchors: https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/cfg/yolo-voc.cfg#L228
anchors = 1.33,1.0, 2.66,2.0, 4.0,3.0, 7.98,6.0, 10.64,8.0
You explained very important facts which I didn't know about.
1) Should the images inside the validation dataset also be found inside the training dataset, or should they differ?
2) Should I also include background-only images in the training images? (images which contain no target object, only backgrounds and objects of no interest)
3) > Also, if you want to detect only apples on video/images with aspect ratio 4:3 as in 640x480, which always has square bound box, then you should change anchors: https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/cfg/yolo-voc.cfg#L228 anchors = 1.33,1.0, 2.66,2.0, 4.0,3.0, 7.98,6.0, 10.64,8.0
Okay, I didn't know about these anchor parameters. Can you tell me about the anchors and how I should calculate them from the input video resolution? Should I also change them in the training phase?
4) > How many images and classes?
My dataset is very small, only 150 images on the training side ;-), but I run experiments and check the results with it. Besides, all of them have different backgrounds, so I assume my experience is maybe because of the low number of images.
5) > Then also you should decrease the 2nd & 3rd steps 20-fold: steps=100,1200,2000 - it will decrease learning-rate at this steps: https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/cfg/yolo-voc.cfg#L17
I didn't know about this. I would be grateful if you could explain this more as well, to adjust it for further training if necessary.
It is better if the training set and the validation set contain different images.
Yes, you should add images without objects and bounding boxes. (In Yolo v1 there were problems with this - it caused errors. But in Yolo v2 this seems to be solved.)
Anchors are the proportions of the proposed object sizes, width,height - sizes (0-13) relative to the image, where the image is treated as 13x13. Usually they are calculated by using k-means for each dataset: Figure 2: https://arxiv.org/pdf/1612.08242.pdf
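As a rough illustration (not the paper's exact method - YOLOv2 clusters with an IoU-based distance rather than Euclidean), anchors can be sketched with plain k-means over the dataset's relative box sizes, scaled to the 13x13 grid:

```python
import random

def kmeans_anchors(wh_pairs, k=5, iters=20, seed=0):
    """Rough k-means sketch for YOLOv2-style anchors.
    wh_pairs: ground-truth box (width, height), relative to the image in [0, 1].
    Returns k anchors scaled to the 13x13 output grid.
    Note: the paper clusters with a 1-IoU distance; Euclidean is used here
    only to keep the sketch short."""
    rng = random.Random(seed)
    centers = rng.sample(wh_pairs, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for w, h in wh_pairs:
            # assign each box to the nearest center
            j = min(range(k),
                    key=lambda i: (w - centers[i][0]) ** 2 + (h - centers[i][1]) ** 2)
            clusters[j].append((w, h))
        for i, cl in enumerate(clusters):
            if cl:  # recompute each center as the cluster mean
                centers[i] = (sum(w for w, _ in cl) / len(cl),
                              sum(h for _, h in cl) / len(cl))
    # scale from [0, 1] image-relative sizes to 13x13 grid units
    return sorted((13 * w, 13 * h) for w, h in centers)
```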
150 images is very small; this is definitely not enough for good results.
learning_rate is the speed of training. The weights change by learning_rate * error on each iteration during training. A bigger learning_rate means faster training, but overfitting occurs sooner.
learning_rate=0.0001: https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/cfg/yolo-voc.cfg#L14
learning_rate will be changed at these iterations, steps=100,25000,35000: https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/cfg/yolo-voc.cfg#L17
learning_rate will be multiplied by these scales, scales=10,.1,.1: https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/cfg/yolo-voc.cfg#L18
rate *= net.scales[i]; https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/src/network.c#L62
I.e. learning_rate will be:
- iterations 0-100: learning_rate = 0.0001
- iterations 100-25000: learning_rate = 0.001
- iterations 25000-35000: learning_rate = 0.0001
- iterations 35000+: learning_rate = 0.00001
But these ranges are good for the VOC dataset with 20 classes and 45 000 iterations. If you use 1 class with 2000 iterations, you should divide them by 20, to: steps=100,1200,2000
https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/cfg/yolo-voc.cfg#L17
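The step policy above can be sketched in a few lines of Python (an illustration of darknet's rate *= net.scales[i] logic, with the default cfg values assumed):

```python
def learning_rate_at(iteration, base_lr=0.0001,
                     steps=(100, 25000, 35000), scales=(10.0, 0.1, 0.1)):
    """Sketch of darknet's 'steps' learning-rate policy: each time the
    iteration passes steps[i], the rate is multiplied by scales[i]."""
    rate = base_lr
    for step, scale in zip(steps, scales):
        if iteration >= step:
            rate *= scale
    return rate
```

So with base_lr=0.0001, the rate is 0.0001 before iteration 100, 0.001 up to 25000, back to 0.0001 up to 35000, and 0.00001 after that.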
Thank you again.
1) > Yes, you should add images without objects and bounded boxes. (In Yolo v1 with this was problems - this caused errors. But in Yolo v2 it seems to be solved)
So does the annotation software make a blank file for a background-only image? I mean, does it create a .txt file for a background-only image but write nothing inside?
2) > But these ranges are good for VOC-dataset with 20 classes for 45 000 iterations. If you use 1 class with 2000 iterations - you should divide it at 20, to: steps=100,1200,2000
Would you please explain where the division by 20 is made? What about when we have 2 or 3 classes?
3) > Anchors - is proportions of proposed object sizes: width,height - relative sizes (0-13) to image, where image has size 13x13. Usually it calculated by using k-means for each dataset
Should I modify these anchors for training too, or just when I want to test the model?
I guess the best theoretical input image/video resolution would be equal to the network's defined width and height (for example 416x416), or if it is bigger, it should be 416 plus a multiple of 64, as in [416 + (x * 64)]; where x=2, this equals 544. Am I right?
I would appreciate it if you could explain a bit more. I read that part of the paper, but I did not understand how you calculated these numbers; I only realized that there are five anchors because k=5 is a good trade-off. We have aspect ratios of: 16:9, 4:3, 3:2, 21:9.
Yes - it creates a .txt file for a background-only image but writes nothing inside.
They were steps=100,25000,35000 and are now steps=100,1200,2000 - the second and third steps divided by 20.
Why exactly 20? The initial steps are for VOC, which has 20 classes. If you have 1 class, then divide by 20. If you have 2 classes, then divide by 10 = (20/2).
Anchors for square (1:1) objects on images with aspect ratio 4:3 (assuming that the training images also have resolution 640x480, or at least the same 4:3 ratio):
anchors = 1.33,1.0, 2.66,2.0, 4.0,3.0, 7.98,6.0, 10.64,8.0
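As a quick sanity check (just arithmetic on the values above), all five suggested anchors keep the same width:height proportion and differ only in scale:

```python
# The five suggested anchors (width, height) in 13x13 grid units:
anchors = [(1.33, 1.0), (2.66, 2.0), (4.0, 3.0), (7.98, 6.0), (10.64, 8.0)]

# Each pair has (approximately) the same 4:3 width:height proportion;
# only the scale changes across the five boxes.
ratios = [round(w / h, 2) for w, h in anchors]
```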
Thank you again.
1) > Were steps=100,25000,35000, and are now steps=100,1200,2000 - second and third steps divided by 20. Why exactly 20? Initial steps are for VOC, where 20 classes. If you have 1 class, then divide by 20. If you have 2 classes, then divide by 10 = (20/2)
Actually I asked this because 25000/20=1250 and 35000/20=1750; is there any other calculation behind it?
2) > Yes, would be equal to the network. Also the best theoretical input image/video resolution should be equal for training and test images
Okay, then if we intentionally change the resolution of the input videos/images, would detection be better? Is there any tool to do so? Besides, I think if we do so, it will remove the need to change the anchors. Yes?
If we make the training/validation images the same size as the network (for example 416x416), does this lead to better accuracy, or doesn't it matter?
3) > My calculation is very approximate. If we strictly optimize only for apples and only for ratio 4:3.
What if we had two classes, banana and apple? What if the only class is banana, which is not square? Is there then no need to change the anchors in the 4:3 condition?
steps=100,1200,2000 - I began to make too many mistakes :)
It doesn't matter. Yolo itself resizes all images to the network size.
Doesn't this affect the accuracy? Because if it resizes all images, many images will be stretched, since the aspect ratio is not kept when it resizes to a fixed size such as 416x416.
In both cases the accuracy will be the same:
So, as a result, it makes no difference in accuracy even if we resize to the network size ourselves. So make life easier and don't touch it.
Yes.
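For illustration, a tiny pure-Python nearest-neighbour sketch of the stretch-resize Yolo applies internally (the real resize in darknet's image.c is bilinear and written in C; this only demonstrates that the aspect ratio is not preserved):

```python
def stretch_resize(pixels, out_w, out_h):
    """Nearest-neighbour stretch-resize of a 2D pixel grid to a fixed
    out_w x out_h size. Width and height are scaled independently, so
    e.g. a 640x480 frame squeezed to 416x416 gets distorted."""
    in_h, in_w = len(pixels), len(pixels[0])
    return [[pixels[y * in_h // out_h][x * in_w // out_w]
             for x in range(out_w)]
            for y in range(out_h)]
```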
Alright, I tried to apply the tips you mentioned about the background images, but Yolo_mark does not create blank .txt files for these images, and it also does not include their names inside train.txt. Besides, a small issue is still open with Yolo_mark. I would really appreciate it if you could consider it.
I also tried the effect of background images. The result was immediate: it reduced the number of false positives significantly, even with only 150 images. I just added around 15 backgrounds and checked the results. I am waiting for Yolo_mark to be fixed because I added these images/txt files manually; I'm not an expert in C++ as you are, who wrote Yolo_mark, otherwise I would have updated it myself. Besides, I want to know how I can use the results to make GUI applications - anything from using the Darknet DLL in .NET, Qt Creator, Python, or whatever. Also, I want to send you something in private; would you please give me your email address?
I added a feature to Yolo-mark to process background images without objects.
Also, I sent you an email with my email address.
@VanitarNordic I added Darknet as DLL: https://github.com/AlexeyAB/darknet/issues/27#issuecomment-286882940
Hello,
I changed the width and height of the [net] section to 544, but this led the model to a strange behavior: as it iterated more, the non-detection behavior became stronger, and finally it detected nothing, even on training images! Do you know why? Strange.
Besides, when I look at YOLOv2 544x544 on the Darknet website, its cfg file has the height and width written as 416. Why?!
Also, I realized that the input images' sizes do not affect the consumed GPU memory. This is unique to YOLO, because big training images (in size or resolution) would easily lead to out of memory in other models.