Training images terms and conditions, [net] layer parameters

MyVanitar commented 7 years ago

Hello,

I changed the weightand height of the [net] layers to 544, but this led the model to an strange behavior and as it was iterating more, an un-detection behavior became more powerful and finally it goes to no detection even on Training images!. Do you know why? Strange.

Besides when I look at YOLOv2 544x544 in the Darket website, its cfg files wrote the height and width as 416, why?!

Also I realize that input images' sizes do not affect the consumed GPU memory. This is unique for YOLO because big training images (in size or resolution) would easily lead to out of memory on other models.

MyVanitar commented 7 years ago

Unfortunately changing the width and heightto 544 led to -nan problem or 0 recall and very high error after few hundred of iterations. I could not solve it so far after many experiments. either by changing the batch size or subdivision. none worked.

AlexeyAB commented 7 years ago

@VanitarNordic

Detection works quite well at different resolutions 544x544 or 832x832, there is a guarantee: https://arxiv.org/pdf/1612.08242.pdf

Table 3: Detection frameworks on PASCAL VOC 2007. YOLOv2 is faster and more accurate than prior detection methods. It can also run at different resolutions for an easy tradeoff between speed and accuracy. Each YOLOv2 entry is actually the same trained model with the same weights, just evaluated at a different size.

Training problems - this is strange, that you got it:

Unfortunately changing the width and heightto 544 led to -nan problem or 0 recall and very high error after few hundred of iterations.

Because there are some guarantees: https://arxiv.org/pdf/1612.08242.pdf

However, since our model only uses convolutional and pooling layers it can be resized on the fly. We want YOLOv2 to be robust to running on images of different sizes so we train this into the model. Instead of fixing the input image size we change the network every few iterations. Every 10 batches our network randomly chooses a new image dimension size. Since our model downsamples by a factor of 32, we pull from the following multiples of 32: {320, 352, ..., 608}. Thus the smallest option is 320 × 320 and the largest is 608 × 608. We resize the network to that dimension and continue training.

This regime forces the network to learn to predict well across a variety of input dimensions. This means the same network can predict detections at different resolutions.

And because if we train model only on 416x416, then all the same, the network resized at different resolutions each 10 iterations (batches), if we set random=1 in .cfg-file:

parser.c: l.random = option_find_int_quiet(options, "random", 0);

detector.c:

if(l.random && count++%10 == 0) { https://github.com/AlexeyAB/darknet/blob/76dbdae388a6c269cbf46d28e53fee8ce4ace94d/src/detector.c#L81
dim = random value[320 - 608]: https://github.com/AlexeyAB/darknet/blob/76dbdae388a6c269cbf46d28e53fee8ce4ace94d/src/detector.c#L83
resize_network(nets + i, dim, dim); https://github.com/AlexeyAB/darknet/blob/76dbdae388a6c269cbf46d28e53fee8ce4ace94d/src/detector.c#L96
- resize_network() here is code that resizes all layers: https://github.com/AlexeyAB/darknet/blob/76dbdae388a6c269cbf46d28e53fee8ce4ace94d/src/network.c#L322
}
and later each image resized to random network size image sized = resize_image(cropped, w, h);: https://github.com/AlexeyAB/darknet/blob/76dbdae388a6c269cbf46d28e53fee8ce4ace94d/src/data.c#L533

Try to use 416x416 but with random=1 for training: https://github.com/AlexeyAB/darknet/blob/76dbdae388a6c269cbf46d28e53fee8ce4ace94d/cfg/yolo-voc.cfg#L244

MyVanitar commented 7 years ago

Try to use 416x416 but with random=1 for training: https://github.com/AlexeyAB/darknet/blob/76dbdae388a6c269cbf46d28e53fee8ce4ace94d/cfg/yolo-voc.cfg#L244

You mean 544 544? Have you tried this resolution yourself? I use the same dataset for my experiments. My results are OK with 416 416 with the same dataset.
Besides after 2000 iterations (for one class), the detection results will not show up if I put threshold more than 0.7. For training images. if I put 0.9 as threshold, the detection will show up but it will disappear if I put 0.99.
it is difficult to say when it is a good time to stop iterations because there is no mAP value to see if it remains stable after some time.
besides, I believe we do training from scratch, not fine-tuning. because fine-tuning was most likely for classification and not detection. also because in fine-tuning usually the result should converge faster in higher mAP and mush less iterations. see here from the YOLO website:

Training YOLO on VOC

You can train YOLO from scratch if you want to play with different training regimes, hyper-parameters, or datasets. Here's how to get it working on the Pascal VOC dataset.

then it says about the conv.23 weights and so on ...

AlexeyAB commented 7 years ago

@VanitarNordic

You mean 544 544? Have you tried this resolution yourself? I use the same dataset for my experiments. My results are OK with 416 416 with the same dataset.

No, I mean 416x416 with auto dynamic resolutions changing random=1. I don't tried to train on 544x544.

If you want use 544x544 with random=0 , then you can try to add this code:

        resize_network(nets + i, nets[i].w, nets[i].h);

between lines 40 and 41 in detector.c: https://github.com/AlexeyAB/darknet/blob/76dbdae388a6c269cbf46d28e53fee8ce4ace94d/src/detector.c#L40

May be it can solves your problem with training 544x544 model.

MyVanitar commented 7 years ago

so you mean if I put random=1, then I will get better results? I just wanted to train on 544 * 544 because of the higher mAP.

I have made some updates to the above reply either. Please have a look.

AlexeyAB commented 7 years ago

@VanitarNordic

Yes, random=1 gives better results but require ~2x3 more iterations.

besides, I believe we do training from scratch, not fine-tuning. because fine-tuning was most likely for classification and not detection. also because in fine-tuning usually the result should converge faster in higher mAP and mush less iterations. see here from the YOLO website:

Training YOLO on VOC You can train YOLO from scratch if you want to play with different training regimes, hyper-parameters, or datasets. Here's how to get it working on the Pascal VOC dataset.

then it says about the conv.23 weights and so on ...

The website does not have to be extremely accurate in terms. But it is necessary in the article, in both casses classification & detection we do fine-tune: https://arxiv.org/pdf/1612.08242.pdf

For YOLOv2 we first fine tune the classification network at the full 448 × 448 resolution for 10 epochs on ImageNet. This gives the network time to adjust its filters to work better on higher resolution input. We then fine tune the resulting network on detection. This high resolution classification network gives us an increase of almost 4% mAP

In addition, there are well-established scientific terms and their definitions: https://github.com/AlexeyAB/darknet/issues/6#issuecomment-279486661

MyVanitar commented 7 years ago

Thanks :-)

Well I tried your suggestion and I get out of memory error on 6G GPU. let me change the batch or subdivision and see what happens.

Let me ask you this also, if we select 416 416 or 544 544, does it mean that our training images should be bigger than this size?

What about the case when trained network does not detect objects on higher thresholds although it has trained good enough?

AlexeyAB commented 7 years ago

About solving out of memory error on 6G GPU: https://github.com/AlexeyAB/darknet/issues/12#issuecomment-274343460

Yes, low-resolution lossy-compressed (jpg) images can greatly degrade the learning.

MyVanitar commented 7 years ago

Yesss, so my assumption was right. a good training image should not be smaller than the network defined size. That might be the reason of that above problem with 544 * 544. Who knows. I will check and comeback and share the experience.

I have one question more about the good annotation. Please remain hold till I post it. The answer is very important for me at the moment.

MyVanitar commented 7 years ago

in many training images, the target object is hidden partially within other objects. Then what is the best annotation technique there? cover hidden object areas inside rectangle or just cover visible areas? Please have look at below pictures:

Shape A

Shape B

Considering in Apple detection, which annotation is better to include that apple which is partially hidden?

AlexeyAB commented 7 years ago

Case-2 closer to the truth.

In short, the better to use that something an average between these two examples. As much as possible parts of the object should get into the bbox, and as little as possible parts of another objects. And such examples should be a little bit - where one object strongly overlaps another object.

In general, there is no perfect solution. The trade-off depends on the priority:

if you want to accurately determine the class of the object, the better average between case 1-2
if you want to accurately find the coordinates of the object, the better case 2

MyVanitar commented 7 years ago

Wow, thanks.

What about when we want to detect the object in tricky situations?, I mean when it is covered by another objects (such as an apple inside a hand).

is it good that the training images contain free & clear examples of the target object or it should contain some tricky training images also?

in the other word, if we train the model with the samples that target object is fully visible and free, does the model is be able to detect that object in tricky areas afterwards?

MyVanitar commented 7 years ago

within our correspondences, I trained the model with random=1 and the network lost its coverage and errors are very big. What's your assumption?

2017-02-18_2-32-12

AlexeyAB commented 7 years ago

Yes, if we train the model with the samples that target object is fully visible and clear, the model be able to detect overlaped objects.

About random=1, do you use correct image-dataset? How many classes in .cfg? Try to train classes*5*2000 iterations.

MyVanitar commented 7 years ago

So can we consider this as a rule of thumb that our training images should always (as much as we can) contain free, no overlapping and visible examples of the target object? Then the trained model will not have any problem to find that target in overlapping situations? (I was thinking before that it must be trained with overlapping images also to be able to detect, so I was wrong, wow) Please confirm.
from the beginning the classes=1 , no the dataset is identical, I just changed to random=1 and started training. actually I stopped training when I saw these numbers, should I still continue?

AlexeyAB commented 7 years ago

Yes. Not overlaped, but with different: scales, lighting, rotation.
random=1 always decrease mAP/recall on training dataset, but increase mAP/recall on valid dataset and requires more iterations. It increase mAP on valid dataset from 75.4 to 76.8.

yolov2

Also you can try this fix instead of random=1: https://github.com/AlexeyAB/darknet/issues/30#issuecomment-280774337

MyVanitar commented 7 years ago

You mentioned about training images which they should not contain objects with overlapping

What about validation images? they should be free and visible, or they should contain some overlapping? (I assume because they are test, they should contain overlapping, but you confirm the truth please, maybe I am wrong again)
I see another parameter named threshold=.6 above random, changing that does not make sense?

AlexeyAB commented 7 years ago

Yes, validation images can include overlaped objects, but it is not necessary.
- In general, validation images should be 10-20% randomly selected images from whole dataset, and the rest of the images 80-90% should be in training dataset. And in general, both training and validation dataset should has equal 5-10% of images with overlaped objects.
I don't know how it can be used to increase precision. I think if you will increase this thresh = .6 then you will get:
- less false-positives
- less true-positives
- results with more accurate bounded boxes

In details:

It used in forward_region_layer() to calculate delta_region_class() only when best_iou > l.thresh: https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/src/region_layer.c#L235
in a similar manner as in the validate_detector_recall() where compare if(best_iou > iou_thresh) : https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/src/detector.c#L427

This specifies when the object is considered found, when IoU > 0.6:

68747470733a2f2f6873746f2e6f72672f66696c65732f6361382f3836362f6437362f63613838363664373666623834303232383934306462663434326137663036612e6a7067

MyVanitar commented 7 years ago

Thank you very much for the detailed explanation. You mean I should copy some images from the training dataset to the validation dataset? I mean the images inside the validation dataset should exist inside the training dataset?

let me tell you my situation by an example. I was detecting several false positives in camera detection and I include those scenes with false positive objects as images inside the training set again, and trained from the beginning, but no progress in reducing false detection. model was still detecting the same false positives.

You know, consider we gonna detect an apple for instance. if there are some images that the apple is in the hand of a person, the model would think that is also a part of the apple and will detect hand or even face as apple either. if I put an apple on a board, it will think the board is also part of it. this can be solved significantly by increasing the threshold, BUT it affects the detection of apple itself either, and makes the apple less sensitive to be detectable from far or make the detection blinking.

I trained the model till the best situation of error for both training and validation, but this phenomenon did not solve. Do you know what's the trick?

AlexeyAB commented 7 years ago

I mean if you have 10000 images in dataset, then:

randomly selected 9000 should be in training dataset
other 1000 shoud be in validation dataset

If in training dataset are apples on different backrounds, then Yolo learns:

each time repeated outlines (apple) - this is an object
each time different contours (hand, face, board) - this is a background Yolo will not detect background if there is not object.

In your training dataset should be as much background as possible.

How many images and classes?

If you have 10 000 images and 1 classes,
- then also you should decrease 20-fold 2 & 3 steps steps=100,1200,2000 - it will decrease learning-rate at this steps: https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/cfg/yolo-voc.cfg#L17
- and you should train about 10 000 iterations
Also, if you want to detect only apples on video/images with aspect ratio 4:3 as in 640x480, which always has square bound box, then you should change anchors: https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/cfg/yolo-voc.cfg#L228 anchors = 1.33,1.0, 2.66,2.0, 4.0,3.0, 7.98,6.0, 10.64,8.0

MyVanitar commented 7 years ago

You explained very important facts which I didn't know about them.

1) The images inside validation dataset should be found also inside training dataset or they should differ?

2) Should I also include only background inside training images? (images which contain no target object in them, only backgrounds, objects of no interest)

3) > Also, if you want to detect only apples on video/images with aspect ratio 4:3 as in 640x480, which always has square bound box, then you should change anchors: https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/cfg/yolo-voc.cfg#L228 anchors = 1.33,1.0, 2.66,2.0, 4.0,3.0, 7.98,6.0, 10.64,8.0

Okay, I didn't know about this anchor parameters. Can you tell me about this anchors and how should I calculate them by the input video resolution? should I change them also in the training phase?

4) > How many images and classes?

My dataset is very small, only 150 images on the training side ;-), but I make experiments and check the results with. besides all have different backgrounds. so I assume my experience is because of the low number of images maybe.

5) > Then also you should decrease 40-fold 2 & 3 steps steps=100,1200,2000 - it will decrease learning-rate at this steps: https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/cfg/yolo-voc.cfg#L17

I didn't know about this. I will be grateful if you explain this also more to adjust it for further training, if it is necessary.

AlexeyAB commented 7 years ago

Better if training-set and validation-set contains different images
Yes, you should add images without objects and bounded boxes. (In Yolo v1 with this was problems - this caused errors. But in Yolo v2 it seems to be solved)
Anchors - is proportions of proposed object sizes: width,height - relative sizes (0-13) to image, where image has size 13x13. Usually it calculated by using k-means for each dataset: Figure 2: https://arxiv.org/pdf/1612.08242.pdf
150 images is very small, this is exactly not enough for good results.

learning_rate - is speed of training. The weight changes at learning_rate * error for each iteration during training. Bigger learning_rate - faster training - faster occurs overfitting.
- Initial learning_rate=0.0001: https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/cfg/yolo-voc.cfg#L14
- learning_rate will be changed at this iterations steps=100,25000,35000: https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/cfg/yolo-voc.cfg#L17
- learning_rate will be multiplied at this scales scales=10,.1,.1: https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/cfg/yolo-voc.cfg#L18
- it happens here rate *= net.scales[i]; https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/src/network.c#L62

I.e. learning_rate will be:

[0 - 100] iteration learning_rate will be 0.0001
[100 - 25000] iteration learning_rate will be 0.001
[25000 - 35000] iteration learning_rate will be 0.0001
[35000 - ...] iteration learning_rate will be 0.00001

But these ranges are good for VOC-dataset with 20 classes for 45 000 iterations. If you use 1 class with 2000 iterations - you should divide it at 20, to: steps=100,1200,2000 https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/cfg/yolo-voc.cfg#L17

MyVanitar commented 7 years ago

Thank you again.

1) > Yes, you should add images without objects and bounded boxes. (In Yolo v1 with this was problems - this caused errors. But in Yolo v2 it seems to be solved)

So does the annotation software make a blank file for a typical background-only image, I mean it creates .txt file for a background-only image, but writes nothing inside?

2) > But these ranges are good for VOC-dataset with 40 classes for 45 000 iterations. If you use 1 class with 2000 iterations - you should divide it at 40, to: steps=100,1200,2000

Would you please explain where the division by 40 has made? What about when we have 2 or 3 classes?

3) > Anchors - is proportions of proposed object sizes: width,height - relative sizes (0-13) to image, where image has size 13x13. Usually it calculated by using k-means for each dataset

I should modify these anchors for training also or just when I want to test the model?
I guess the best theoretical input image/video resolution would be equal to the network defined width and height (for example 416*416) or if it is bigger, it should be added by a multiplication of 64, as: [416 + (x * 64)], where x=2, equals to: 544 Am I right?
I appreciate if you explain a bit more. I read that part of the paper, but I did not understand how you calculated these numbers, Just I realize there are five anchors because k=5 is a good trade-off. we have aspect ratios of: 16:9 , 4:3 , 3:2 , 21:9.

AlexeyAB commented 7 years ago

Yes - creates .txt file for a background-only image, but writes nothing inside
Were steps=100,25000,35000, and are now steps=100,1200,2000 - second and third steps divided by 20 Why exactly 20? Initial steps are for VOC, where 20 classes. If you have 1 class, then divide by 20. If you have 2 classes, then divide by 20 = (20/2)
Anchors
- For training also - you should modify these anchors.
- Yes, would be equal to the network. Also the best theoretical input image/video resolution should be equal for training and test images
- My calculation is very approximate. If we strictly optimize only for apples and only for ratio 4:3. Based on the fact that:
- the apple bounded the square box - ratio 1:1
- video input resolution is 640x480 - ratio 4:3 (assume that the training images also have resolution 640x480 or at least the same ratio 4:3)
- then each anchors should have ratio 3:4: anchors = 1.33,1.0, 2.66,2.0, 4.0,3.0, 7.98,6.0, 10.64,8.0
- why the second values are: 1.0, 2.0, 3.0, 6.0 and 8.0 - I simply invented its that would be less than 13

MyVanitar commented 7 years ago

Thank you again.

1) > Were steps=100,25000,35000, and are now steps=100,1200,2000 - second and third steps divided by 40 Why exactly 40? Initial steps are for VOC, where 40 classes. If you have 1 class, then divide by 40. If you have 2 classes, then divide by 20 = (40/2)

Actually I asked this because 25000/40=625 , also 35000/40=875, is there any other calculations behind?

2) > Yes, would be equal to the network. Also the best theoretical input image/video resolution should be equal for training and test images

Okay, the if we intentionally change the resolution of the input videos/images, then would be better for detection, yes? is there any tool to do so? besides, I think if we do so, it will remove the need of using anchors. Yes?
if we make the training/validation images to have the same size as the network size (for example 416 * 416), then this leads to the better accuracy or doesn't matter?

3) > My calculation is very approximate. If we strictly optimize only for apples and only for ratio 4:3.

What about if we had two classes of banana and apple? What about if the class is only banana which is not square? Then no need to change the anchors in 4:3 condition?

AlexeyAB commented 7 years ago

I fixed it in my answer: VOC has 20 classes, and steps should be divided by 20 for 1 class. Correct is steps=100,1200,2000 - I began to make too many mistakes :)
It doesn't matter. Yolo itself resizes all images to network-size. Anchors solves two problems:
- different aspect ratio of input or training images
- different aspect ratio of objects
Then better to leave anchors value as they were

MyVanitar commented 7 years ago

It doesn't matter. Yolo itself resizes all images to network-size.

This does not affect the accuracy? because if it resize all images, many images will be stretchy, because aspect ratio is not kept when it decides to resize to a fixed size such as 416*416

AlexeyAB commented 7 years ago

In two cases will be the same accuracy:

If you use initial image (640x480) then Yolo converts 640x480 -> 416x416 and Yolo processes 416x416-image
If you use initial image (640x480) then you resizes it to 416x416 and Yolo processes 416x416-image

MyVanitar commented 7 years ago

So as the result, makes no difference in accuracy, even if we ourselves resize to the network size. so make the life easier and do not touch.

AlexeyAB commented 7 years ago

Yes.

MyVanitar commented 7 years ago

Alright, I tried to add the tips you mentioned about the background images, but the yolo-mark does not create blank .txt files for these images and also it does not include their names inside train.txt. Besides a small issue is still open with yolo-mark. I really appreciate if you consider.

MyVanitar commented 7 years ago

I also tried the effect of background-images. The result was immediate. it reduced the number of false positives significantly, although in 150 images . I just added around 15 backgrounds and checked the results. I am just waiting for YOLO_Mark to be fixed because I added these images/txt files manually. I'm not expert in C++ as you are and wrote Yolo_Mark, otherwise I was updating it. Besides, I want to know how can I use the results in making some GUI applications. Anything from using Darknet DLL in .Net, QT Creator, python or whatever. Besides, I want to send you something in private. would you please give your Email address?

AlexeyAB commented 7 years ago

I add feature to Yolo-mark to process backgroun-images without objects.

Also I sent email to you with my email.

AlexeyAB commented 7 years ago

@VanitarNordic I added Darknet as DLL: https://github.com/AlexeyAB/darknet/issues/27#issuecomment-286882940

AlexeyAB / darknet

Training images terms and conditions, [net] layer parameters #30