How to train small object with limited training data?

blake-ding commented 5 years ago

Hi Sir,

I am trying to train a small-object detector (only one class), but I only have limited data (1x images). Each image has thousands of objects. My questions are below:

1. If an image includes 1000 objects:
   1.1 Should I label all 1000 objects?
   1.2 Or should I cut the image into several smaller images, e.g. 20 images with 50 labels per image, or even 1000 images with 1 label per image? Do these methods affect the training results or the convergence speed?
2. For a one-class object detector, should I add negative data (background images without labels)? If yes, what proportion of positive to negative data?
3. If our training data is high resolution (e.g. 1080p), should we do some pre-processing before training?

BR

AlexeyAB commented 5 years ago

@blake-ding Hi,

1.1 You must label all 1000 objects. You should also set max=1000 in each [yolo] layer of the cfg-file.

1.2 Relative sizes of objects in Training dataset should be the same as you want to Detect https://github.com/AlexeyAB/darknet#how-to-improve-object-detection

General rule: your training dataset should cover the same range of relative object sizes that you want to detect:

    train_network_width * train_obj_width / train_image_width ~= detection_network_width * detection_obj_width / detection_image_width
    train_network_height * train_obj_height / train_image_height ~= detection_network_height * detection_obj_height / detection_image_height

I.e. for each object from Test dataset there must be at least 1 object in the Training dataset with the same class_id and about the same relative size:

object width in percent from Training dataset ~= object width in percent from Test dataset

That is, if only objects that occupied 80-90% of the image were present in the training set, then the trained network will not be able to detect objects that occupy 1-10% of the image.
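A minimal sketch of the relative-size rule, for illustration only. The function names and the 25% tolerance are assumptions made for this example; darknet does not define a numeric threshold for "about the same" size.

```python
def size_on_network(network_side: int, obj_side: float, image_side: float) -> float:
    """Object extent in pixels once the image is resized to the network input."""
    return network_side * obj_side / image_side


def sizes_match(train_px: float, detect_px: float, tolerance: float = 0.25) -> bool:
    """True when the two sizes differ by at most `tolerance` (hypothetical threshold)."""
    return abs(train_px - detect_px) <= tolerance * max(train_px, detect_px)


# A 20-px-wide object in a 1920-px-wide training image, network width 608.
train_px = size_on_network(608, 20, 1920)
# A 20-px-wide object in a 1024-px-wide image at detection time, network width 416.
detect_px = size_on_network(416, 20, 1024)
print(round(train_px, 2), round(detect_px, 2), sizes_match(train_px, detect_px))
```

The same comparison applies to heights. An object that fills most of the training image and one that fills a few percent of the test image will fail the check at any reasonable tolerance.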

2. Yes, it is very desirable to add negative samples. Proportion 1:1.
3. You shouldn't do pre-processing. You should use a high network resolution (larger width and height, each a multiple of 32).
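As a small illustration of the multiple-of-32 constraint, the helper below rounds a desired side length to the nearest valid value. The function name is hypothetical; the result is what you would write as width= or height= in the cfg-file.

```python
def nearest_valid_network_side(target: int, multiple: int = 32) -> int:
    """Round `target` to the nearest multiple of `multiple` (halves round up).

    The result is never smaller than one multiple.
    """
    rounded = (target + multiple // 2) // multiple * multiple
    return max(multiple, rounded)


for side in (1080, 1000, 608):
    print(side, "->", nearest_valid_network_side(side))
```

Larger values need more GPU memory, so the rounded size may still have to be reduced to fit the card.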

Also you should recalculate anchors.

blake-ding commented 5 years ago

@AlexeyAB ,

Thanks a ton for your reply. In the link above, I see the following suggestion for small objects:

for training for small objects - set layers = -1, 11 and set stride=4 instead of

But training stops right at the beginning after I make the above modification. Am I missing something?

BR

AlexeyAB commented 5 years ago

@blake-ding It seems you are doing something wrong, or you use other cfg than yolov3.cfg, or you don't have enough GPU-RAM.

You can try to train this cfg-file without such changes: https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov3_5l.cfg Just change classes in each of the 5 [yolo] layers and filters in the [convolutional] layer just before each of them.
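For reference, the darknet README gives the YOLOv3 rule filters=(classes + 5)x3 for the [convolutional] layer placed before each [yolo] layer. The sketch below just evaluates that formula; the function name is hypothetical, and the default of 3 assumes each [yolo] layer lists 3 anchor indices in its mask.

```python
def yolo_filters(classes: int, masks_per_layer: int = 3) -> int:
    """filters value for the [convolutional] layer before a [yolo] layer.

    Each anchor predicts 4 box coordinates + 1 objectness score + one score per class.
    """
    return (classes + 5) * masks_per_layer


print(yolo_filters(1))   # one-class detector
print(yolo_filters(5))   # five-class detector
```

With the 80 classes of the stock yolov3.cfg the formula gives 255, which is the value that file ships with.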

blake-ding commented 5 years ago

@AlexeyAB

Thanks again. I found the problem with the [route] layers = ... setting: I had modified the parameter on the wrong lines (instead of lines 717 and 720).

Another question: I tried to generate our own anchors, but the output looks strange, as below:

    0,0, 0,0, 0,0, 0,0, 10,12, 11,13, 16,15, 18,22, 48,50

Is it normal with 0,0 ?

For re-generating anchors, should we use both positive data and negative data or positive data only?

BR

AlexeyAB commented 5 years ago

@blake-ding

Is it normal with 0,0 ?

No, it is not normal. Check your dataset by using: https://github.com/AlexeyAB/Yolo_mark It seems that several objects have size 0x0.
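Besides checking visually in Yolo_mark, the label files can be scanned with a script. This is a minimal sketch, assuming the usual YOLO label format of one `class x_center y_center width height` line per object with values relative to the image size; the function name and the reason strings are made up for this example.

```python
def find_degenerate_boxes(label_text: str, min_side: float = 0.0):
    """Return (line_number, reason) for each malformed or zero-sized box."""
    problems = []
    for number, line in enumerate(label_text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 5:
            problems.append((number, "expected 5 fields"))
            continue
        try:
            _, x, y, w, h = (float(v) for v in fields)
        except ValueError:
            problems.append((number, "non-numeric field"))
            continue
        if w <= min_side or h <= min_side:
            problems.append((number, "zero or negative size"))
        elif not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 and w <= 1.0 and h <= 1.0):
            problems.append((number, "value outside 0..1"))
    return problems


sample = "0 0.5 0.5 0.01 0.02\n0 0.3 0.3 0 0\n\n0 0.2 0.2 0.1\n"
print(find_degenerate_boxes(sample))
```

To check a whole dataset, read each label file listed alongside train.txt and pass its contents to the function.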

For re-generating anchors, should we use both positive data and negative data or positive data only?

It doesn't matter. Result will be the same.
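One way to see why: anchors are clustered from the widths and heights of labeled boxes, and a negative image has an empty label file, so it contributes nothing. The sketch below only gathers the clustering input and is not darknet's calc_anchors code; the function name is hypothetical.

```python
def box_sizes(label_texts, network_width: int, network_height: int):
    """Collect (width, height) in network pixels for every labeled box."""
    sizes = []
    for text in label_texts:
        for line in text.splitlines():
            fields = line.split()
            if len(fields) == 5:  # class x_center y_center width height
                sizes.append((float(fields[3]) * network_width,
                              float(fields[4]) * network_height))
    return sizes


positive = ["0 0.5 0.5 0.02 0.03\n0 0.1 0.1 0.04 0.05\n"]
negative = ["", ""]  # background images have empty label files

with_negatives = box_sizes(positive + negative, 416, 416)
without_negatives = box_sizes(positive, 416, 416)
print(with_negatives == without_negatives)
```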

blake-ding commented 5 years ago

@AlexeyAB

I used a 3rd-party tool, LabelImg, to generate the label info and another script to convert it to YOLO format. I also used Yolo_mark to double-check our images and labels, and they look normal. I tried to re-generate the anchors, but the results sometimes contained 0,0, as below.

    //--------------------------
    avg IoU = 89.49 %
    Saving anchors to the file: anchors.txt
    anchors = 0, 0, 0, 0, 10, 12, 10, 12, 11, 13, 17, 16, 19, 24, 32, 35, 52, 55
    //--------------------------
    avg IoU = 91.72 %
    Saving anchors to the file: anchors.txt
    anchors = 0, 0, 0, 0, 13, 16, 20, 19, 22, 24, 23, 29, 26, 35, 55, 56, 108,126
    //--------------------------
    avg IoU = 93.39 %
    Saving anchors to the file: anchors.txt
    anchors = 10, 12, 10, 13, 11, 13, 16, 15, 18, 21, 20, 27, 33, 35, 48, 49, 88,102
    //--------------------------

I ran it many times, and most of the results included zeros.

I checked all .txt files and can't find any 0x0 size in the label data. Should I check it some other way? Sometimes I also get a normal result without zeros; may I use that anchor result directly?

Thank you very much. BR

AlexeyAB commented 5 years ago

@blake-ding

Can you compress (zip) your train.txt file and all txt-labels files, and attach this zip-file to your message?

blake-ding commented 5 years ago

@AlexeyAB

Please see the attached file (the txt files and train.txt). Thanks.

BR txtFiles.zip

AlexeyAB commented 5 years ago

@blake-ding I fixed a bug in calc_anchors. Try updating your code from GitHub.

blake-ding commented 5 years ago

@AlexeyAB

calc_anchors works fine after updating the code, thanks.

For the previous question.

For a one-class object detector, should I add negative data (background images without labels)? If yes, what proportion of positive to negative data?

Your reply about the proportion is 1:1. If our target is a 5-class object detector, my questions are below:

1. If we have a total of 500 images of positive data (with all 5 classes), then we also need 500 negative images without any labels, right?
2. What should the proportion between the classes in the positive data be?
   2.1 What if the proportion between the classes is very unbalanced? E.g. 5 classes, 500 images in total, with proportions such as:
       => 120:120:120:120:20
       => 420:20:20:20:20
       Does this influence the training results?
3. For the above condition, is there any difference between an object detector and a classifier?
4. I have already trained for 166xx iterations (number of batches), but the avg loss is still floating between 10 and 20, although it is decreasing slowly. Is this avg loss normal after 166xx iterations?

Btw, is there any paper I could refer to for these questions? Thank you very much.

BR

AlexeyAB commented 5 years ago

@blake-ding Hi,

1. Yes. 500 images of positive data (with all 5 classes) and 500 negative.
2. Yes, there is some influence. It will be partially solved by the decay that is used by default: https://github.com/AlexeyAB/darknet/issues/1943#issuecomment-439560675 You can help to solve it by eliminating this imbalance.
3. No difference.
4. It can be normal only if you use a very complex model like yolov3_5l.cfg with a hard dataset. Do you train with the -map flag? https://github.com/AlexeyAB/darknet#when-should-i-stop-training What mAP do you get now?
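To judge how unbalanced a dataset is before trying to even it out, the labels can be counted per class. This is a rough sketch assuming YOLO-format label text; both function names are hypothetical, and the ratio is just one simple way to express imbalance.

```python
from collections import Counter


def count_instances(label_texts):
    """Count labeled objects per class id across all label files."""
    counts = Counter()
    for text in label_texts:
        for line in text.splitlines():
            fields = line.split()
            if len(fields) == 5:  # class x_center y_center width height
                counts[int(fields[0])] += 1
    return counts


def imbalance_ratio(counts):
    """Largest class count divided by the smallest; 1.0 means balanced.

    Expects at least one class with a non-zero count.
    """
    return max(counts.values()) / min(counts.values())


# The two splits from the question above, expressed as per-class counts.
mild = Counter({0: 120, 1: 120, 2: 120, 3: 120, 4: 20})
severe = Counter({0: 420, 1: 20, 2: 20, 3: 20, 4: 20})
print(imbalance_ratio(mild), imbalance_ratio(severe))
```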

Btw, is there any paper I could refer to for these questions?

I don't know a single paper that describes all of this.

blake-ding commented 5 years ago

@AlexeyAB

Thank you very much for your kind reply. I am training yolov3_5l.cfg again at 416x416 (batch 64/64) because of an out-of-memory issue last week (at 608x608 with batch 64/64). I will update the mAP later, thanks.

Q1. As I asked before, what if our objects are small, similar, and numerous (e.g. 2000 objects per image)? The labeling job is hard and tiring. In that situation, could we extract several objects from the original image at the same resolution and use these partial images as training data?

Q2. Suppose I configure width x height = 416x416 in yolov3.cfg and training image A is 1024x1024. For the "resize" process, does YOLOv3 resize A from its original size (1024x1024)? Or does YOLOv3 first resize all training data to 416x416 and then resize A' (416x416) to another random size (e.g. 608x608)?

BR

AlexeyAB commented 5 years ago

Q1 Can you show an example?

Q2 Every image will be resized to the current network size. If you set width=416 height=416 random=1 and the current network size is 608x608, then the image will be resized from 1024x1024 -> 608x608.

blake-ding commented 5 years ago

@AlexeyAB ,

I am tracing the source code now and have several questions:

1. In parser.c (if batch=64 / subdivisions=64):

        net->time_steps = option_find_int_quiet(options, "time_steps",1);
        net->batch /= subdivs; //net.batch = 1
        net->batch *= net->time_steps; //time_steps undefined, net.batch = 1

   And in detector.c:

        const int actual_batch_size = net.batch * net.subdivisions;

   Therefore, actual_batch_size = 64/64x64. I can't catch the purpose, why?
2. In data.c, in function fill_truth_region():

        ...
        randomize_boxes(boxes, count); // <-- this function just swaps the order of label info, I don't know why?
        correct_boxes(boxes, count, dx, dy, sx, sy, flip);

3. If we configure max=2000 in the [yolo] layer, then in data.c, load_data_detection() -> d.y = make_matrix(n, 5*boxes); It means we need to assign 5x2000 fixed space. Does it cause an out-of-memory problem?

Thank you.

BR

AlexeyAB commented 5 years ago

@blake-ding

Therefore, actual_batch_size = 64/64x64, I can't catch the purpose, why?

actual_batch_size = net.batch * net.subdivisions is used only to show the batch= value that is specified in the cfg-file.

net.batch * net.subdivisions is the number of images used for one weight update.

net.batch is the number of images used in one forward-backward pass.
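The relationship between the three numbers can be sketched as a toy loop. This is an illustration of the accumulate-then-update scheme described above, not darknet code; the function name is hypothetical, and rejecting a batch that is not divisible by subdivisions is a simplification made for this example.

```python
def simulate_iteration(batch: int, subdivisions: int):
    """Return (images per forward-backward pass, passes, images per weight update)."""
    if batch % subdivisions != 0:
        raise ValueError("this sketch assumes batch is divisible by subdivisions")
    mini_batch = batch // subdivisions  # corresponds to net.batch after parsing the cfg
    images_seen = 0
    passes = 0
    for _ in range(subdivisions):
        images_seen += mini_batch       # one forward-backward pass; gradients accumulate
        passes += 1
    # The weights are updated once here, after all mini-batches have been processed.
    return mini_batch, passes, images_seen


print(simulate_iteration(64, 64))
print(simulate_iteration(64, 16))
```

With batch=64 and subdivisions=64, each pass handles a single image, which is the lowest-memory setting, while the weight update still covers 64 images.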

randomize_boxes(boxes, count); // <-- this function just swaps the order of label info, , I don't know why?

If you have, for example, 100 labels on one image and max=90 (the default), then only the first 90 labels will be used, so randomize_boxes() allows a different set of 90 labels to be used each time.
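The effect of shuffling before truncating can be sketched like this. It is a simplified stand-in for the idea, not the darknet implementation; `pick_labels` and its parameters are hypothetical names.

```python
import random


def pick_labels(boxes, max_boxes: int = 90, rng=None):
    """Shuffle a copy of the labels, then keep at most `max_boxes` of them."""
    if rng is None:
        rng = random.Random()
    shuffled = list(boxes)
    rng.shuffle(shuffled)
    return shuffled[:max_boxes]


labels = list(range(100))                   # stand-ins for 100 labeled boxes
first_load = pick_labels(labels, 90, random.Random(1))
second_load = pick_labels(labels, 90, random.Random(2))
print(len(first_load), len(second_load))
```

Without the shuffle, the last 10 labels in the file would never reach the network. With it, every label has a chance of being used on each load of the image.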

If we configure the max=2000 in [yolo] layer In data.c load_data_detection() -> d.y = make_matrix(n, 5*boxes); It means we need to assign 5x2000 fixed space, does it cause out of memory problem?

It allocates 5x2000x sizeof(float) ~= 40 KB space. There is no out of memory problem. What is the fixed space? It allocates memory on heap by using calloc(): https://github.com/AlexeyAB/darknet/blob/2d747cab2bf94d3d81a562bc49614ac2c4e661bf/src/matrix.c#L76-L87
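The arithmetic behind that figure, as a quick check. The function name is hypothetical, and a 4-byte float is assumed.

```python
FLOAT_BYTES = 4  # assumed sizeof(float)


def truth_bytes(images: int, max_boxes: int, values_per_box: int = 5) -> int:
    """Bytes needed for a label matrix with `images` rows of 5 * max_boxes floats."""
    return images * max_boxes * values_per_box * FLOAT_BYTES


print(truth_bytes(1, 2000))    # one image with max=2000
print(truth_bytes(64, 2000))   # 64 images loaded together
```

Even for 64 images this is a few megabytes of host memory, which is small next to the GPU memory taken by the network's activations.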

blake-ding commented 5 years ago

@AlexeyAB ,

Sorry for the late update.

For my previous question : Q1.

As I asked before, what if our objects are small, similar, and numerous (e.g. 2000 objects per image)? The labeling job is hard and tiring. In that situation, could we extract several objects from the original image at the same resolution and use these partial images as training data?

[AlexeyAB] Can you show an example? => My purpose is to train a model to detect all honeycombs when I only have 20 images (see the attached images).

Q1.1 Could I separate out several honeycombs, or even a single honeycomb, together with some background, as new training data?

A: One image (a1) includes 2000 labels.
B: Image (a1) is split into 2000 small images, each including only one honeycomb with some background.

Is there any difference in training between A and B?

Q1.2 Actually, I am still tracing the code and want to find out whether there is any difference in how the conv. layers extract features with methods A and B.

Q2

If we configure the max=2000 in [yolo] layer In data.c load_data_detection() -> d.y = make_matrix(n, 5*boxes); It means we need to assign 5x2000 fixed space, does it cause out of memory problem?

[AlexeyAB] It allocates 5x2000x sizeof(float) ~= 40 KB space. There is no out of memory problem. What is the fixed space?

If the batch size is 64, does that mean we need to allocate 64 x 40 KB = 2560 KB for the label data of the images in a batch (e.g. batch = 64)? When is d.y released, at the end of each epoch? And is it then re-allocated?

New Question Q3 In network.c

    for(i = 0; i < net.n; ++i){
        if(net.layers[i].cost){
            sum += net.layers[i].cost[0];
            ++count;
        }
    }
    return sum/count;

What is the intention of dividing by count (3)? Is it to speed up the calculation?

BR

AlexeyAB / darknet

How to train small object with limited training data? #2215