Tiny YOLO: Looking for suggestions to improve training on a custom dataset

saihv commented 6 years ago

I am currently working on object detection on a custom dataset, where a close to real time implementation on a Jetson TX2 is the final goal. Hence, I am trying to achieve a performance of ~30 fps (20-30 would be acceptable too as long as accuracy is not too bad) as well as a decent IoU.

As of now, I am using Tiny YOLO as my framework through Darknet, compiled with GPU and CUDNN support. The images are 640x360 in dimensions and I have about 100000 of them with around 10 classes of objects in total. I've trained tiny YOLO for about 80000 iterations and on an average, this has given me IoUs of around 50% on the test dataset with a performance of around 18 fps on the Jetson TX2: I am currently looking to improve these numbers while not affecting the performance too much. I was hoping to get some suggestions regarding this:

What steps can I take to 'customize' training to my dataset? Because I have multiple classes of objects, some of them are very small (bounding boxes of 50x50 pixels in size approx.); and tiny YOLO is having a lot of trouble specifically with these small objects, while performing decently on the bigger ones. Can I somehow retrain my network to focus on these small objects more? Or are there any modifications I can make in the cfg file to account for these small objects?

(I see in the README two points relating to this: parameters small_object=1 and random=1. Do these affect the performance adversely at the cost of increased accuracy?)

Does YOLO have a performance boost when working on square images? i.e., is there any noticeable improvement in resizing the images to square?
Is IoU the best metric to check when trying to increase or derease network-resolution (width and height)? I gather from the README that these values create an accuracy vs speed trade-off, how should I pick the best values for my application?
When performing training or inference, in my application, each image has only one class of object in it. Can I somehow exploit this fact to improve the performance a little (somehow tell YOLO that the maximum number of objects it needs to finally detect is just one)?

Any other general comments aimed at improving accuracy or speed are very welcome too. Thanks!

AlexeyAB commented 6 years ago

Did you get IoU using darknet map or darknet recall command?
What width= height= params do you use in the cfg-file?
What learning_rate, steps, scales and decay do you use?

You can use small_object=1 and random=1 this params doesn't decrease speed:
- random=1 increase mAP +1%, it does not affect the speed of detection, but it deacrease speed of training
- small_object=1 is required only for objects with size less than 1%x1% i.e. smaller than 5x5 pixels (if you use width=416 height=416)
- also you can try to train using pre-trained tiny-yolo-voc.conv.13 instead of darknet19_448.conv.23 that you can get using command: darknet.exe partial cfg/tiny-yolo-voc.cfg tiny-yolo-voc.weights tiny-yolo-voc.conv.13 13
By default Yolo uses square network 416x416, and any image is resized to this square resolution 416x416 automatically, so you shouldn't do it. But there are several approaches for keeping aspect ration, so you can do pre-processing of images, as in the original darknet, or as in the OpenCV-dnn-Yolo: https://github.com/AlexeyAB/darknet/issues/232#issuecomment-336955485 But there are positive and negative points here.
For default networks (Yolo, Tiny-yolo) and default threshold=0.24, the IoU is the best accuracy metric. But if you use your own model (DenseNet-Yolo, ResNet-Yolo), that requires a different optimal threshold, then the best metric is mAP. Yes, the higher the network resolution, the slower it works, but the more accurately it detects (especially small objects).

3.1. Also, if all of your images (training and detection) have the same size 640x360, then you can try to change your network size width=640 height=352 and train with random=0
You can try to implement it in the source code in this function: https://github.com/AlexeyAB/darknet/blob/3ff4797b4cfd845c7115421e68ae2d584c289a24/src/region_layer.c#L333-L384

For example add this code at the end of the function, before this line: https://github.com/AlexeyAB/darknet/blob/3ff4797b4cfd845c7115421e68ae2d584c289a24/src/region_layer.c#L384

float max_prob = 0;
int max_index = 0, max_j = 0;
    int i,j,n;
    for (i = 0; i < l.w*l.h; ++i){
        int row = i / l.w;
        int col = i % l.w;
        for(n = 0; n < l.n; ++n){
            int index = i*l.n + n;
                for(j = 0; j < l.classes; ++j){
                    if(probs[index][j] > max_prob) {
                        max_prob = probs[index][j];
                        max_index = index;
                        max_j = j;
                    }
                }
       }
    }

    for (i = 0; i < l.w*l.h; ++i){
        int row = i / l.w;
        int col = i % l.w;
        for(n = 0; n < l.n; ++n){
            int index = i*l.n + n;
                for(j = 0; j < l.classes; ++j){
                    if(index != max_index || j != max_j) probs[index][j] = 0;
                }
       }
    }

Also you can re-generate anchors for your dataset:
- Just download Python 2.7: https://www.python.org/downloads/release/python-2714/
- insteall numpy if it required: C:\Python27\Scripts\pip install numpy
- calculate your anchors C:\Python27\python.exe gen_anchors.py -filelist data/train.txt -output_dir data/anchors -num_clusters 5 using this script: https://github.com/AlexeyAB/darknet/blob/master/build/darknet/x64/gen_anchors.py
- and change anchors in cfg-file to your: https://github.com/AlexeyAB/darknet/blob/master/cfg/tiny-yolo-voc.cfg#L118

MyVanitar commented 6 years ago

is it a good idea to pad images during preprocessing to make it compatible with 416 * 416?

I don't mean resizing them all to 416 * 416, but pad them in a way to make a correct fraction of 416, because if the network resizes them all to 416 * 416, many images which are not dividable by 416, will lose their aspect ratio. such as 300 * 300.

AlexeyAB commented 6 years ago

@VanitarNordic There are positive and negative points here for each case: https://github.com/AlexeyAB/darknet/issues/232#issuecomment-336955485

original Darknet: (+) keep aspect ratio, (-) has the smallest size of object - this further worsens the detection of small objects
OpenCV-dnn-Yolo: (+) keep aspect ratio, (-) removes part of the image - you will not be able to detect objects at the edges of the image
this Darknet repo: (+) object has the biggest size, (-) does not keep aspect ratio - If the sizes of the images in training and detection datasets are very different, then the accuracy will be reduced

Because I train my models on training dataset with the same image size (1280x720 or 1920x1080) as sizes of detection dataset, then I shouldn't keep aspect ratio, so for me the best option is to use this Darknet repository with the maximum object size.

MyVanitar commented 6 years ago

Okay, therefore this Darknet Repo does not keep the aspect ratio. I think it is the same with SSD. if I pad the image to build a correct fraction or multiplication of 416, then is it good (for this repo)?

AlexeyAB commented 6 years ago

@VanitarNordic

if I pad the image to build a correct fraction or multiplication of 416, then is it good (for this repo)?

Do you mean, that you will do as in original Darknet, but by yourself? It will keep aspect ratio, but you will have the smallest size of object. If you have small object - then this is bad idea. But if you have big objects, but all your images has different size - then this is good idea.

MyVanitar commented 6 years ago

Do you mean, that you will do as in original Darknet, but by yourself?

By doing it myself outside before starting the training, You said this repo does not keep the aspect ratio, therefore I want to do the padding outside to make a correct fraction or multiplication of 416 for all images, by the padding method. Therefore even if the network does not keep the aspect ratio, but because of the correct numbers, then the object will not have un-normal shapes.

saihv commented 6 years ago

@AlexeyAB

Thanks a lot for the detailed reply! I will note your suggestions. Replies:

Did you get IoU using darknet map or darknet recall command?

I used darknet recall. But the 50% IoU I mentioned was on the test dataset, not validation. Validation IoU (the last line in the output) was about 65% IIRC.

What width= height= params do you use in the cfg-file?

As of now, just the defaults: 416x416.

What learning_rate, steps, scales and decay do you use?

momentum=0.9
decay=0.0005

learning_rate=0.001
policy=steps
steps=-1,100,80000,100000
scales=.1,10,.1,.1

saihv commented 6 years ago

@AlexeyAB

Thanks a lot for the tips! The one that made the biggest difference was using 640x352 with random=0. Strangely, regenerating the anchors actually reduced the IoU (and map). Is this possible?

Also, would you happen to have any tips for improving training only on certain classes? My training data is somewhat unbalanced: some classes have a lot more images than others, and running detector map looks like this:

detections_count = 35981, unique_truth_count = 16923  
class_id = 0, name = boat, 981   ap = 100.00 % 
class_id = 1, name = building,      ap = 79.95 % 
class_id = 2, name = car,      ap = 90.91 % 
class_id = 3, name = drone,      ap = 90.91 % 
class_id = 4, name = group,      ap = 80.07 % 
class_id = 5, name = horseride,      ap = 90.91 % 
class_id = 6, name = paraglider,      ap = 100.00 % 
class_id = 7, name = person,      ap = 90.91 % 
class_id = 8, name = riding,      ap = 90.91 % 
class_id = 9, name = truck,      ap = 72.41 %       // Slightly lower iou/precision on this class for example
class_id = 10, name = wakeboard,      ap = 83.83 % 
class_id = 11, name = whale,      ap = 100.00 %

Although the map/IoU looks really good on validation, it is slightly lower on test data: so I am just curious if I can improve training for only specific classes.

AlexeyAB commented 6 years ago

@saihv Simple solution is to do many duplicates of images+labels of classes which have small number of images. Then re-generate train.txt by using Yolo_mark. Due to data augmentation even duplicates of images+labels will increase accuracy.

saihv commented 6 years ago

Got it, thank you! I will try that.

Just one last question in the custom dataset area, if you don't mind:

I am working on an object detection contest, where I only have access to training data. I am supposed to train a model, which is then evaluated on a test set with the same classes (I don't have access to these test images); and the evaluation metric is the average IoU. I am splitting the given data into train and validation (as usual) and training my tiny YOLO model: but there is a noticeably big difference in IoU between validation and test (avg. 80% on validation vs 60% on test).

I guess this could be because of multiple reasons: the test data might be more challenging, or perhaps it has a different distribution of images per class etc. But conceptually, this seems like a tricky problem because the model does perform well in validation, but it still looks like there is some overfitting when it comes to new data. So that makes me curious, are there any tips or tricks to making a model more generalized? Thanks!

AlexeyAB commented 6 years ago

the test data might be more challenging, or perhaps it has a different distribution of images per class etc.

Yes.

So that makes me curious, are there any tips or tricks to making a model more generalized?

Increase params in the data augmentation and train 10x times more iterations. random=1 jitter=0.4 increase width and height to 608 or 832. If you should detect objects with different colors as the same class_id, increase hue=0.2 saturation=1.8 exposure=1.8

Also fix this mistake: https://github.com/AlexeyAB/darknet/blob/8b5344ee2dc551dbe673020a33021e7f84f305f1/cfg/yolov3-tiny.cfg#L175 mask = 0,1,2

saihv commented 6 years ago

Thanks! I'm using Tiny YOLO v2 right now at 640x352 (all images, training and test are at 640x360) because of FPS requirements. I'll try changing random and jitter.

On Tue, May 15, 2018, 9:37 AM Alexey notifications@github.com wrote:

the test data might be more challenging, or perhaps it has a different distribution of images per class etc.

Yes.

So that makes me curious, are there any tips or tricks to making a model more generalized?

Increase params in the data augmentation and train 10x times more iterations. random=1 jitter=0.4 increase width and height to 608 or 832. If you should detect objects with different colors as the same class_id, increase hue=0.2 saturation=1.8 exposure=1.8

Also fix this mistake: https://github.com/AlexeyAB/darknet/blob/8b5344ee2dc551dbe673020a33021e7f84f305f1/cfg/yolov3-tiny.cfg#L175 mask = 0,1,2

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/AlexeyAB/darknet/issues/406#issuecomment-389232346, or mute the thread https://github.com/notifications/unsubscribe-auth/ACKz1oIKV6NniMEnONR6G-znTTtC_06Cks5tywQwgaJpZM4SSLnq .

AlexeyAB commented 6 years ago

If you want to use random=1 with non-square network 640x352 then you should download the latest version of Darknet from this GitHub repository.

Also did you re-calculate anchors? You can do it too for -width 20 -height 11 https://github.com/AlexeyAB/darknet#how-to-improve-object-detection

saihv commented 6 years ago

I tried regenerating anchors in the past through this command:

gen_anchors.py -filelist data/train.txt -output_dir data/anchors -num_clusters 5

But using those anchors actually decreased the IoU. I now see that I should probably try with those width and height arguments (net.w/32 and net.h/32 I guess?)

AlexeyAB commented 6 years ago

@saihv Set here: https://github.com/AlexeyAB/darknet/blob/8b5344ee2dc551dbe673020a33021e7f84f305f1/scripts/gen_anchors.py#L17-L18

 width_in_cfg_file = 640. 
 height_in_cfg_file = 352.

saihv commented 6 years ago

Oops, I should have looked at that! Thanks for pointing it out, will change it and try.

saihv commented 6 years ago

I tried training with random=1, which produces slightly lower validation IoU (avg. 75% vs 80% with random=0); but most of the inaccuracy comes from classes that have relatively smaller objects (I did include small_object=1 in the cfg file; but the objects are not smaller than 1% pixels, so I don't know if this parameter helps)

Would it be helpful to train at a higher resolution (more than 640x360 but still non-square) and random=0 but do inference at 640x352?

AlexeyAB commented 6 years ago

@saihv What mAP can you get for random=1 and random=0? Usually training resolution should be ~the same as detection resolution, if the images in the training and detection dataset have the same resolution.

saihv commented 6 years ago

random=0:

detections_count = 35981, unique_truth_count = 13965 
class_id = 0, name = boat, 981   ap = 100.00 % 
class_id = 1, name = building,   ap = 79.95 % 
class_id = 2, name = car,    ap = 90.91 % 
class_id = 3, name = drone,      ap = 90.91 % 
class_id = 4, name = group,      ap = 80.07 % 
class_id = 5, name = horseride,      ap = 90.91 % 
class_id = 6, name = paraglider,     ap = 100.00 % 
class_id = 7, name = person,     ap = 90.91 % 
class_id = 8, name = riding,     ap = 90.91 % 
class_id = 9, name = truck,      ap = 72.41 % 
class_id = 10, name = wakeboard,     ap = 83.83 % 
class_id = 11, name = whale,     ap = 100.00 % 
 for thresh = 0.24, precision = 0.97, recall = 0.98, F1-score = 0.98 
 for thresh = 0.24, TP = 16597, FP = 453, FN = 326, average IoU = 81.34 % 

 mean average precision (mAP) = 0.892338, or 89.23 %

random=1:

detections_count = 38874, unique_truth_count = 13965  
class_id = 0, name = boat, 874   ap = 90.91 % 
class_id = 1, name = building,   ap = 66.27 % 
class_id = 2, name = car,    ap = 90.89 % 
class_id = 3, name = drone,      ap = 90.53 % 
class_id = 4, name = group,      ap = 59.67 % 
class_id = 5, name = horseride,      ap = 90.63 % 
class_id = 6, name = paraglider,     ap = 100.00 % 
class_id = 7, name = person,     ap = 90.89 % 
class_id = 8, name = riding,     ap = 90.84 % 
class_id = 9, name = truck,      ap = 69.11 % 
class_id = 10, name = wakeboard,     ap = 80.74 % 
class_id = 11, name = whale,     ap = 90.87 % 
 for thresh = 0.25, precision = 0.95, recall = 0.95, F1-score = 0.95 
 for thresh = 0.25, TP = 13223, FP = 658, FN = 742, average IoU = 75.12 % 

 mean average precision (mAP) = 0.842781, or 84.28 %

Please note the difference in AP for classes 1, 4 and 9: which are the challenging ones with smaller object sizes. Both configurations were trained for about ~120k iterations after which the mAP settles and does not change by much.

AlexeyAB commented 6 years ago

@saihv Try to change these lines: https://github.com/AlexeyAB/darknet/blob/4403e71b330b42d3cda1e0721fb645cf41bac14f/src/detector.c#L132-L134 to these:

            float random_val = rand_scale(1.4); // *x or /x
            int dim_w = roundl(random_val*init_w / 32) * 32;
            int dim_h = roundl(random_val*init_h / 32) * 32;

And train with random=1, what mAP will you get?

saihv commented 6 years ago

Trained it for 120k iterations with those changes, and now the mAP is pretty close to random=0:

detections_count = 30511, unique_truth_count = 13965  
class_id = 0, name = boat, 511   ap = 90.91 % 
class_id = 1, name = building,   ap = 82.86 % 
class_id = 2, name = car,    ap = 90.91 % 
class_id = 3, name = drone,      ap = 90.91 % 
class_id = 4, name = group,      ap = 78.91 % 
class_id = 5, name = horseride,      ap = 100.00 % 
class_id = 6, name = paraglider,     ap = 100.00 % 
class_id = 7, name = person,     ap = 90.90 % 
class_id = 8, name = riding,     ap = 90.91 % 
class_id = 9, name = truck,      ap = 72.62 % 
class_id = 10, name = wakeboard,     ap = 88.94 % 
class_id = 11, name = whale,     ap = 100.00 % 
 for thresh = 0.25, precision = 0.97, recall = 0.98, F1-score = 0.97 
 for thresh = 0.25, TP = 13632, FP = 416, FN = 333, average IoU = 80.81 % 

 mean average precision (mAP) = 0.898222, or 89.82 %

But I guess because random=1 switches between low and high resolutions, it might be beneficial to train for more iterations.

AlexeyAB commented 6 years ago

But I guess because random=1 switches between low and high resolutions, it might be beneficial to train for more iterations.

Yes. random=1 is almost the same as if there will be 2x more images, so it requires 2x more itrations.

What jitter do you use in all these cases?

saihv commented 6 years ago

I am still using jitter=0.2. I remember one of your suggestions was to move to 0.4, but I was just testing one thing at a time, so that's next on my list.

AlexeyAB commented 6 years ago

Yes, better to test one thing at a time. Changing jitter from 0.2 to 0.4 requires about 5-10x times more iterations.

AlexeyAB / darknet

Tiny YOLO: Looking for suggestions to improve training on a custom dataset #406

Yes.