experiencor / keras-yolo2

Easy training on custom datasets. Various backends (MobileNet and SqueezeNet) supported. A YOLO demo to detect raccoons, running entirely in the browser, is accessible at https://git.io/vF7vI (not on Windows).

comparing weights #291

Open rodrigo2019 opened 6 years ago

rodrigo2019 commented 6 years ago

Hello, I did some tests and I would like to share the results. I trained 4 models, comparing the same model with and without pre-trained weights. My configuration for these tests was:

Batch size = 4
Image size = 224x224
Dataset = VOC
Epochs = 153 (150 + 3 warmup)

Full Yolo:

mAP: [image]

Loss: [image]

PS: after epoch 42 the model without pre-trained weights started to overfit, ending with an even worse result than the model with pre-trained weights. Why did the model without pre-trained weights get a lot of NaN values in validation?

Tiny Yolo:

mAP: [image]

Loss: [image]

In the Tiny Yolo case, the model without pre-trained weights gets almost the same result as the one with pre-trained weights, and converged more quickly. Could the pre-trained weights be wrong in this case? Maybe the wrong model was uploaded?

General questions:

  1. Were these pre-trained weights trained on the ImageNet dataset (the one with 155 GB and 1.2M images)?
  2. Was the model trained for classification?
  3. What was the configuration of the training?
  4. How many epochs was it trained for, and how many hours did it take?
  5. In this implementation, why is the IoU threshold for mAP validation 0.3 and not 0.5 as usual?
Axel13fr commented 6 years ago

Hi Rodrigo, no answers from me, just a side question: how did you get these metrics into TensorBoard? I've had difficulties getting them through the TensorBoard callback. Is any code available that does that? Or did you get the mAP by adding a custom Keras metric? Thanks.

rodrigo2019 commented 6 years ago

Hi @Axel13fr, you can check my fork. I made some changes to this repository; I'm waiting for @experiencor to answer my first PR before opening more PRs with these modifications.

In my fork you can currently:

  1. train with different H×W input dimensions
  2. train in grayscale
  3. create a custom backend
  4. get mAP in the TensorBoard callback by default (a sketch of the idea follows this list)
  5. parse CSV annotations
  6. fall back to a fresh model if no pre-trained weights are found, instead of raising an error as this repo does
  7. create a backup of each training run (useful for comparing models)
  8. get an inference model with no dependency on the "YOLO" class
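A minimal, hypothetical sketch of the mAP-to-TensorBoard callback mentioned in item 4, assuming TF 1.x; `evaluate_map` is a placeholder for whatever computes the metric, and this is not the fork's actual code:

    import tensorflow as tf
    from keras.callbacks import Callback

    class MapTensorBoard(Callback):
        """Writes an externally computed mAP value to TensorBoard each epoch."""
        def __init__(self, log_dir, evaluate_map):
            super(MapTensorBoard, self).__init__()
            self.writer = tf.summary.FileWriter(log_dir)  # TF 1.x summary writer
            self.evaluate_map = evaluate_map              # placeholder function

        def on_epoch_end(self, epoch, logs=None):
            summary = tf.Summary(value=[
                tf.Summary.Value(tag='mAP', simple_value=self.evaluate_map())])
            self.writer.add_summary(summary, epoch)
            self.writer.flush()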

I would like to suggest all these modifications for this repo, but the owner needs to answer me first.

Axel13fr commented 6 years ago

Hi Rodrigo, thanks, I cherry-picked your mAP TensorBoard callback.

Regarding your questions, I believe you used the .h5 backend files, right? Have you tried using the weights from the darknet implementation to train your network? This is what's done in the Jupyter notebook using the WeightReader utils.

As for the NaN values: weight initialization is very sensitive for large networks, so you may want to try something different like https://www.tensorflow.org/api_docs/python/tf/contrib/layers/xavier_initializer
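In Keras this is just a matter of passing an initializer to the layer; a minimal sketch (the layer and shapes are illustrative, not from the repo):

    from keras.layers import Input, Conv2D
    from keras.models import Model

    # Glorot/Xavier initialization on a conv layer (Keras' built-in names
    # for the Xavier scheme are 'glorot_normal' / 'glorot_uniform').
    inp = Input(shape=(224, 224, 3))
    out = Conv2D(16, (3, 3), padding='same',
                 kernel_initializer='glorot_normal')(inp)
    model = Model(inp, out)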

rodrigo2019 commented 6 years ago

Regarding your questions, I believe you used the .h5 backend files, right?

Yes, I am.

Have you tried using the weights from the darknet implementation to train your network?

I didn't know about the Jupyter notebook; I will test it next week (right now I'm testing RetinaNet) and post the results here.

As for the NaN values: weight initialization is very sensitive for large networks, so you may want to try something different like https://www.tensorflow.org/api_docs/python/tf/contrib/layers/xavier_initializer

I will check the xavier_initializer.

Thanks @Axel13fr

rodrigo2019 commented 6 years ago

I did more tests, now with my particular dataset, and I got some interesting results.

Epoch 25 had the best val_loss of the training, but the mAP was 0: [bestloss image]

Epoch 256 had the best mAP, but its val_loss was much higher than the best one: [bestmap image]

It looks like it's not a good idea to keep the model with the best val_loss instead of the one with the best mAP. Also, with batch_size = 32 I got much worse results than with batch_size = 4.
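A practical consequence, sketched under assumptions: if some callback writes an 'mAP' entry into the Keras logs dict every epoch (the key name here is hypothetical), checkpointing can monitor it instead of val_loss:

    from keras.callbacks import ModelCheckpoint

    # Keep the weights with the best logged 'mAP' rather than the best
    # val_loss; assumes another callback fills logs['mAP'] each epoch.
    checkpoint = ModelCheckpoint('best_map.h5', monitor='mAP',
                                 mode='max', save_best_only=True)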

Axel13fr commented 6 years ago

Interesting. Was it done with Full or Tiny Yolo? Did you do a warmup before training? Also, have you tried freezing some of the layers, especially during the warmup phase? This can help initialize the last layer of your network without causing a large gradient during training (which would badly affect the feature extractor).

rodrigo2019 commented 6 years ago

@Axel13fr, in this test I'm using a custom backend, because I would like to run at high FPS on CPU, so I designed a network based on Tiny Darknet (not Tiny Yolo), and my training starts from scratch. For now it is too painful for me to create pre-trained weights for this custom backend, because I'm not sure the network design will be the best choice. I also don't know the best way to generate pre-trained weights for the backend; I will probably try autoencoders, because they look easier to train, but again, I don't know if that is the best choice.

Did you do a warmup before training?

Yes, 3 epochs

Also, have you tried freezing some of the layers, especially during the warmup phase? This can help initialize the last layer of your network without causing a large gradient during training (which would badly affect the feature extractor).

That sounds like a very nice idea, I never imagined doing that. Have you ever done it?

About the batch size: looking at this discussion in the RetinaNet repo, it suggests that the batch size doesn't matter and the results will always be close to each other. Is that always true? Also, is this loss function exactly the same as the original? I didn't see the penalty factors we have here in the original paper.

The paper says:

To avoid overfitting we use dropout and extensive data augmentation. A dropout layer with rate = .5 after the first connected layer prevents co-adaptation between layers [18]. For data augmentation we introduce random scaling and translations of up to 20% of the original image size. We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.

I will start by putting a dropout layer in the model, but does "first connected layer" mean the first convolutional layer?

Axel13fr commented 6 years ago

@rodrigo2019 this repo implements Yolo V2, which is a bit different from the original version you mention in the paper. Specifically, it uses a conv layer at the end of the network instead of fully connected layers + it uses batch normalization which is a great method to reduce over-fitting (this is why you don't see the dropout anymore in V2).

Still, I'm trying out a lot of stuff to train Tiny Yolo properly on one class, including weight freezing. It's really easy to do if you want to try it: just set trainable=False for a given layer. You could do it for the first couple of conv layers, as they are the least specific (do it for a complete block, i.e. conv + batch norm + LeakyReLU + MaxPool).

Example:


        from keras.layers import Conv2D, BatchNormalization

        trainable = False  # set to False to freeze the block below
        x = Conv2D(16, (3,3), strides=(1,1), padding='same', name='conv_1', use_bias=False, trainable=trainable)(input_image)
        x = BatchNormalization(name='norm_1', trainable=trainable)(x)
rodrigo2019 commented 6 years ago

This link has a good explanation of YOLO.

Have you been successful in your trainings?

Axel13fr commented 6 years ago

Thanks for the link, I'll take a look :+1:

I've managed to train Full Yolo properly on one class on the COCO dataset, but I'm still trying to get Tiny Yolo working properly... it seems to overfit too early, but I will try to see how the mAP evolves even if the validation error goes higher.

rodrigo2019 commented 6 years ago

By the way, on PASCAL VOC, YOLOv2 got >70 in the mAP evaluation, and in my tests with pre-trained weights I got just 35 (with IoU threshold 0.3; with 0.5 it will probably be worse).

What were your results on the COCO dataset?

rodrigo2019 commented 6 years ago

In the last 2 training runs I got more than 5 hours of training resulting in mAP = 0: [image]

Respective losses: [image]

Axel13fr commented 6 years ago

@rodrigo2019 I was training on a single class (boat) only on COCO; I got 34 with Full Yolo, only 10 with Tiny. The Yolo authors had 42 on 80 classes. It is worth noting, though, that the per-class AP can vary a lot (80+ for cats and as low as 22 for bottles, for example, in the first YOLO paper), so the average over classes can be misleading.

By the way, this repo's implementation never reached the published mAP because of missing training features, as well as some unknown training tricks of the Yolo authors (and possibly some bugs left here?).

rodrigo2019 commented 6 years ago

Did you find the mAP for Tiny Yolo anywhere? I didn't see it anywhere.

Today I will convert the YOLOv2 weights (original, from the darknet framework), 416x416 trained on VOC, and apply the mAP validation from this repo to compare the results. Doing this, I think we can be sure whether the model and the validation are correct.

Axel13fr commented 6 years ago

The mAP for Tiny Yolo is on the yolo darknet website. For Tiny Yolo I implemented loading the darknet weights from the darknet website, but the model they used was for the VOC dataset and it had 512 filters on the last conv (the one before the randomly initialized final layer) instead of 1024 as in this implementation, so I had to modify it. I will try the evaluation on VOC to double-check.

rodrigo2019 commented 6 years ago

~~Why is it necessary to randomize the last conv layer? Following the Jupyter tutorial I skipped the randomization part, and I got NaN values for all predictions (I didn't train). Is it possible to use the model converted directly from the darknet framework without training again?~~

Converting the weights from yolo-voc did not work, but it worked for yolo-coco. I didn't know about the filter difference you mentioned; I thought they were equal for both. Now it is working without training, thank you @Axel13fr

Axel13fr commented 6 years ago

So you got the correct mAP for coco and voc by loading the darknet weights?

As for the layer sizes, you can check and change them according to the cfg files from the darknet yolo site (there is a cfg file for each network containing the topology).

rodrigo2019 commented 6 years ago

So you got the correct mAP for coco and voc by loading the darknet weights?

Not yet, I need to convert the JSON format into XML or CSV, or create a JSON parser; but checking the results by drawing the bounding boxes, they look very good.
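For what it's worth, reading the COCO JSON directly is not much code; a minimal sketch (the file path is illustrative):

    import json

    # COCO annotations: 'images' lists the files, 'annotations' lists boxes
    # as [x_min, y_min, width, height].
    with open('annotations/instances_val2017.json') as f:
        coco = json.load(f)

    images = {img['id']: img['file_name'] for img in coco['images']}
    for ann in coco['annotations']:
        x, y, w, h = ann['bbox']
        print(images[ann['image_id']], ann['category_id'], x, y, x + w, y + h)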

As for the layer sizes, you can check and change them according to the cfg files from the darknet yolo site (there is a cfg file for each network containing the topology).

I knew about the cfg files but didn't know about the difference; also, in the cfg file I read this:

object_scale=5
noobject_scale=1
class_scale=1
coord_scale=1

These are the same parameters used in this repo.
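Schematically, those scales weight the four terms of the YOLOv2 loss. A toy sketch with dummy placeholder values (in the repo these are tensors computed inside the custom loss):

    # cfg scales weighting the loss terms; the loss_* values are dummies here
    object_scale, noobject_scale, class_scale, coord_scale = 5.0, 1.0, 1.0, 1.0
    loss_coord, loss_obj, loss_noobj, loss_class = 0.8, 0.5, 0.3, 0.2

    total_loss = (coord_scale * loss_coord       # box x, y, w, h errors
                  + object_scale * loss_obj      # confidence where objects exist
                  + noobject_scale * loss_noobj  # confidence where none exist
                  + class_scale * loss_class)    # classification error
    print(total_loss)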

If you want, you can download the conversion I did here.

Axel13fr commented 6 years ago

I implemented the complete reading of darknet weights for Full Yolo, and a mAP evaluation on the COCO dataset (without training) gave me 45.32, a bit higher than the published 42.1. I think this is due in part to the obj_threshold being 0.3 instead of 0.5.

There is a high variance between the mAP of different classes; the details below give a good hint of the kind of object Yolo is good at on the COCO dataset (best is cat, worst is toaster):

person 0.5003, bicycle 0.3720, car 0.3260, motorcycle 0.5293, airplane 0.7025, bus 0.6725, train 0.8455, truck 0.3366, boat 0.3018, traffic light 0.2383, fire hydrant 0.7284, stop sign 0.6543, parking meter 0.3811, bench 0.3031, bird 0.3458, cat 0.8638, dog 0.7310, horse 0.6656, sheep 0.5174, cow 0.5198, elephant 0.7914, bear 0.8676, zebra 0.7483, giraffe 0.8019, backpack 0.1488, umbrella 0.3683, handbag 0.0973, tie 0.4003, suitcase 0.4296, frisbee 0.5951, skis 0.3044, snowboard 0.4185, sports ball 0.3039, kite 0.3474, baseball bat 0.3975, baseball glove 0.3562, skateboard 0.5641, surfboard 0.4926, tennis racket 0.5807, bottle 0.2178, wine glass 0.2765, cup 0.3061, fork 0.2703, knife 0.1550, spoon 0.1319, bowl 0.3513, banana 0.3014, apple 0.2374, sandwich 0.4844, orange 0.2872, broccoli 0.3941, carrot 0.2201, hot dog 0.3819, pizza 0.6292, donut 0.5248, cake 0.4822, chair 0.2927, couch 0.5671, potted plant 0.3471, bed 0.6953, dining table 0.4455, toilet 0.8429, tv 0.6970, laptop 0.6584, mouse 0.4815, remote 0.2257, keyboard 0.5966, cell phone 0.3244, microwave 0.6677, oven 0.5966, toaster 0.0525, sink 0.4845, refrigerator 0.6732, book 0.0910, clock 0.5997, vase 0.3862, scissors 0.4191, teddy bear 0.6039, hair drier 0.0580, toothbrush 0.2469

rodrigo2019 commented 6 years ago

Really nice, could you share your code? I am interested in how you read the COCO annotations.

rodrigo2019 commented 6 years ago

I am trying to implement the multi-scale feature, but it doesn't look too easy. I don't understand why some weights need to be initialized in this way:

        # initialize the weights of the detection layer
        layer = self.model.layers[-4]
        weights = layer.get_weights()

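        # note: scaling the random init by 1/(grid_h*grid_w) presumably keeps
        # the untrained detection layer's initial outputs small (my reading)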
        new_kernel = np.random.normal(size=weights[0].shape)/(self.grid_h*self.grid_w)
        new_bias   = np.random.normal(size=weights[1].shape)/(self.grid_h*self.grid_w)

        layer.set_weights([new_kernel, new_bias])

The kernel and bias initialization depends on the grid size, and with multi-scale training the Yolo model will have a variable grid size, so I don't know how to initialize these weights.

I was expecting the loss function to be more work, but it was quite easy to adapt it to multi-scale; on the other hand, changing the generator function is more complicated than I expected.

A quick and dirty way to implement multi-scale is to use the imgaug library, but I think it is not safe to zoom in on the image this way, because in some images you may get problems with annotations that fall outside the zoomed area; only zooming out looks safe. I already implemented it, you can check this branch if you want, but I haven't tested it yet. It is very similar to the original, but now the jitter is heavier, and I apply the jitter to the annotation keypoints using the same library instead of doing it manually as in the original implementation.
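A minimal sketch of the zoom-out-only idea with imgaug (using the newer bounding-box API for brevity, though the fork reportedly jitters annotation keypoints; all values are illustrative):

    import numpy as np
    import imgaug.augmenters as iaa
    from imgaug.augmentables.bbs import BoundingBox, BoundingBoxesOnImage

    # Toy image and one annotation box.
    image = np.zeros((224, 224, 3), dtype=np.uint8)
    bbs = BoundingBoxesOnImage(
        [BoundingBox(x1=50, y1=60, x2=150, y2=160)], shape=image.shape)

    # Zoom out only (scale <= 1.0), so boxes cannot be pushed outside the frame.
    aug = iaa.Affine(scale=(0.7, 1.0))
    image_aug, bbs_aug = aug(image=image, bounding_boxes=bbs)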

Axel13fr commented 6 years ago

@rodrigo2019 for the COCO dataset, the easiest is to convert it to VOC format using https://gist.github.com/chicham/6ed3842d0d2014987186#file-coco2pascal-py so that this repo can use it directly.

About the random init: this applies only to the very last layer, which is supposed to be customized for your application as it depends on your settings like grid size, classes, etc. The rest of the architecture remains unchanged, so you can do transfer learning (i.e. use pretrained weights instead of starting from zero).

In my case, as I used the COCO dataset, it's the same setup as for the darknet weights, so I could init this last layer with the darknet weights instead of randomly initializing it for a new training.

rodrigo2019 commented 6 years ago

But this initialization is done even in the frontend model here; when the backend weights are loaded, they don't affect the layer where this init is done, and if you load a whole pre-trained model, the pre-trained weights overwrite this initialization. Why are the default Keras initializers not good enough?

Thank you for the coco2pascal converter, it will be useful to me.

Axel13fr commented 6 years ago

when the backend weights are loaded, they don't affect the layer where this init is done

That's right, this doesn't affect the front end, so the front end needs to be initialized with something appropriate to start training it from scratch, and that is what this random init is doing (leaving it uninitialized would ruin the training).

if you load a whole pre-trained model, the pre-trained weights overwrite this initialization

Correct, that's when you do fine-tuning: you already have your full network trained and you want to adjust it a bit, so you start from the pre-trained weights.

The random init is for transfer learning: you start with a backend trained for feature extraction (usually on ImageNet) and you plug a front end into it to solve your specific problem.

rodrigo2019 commented 6 years ago

Is there some place where I can study this? Because I'm not understanding. I would understand randomizing the weights if some weights had been loaded into this layer, but the point is that no weights were previously loaded into this specific layer.

Axel13fr commented 6 years ago

That's for the case where no weights are loaded: then what are the values of the uninitialized variables of this layer?
0 would be a terrible case; high values could be as well. So better not to leave that undefined: simply use an init method, random being one of them.

There are many sources on this subject that you may want to look at. For example https://stats.stackexchange.com/questions/200513/how-to-initialize-the-elements-of-the-filter-matrix?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa

rodrigo2019 commented 6 years ago

Keras uses by default an initializer called "glorot_uniform", but for this layer the kernel initializer is "lecun_normal":

        # make the object detection layer
        output = Conv2D(self.nb_box * (4 + 1 + self.nb_class), 
                        (1,1), strides=(1,1), 
                        padding='same', 
                        name='DetectionLayer', 
                        kernel_initializer='lecun_normal')(features)

thank you

Axel13fr commented 6 years ago

You're absolutely right, I overlooked that. Seems like that random init call is useless after the kernel initializer from Keras!

rodrigo2019 commented 6 years ago

Maybe @experiencor was using the lecun_normal initializer before implementing this new initialization, which may exist in the original implementation in the darknet framework.

lichacha commented 6 years ago

If I want to change the Full Yolo network but I don't have the corresponding backend.h5, what can I do? Reading through the code, I find that users can't seem to modify the Full Yolo or Tiny Yolo networks. What should I do?

rodrigo2019 commented 6 years ago

Hi @lichacha, you can use my fork; please read its readme.md, there is an explanation of how to use a custom backend.

bkanaki commented 6 years ago

@rodrigo2019 @Axel13fr Do you guys have the weights converted for the Tiny-YOLO v2 model from darknet to this framework?

rodrigo2019 commented 6 years ago

@bkanaki just for Full Yolo on COCO, but you can create this model using the tools provided in this repo. If you do, please share it with us; I'm uploading the models here.

bkanaki commented 6 years ago

I read somewhere in the issues that it is difficult to reproduce the results of the Tiny-Yolo model. The darknet website reports it to be 23% on the COCO dataset. I will see if I can get good results and then post whenever I get time.

bkanaki commented 6 years ago

Hi @Axel13fr @rodrigo2019 I have one more question for you guys:

When you tried training from scratch, i.e., without using the pre-trained weights, did you ever see the loss going to NaN? For me, it is NaN from the very beginning and doesn't change if I train only on the person images from the COCO dataset.

@rodrigo2019, you had a question about whether the loss implemented here is the exact loss from the paper. I think this loss doesn't take the square root of the width and height predictions, but the original paper mentions it.

Even in the snippet in the Step-By-Step notebook, the author of this repo (@experiencor) mentions the square root, but I don't see it in the actual loss definition. Any reason why? Or was it overlooked?

rodrigo2019 commented 6 years ago

Yes, I got NaN values in some trainings, even when training from scratch. Did you try adding the square root?

bkanaki commented 6 years ago

Yes, I put in the square root. When debugging the loss function, I noticed that my loss was blowing up because of loss_wh. That is when I noticed that the sqrt was missing. This brought the loss down more quickly after warmup (I am aware of the +10 added to the loss during warmup) compared to without the square root. My mAP also increased from 25% to 68% on my dataset.

However, the NaN might also have to do with different anchors, as someone posted in the issues.

I still haven't attempted training from scratch, but I will as I have to do something very similar to what you did, i.e., reducing network size further.

rodrigo2019 commented 6 years ago

Wow, really interesting, could you share the code? I would like to try it.

bkanaki commented 6 years ago

Just add:

true_box_wh = tf.sqrt(true_box_wh)
pred_box_wh = tf.sqrt(pred_box_wh)

before line 215
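For reference, a hedged sketch of the resulting w/h term, with dummy tensors standing in for the ones computed in the repo's loss; the epsilon guard is my own addition (sqrt has an infinite gradient at exactly 0), and TF 1.x is assumed:

    import tensorflow as tf

    true_box_wh = tf.constant([[2.0, 3.0]])   # dummy ground-truth w, h
    pred_box_wh = tf.constant([[1.5, 3.5]])   # dummy predicted w, h
    eps = 1e-6

    # Square-root parameterization of the w/h error, as in the YOLOv1 paper.
    loss_wh = tf.reduce_sum(
        tf.square(tf.sqrt(true_box_wh + eps) - tf.sqrt(pred_box_wh + eps)))

    with tf.Session() as sess:
        print(sess.run(loss_wh))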

rodrigo2019 commented 6 years ago

@bkanaki, I tested it on my dataset and I got some improvements: [image]

This week I will train on the VOC dataset and compare the results. I think you found an important bug/mistake.

Axel13fr commented 6 years ago

@bkanaki Thanks for the square-root loss correction. I wrote a function to read the darknet weights for Tiny Yolo. It's an adaptation of the existing one, which reads the Full Yolo darknet weights:

def read_weights(self):
    weight_reader = WeightReader(TINY_YOLO_OFFICIAL_WEIGHTS_PATH)
    weight_reader.reset()
    nb_conv = 9
    last_layer_with_batch_norm = 7
    # Only reads weights for the backend conv layers. The final detection
    # layer is initialized randomly outside the constructor. This lets the
    # user choose the size of the output: number of boxes, classes
    for i in range(1, nb_conv+1):
        # For layers 1 to 7: read the batch norm parameters
        if i <= last_layer_with_batch_norm:
            norm_layer = self.model.get_layer("norm_" + str(i))

            size = np.prod(norm_layer.get_weights()[0].shape)

            beta  = weight_reader.read_bytes(size)
            gamma = weight_reader.read_bytes(size)
            mean  = weight_reader.read_bytes(size)
            var   = weight_reader.read_bytes(size)

            norm_layer.set_weights([gamma, beta, mean, var])

        conv_layer = self.model.get_layer("conv_" + str(i))
        if len(conv_layer.get_weights()) > 1:
            # Read bias & weights
            bias = weight_reader.read_bytes(np.prod(conv_layer.get_weights()[1].shape))
            kernel = weight_reader.read_bytes(np.prod(conv_layer.get_weights()[0].shape))
            kernel = kernel.reshape(list(reversed(conv_layer.get_weights()[0].shape)))
            kernel = kernel.transpose([2, 3, 1, 0])
            conv_layer.set_weights([kernel, bias])
        else:
            # Read only weights (no bias)
            kernel = weight_reader.read_bytes(np.prod(conv_layer.get_weights()[0].shape))
            kernel = kernel.reshape(list(reversed(conv_layer.get_weights()[0].shape)))
            kernel = kernel.transpose([2, 3, 1, 0])
            conv_layer.set_weights([kernel])
jzx-gooner commented 6 years ago

@rodrigo2019 @Axel13fr Thanks for sharing! I want to train a YOLO model with only one class. I have trained the model based on the Full Yolo and Tiny Yolo weights. The result:

I have three questions:

rodrigo2019 commented 6 years ago

@jzx-gooner, yes, you can achieve real-time performance on CPU. I have a network for car detection that runs at over 15 fps on hardware like the Raspberry Pi 3; on a Core i7 notebook it runs at 150~200 fps.

To do it, you must design a small backend with a small input size, like 50x50. But such a network will not be good at predicting small objects in the image.
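To make that concrete, a hypothetical tiny backend in Keras — not the fork's actual SuperTinyYolo, just an illustration of the shape of such a network (a 64x64 input is used so the pooling divides evenly):

    from keras.layers import Input, Conv2D, BatchNormalization, LeakyReLU, MaxPooling2D
    from keras.models import Model

    inp = Input(shape=(64, 64, 3))
    x = inp
    # Four small conv blocks: conv + batch norm + LeakyReLU + max pool.
    for i, filters in enumerate([8, 16, 32, 64], start=1):
        x = Conv2D(filters, (3, 3), padding='same', use_bias=False,
                   name='conv_%d' % i)(x)
        x = BatchNormalization(name='norm_%d' % i)(x)
        x = LeakyReLU(alpha=0.1)(x)
        x = MaxPooling2D(pool_size=(2, 2))(x)
    backend = Model(inp, x)  # 64x64 input -> 4x4 feature grid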

jzx-gooner commented 6 years ago

@jzx-gooner, yes, you can achieve real-time performance on CPU. I have a network for car detection that runs at over 15 fps on hardware like the Raspberry Pi 3; on a Core i7 notebook it runs at 150~200 fps.

To do it, you must design a small backend with a small input size, like 50x50. But such a network will not be good at predicting small objects in the image.

Thank you for your kind reply. I used your repository and the SuperTinyYolo backend you designed, and I trained the model from scratch. The model is small and fast; however, the accuracy is low and there are lots of wrong detections.

rodrigo2019 commented 6 years ago

I don't know how hard your dataset is; in my data all samples are very similar, and although I got some false positives, the network is working fine. Try adding some more filters in each layer.

martinbel commented 6 years ago

@rodrigo2019 @Axel13fr @experiencor I've extended Rodrigo's repo to support converting models to the Movidius NCS stick. Take a look if you have some time. Thanks for sharing your work! I thought it was simpler to make a separate project, as I'm not a git expert and didn't want to make a mess. Project: https://github.com/martinbel/yolo2NCS

rodrigo2019 commented 6 years ago

@martinbel really nice, I have a few questions about your work:

  1. How much faster does the Movidius stick run than the Raspberry Pi alone? Are you using the Pi 3?
  2. How good is the accuracy of your model (mAP)?

PS: I think you can improve your network; I would try using a smaller image like 128x128, decreasing the number of filters in the first layers, and increasing the filters in the last layers.

martinbel commented 6 years ago

Thanks!

1) If I run the same network with TensorFlow on the Raspberry Pi, it runs at ~2 fps. With the NCS it's running at ~20 fps; for its price tag it's really great. Unfortunately the compilation process is rather bizarre; I actually spent more time making it work with the NCS than training models.

I've actually tried a model (for pedestrian detection) based on the SuperTinyYolo backend and it worked better than the darknet-reference model I uploaded. I think as it's a smaller network it was just simpler to train.

mAP: With the darknet-reference I was getting ~0.14, the super-tiny-yolo model had 0.18. I've trained with 30k images from the openimages dataset and COCO. It seems a low value, but the results were good enough for my application.

I've tried a few things that didn't work well:

Some general comments regarding this repo and your fork, perhaps you have an idea how to solve them:

rodrigo2019 commented 6 years ago

I think there might be a bug in the decode_netout, where NMS is done but I'm not sure. I'm refactoring it a bit, trying to avoid making massive changes.

Probably; a few weeks ago I found a bug in the NMS code and fixed it. Maybe there are more bugs.

I wasn't able to make tensorboard work. Any ideas what could be the problem?

Are you using the back option in the config file? TensorBoard is working fine for me.

So far I haven't been able to get "great" results by even training a detector with one class. I'll give your network suggestion a try.

I was able to get an mAP higher than 70% for car detection, but I used a very specific dataset with a specific position for all samples; the COCO dataset is much harder than my dataset.

martinbel commented 6 years ago

I've been training a model for 1 class and I was able to get 0.85 mAP. It seems to be generalizing well too. Here is the setup: