experiencor / keras-yolo2

Easy training on custom datasets. Various backends (MobileNet and SqueezeNet) supported. A YOLO demo that detects raccoons, running entirely in the browser, is accessible at https://git.io/vF7vI (not on Windows).
MIT License

training own data #17

Closed Kevin-Moon closed 7 years ago

Kevin-Moon commented 7 years ago

I am trying to train on my own data, i.e. scene text (road sign) detection.

Q1) I have 600 annotated images, but when I train with batch size = 16, there are ONLY 3~4 STEPS per epoch.

See the report below.

What's wrong here? I expected 600/16 ≈ 37 steps per epoch.

Q2) I wonder why it doesn't work well when I don't use pre-trained weights. The loss just explodes or oscillates meaninglessly.

Thank you very much in advance! :)

Epoch 1/50
1/3 [======>.......................] - ETA: 78s - loss: 52996.2266
2/3 [==============>...............] - ETA: 46s - loss: 26586.9480
3/3 [======================>.......] - ETA: 20s - loss: 17939.5313Epoch 00000: val_loss improved from inf to 69.81903, saving model to best_weights.h5

4/3 [===============================] - 383s - loss: 13474.0479 - val_loss: 69.8190
Epoch 2/50
1/3 [======>.......................] - ETA: 64s - loss: 110.6660
2/3 [==============>...............] - ETA: 41s - loss: 110.1645
3/3 [======================>.......] - ETA: 18s - loss: 120.9267Epoch 00001: val_loss did not improve

4/3 [===============================] - 390s - loss: 134.6366 - val_loss: 176.3403
Epoch 3/50
1/3 [======>.......................] - ETA: 70s - loss: 140.6994
2/3 [==============>...............] - ETA: 44s - loss: 131.1704
3/3 [======================>.......] - ETA: 19s - loss: 135.7138Epoch 00002: val_loss did not improve

4/3 [===============================] - 385s - loss: 155.2880 - val_loss: 157.5658
Epoch 4/50
1/3 [======>.......................] - ETA: 65s - loss: 79.9112
2/3 [==============>...............] - ETA: 42s - loss: 43.6893
3/3 [======================>.......] - ETA: 18s - loss: 31.8849Epoch 00003: val_loss improved from 69.81903 to 21.53916, saving model to best_weights.h5

4/3 [===============================] - 383s - loss: 33.7253 - val_loss: 21.5392
Epoch 5/50
1/3 [======>.......................] - ETA: 65s - loss: 15.6504
2/3 [==============>...............] - ETA: 41s - loss: 11.6136
3/3 [======================>.......] - ETA: 18s - loss: 9.0862
experiencor commented 7 years ago

Oops! The small number of steps per epoch was due to the fact that I mixed up the training set and the validation set in the previous release. It's fixed in the latest version.

I'd guess that it's very hard to learn from scratch using just 600 images. Your network contains many millions of parameters, and these parameters must be pre-trained on a large number of images to learn general features of visual objects. Re-training from pre-trained parameters then means learning new combinations of those learned features to detect new objects.
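
In Keras terms, the idea looks roughly like this (an illustrative sketch with made-up layer names; the repo's actual loading logic may differ):

    # Sketch of fine-tuning from pre-trained features (hypothetical layer
    # names; not the repo's exact code).
    from keras.models import Model
    from keras.layers import Input, Conv2D

    nb_box, nb_class = 5, 1  # e.g. 5 anchors, 1 class

    inputs = Input(shape=(416, 416, 3))
    # ... the pre-trained feature extractor would sit here ...
    features = Conv2D(1024, (3, 3), padding='same', name='conv_22')(inputs)
    # new, randomly initialized detection head sized for the new task
    outputs = Conv2D(nb_box * (4 + 1 + nb_class), (1, 1), padding='same',
                     name='conv_23')(features)

    model = Model(inputs, outputs)
    # load pre-trained weights into layers whose names match the file;
    # the new head keeps its random initialization
    model.load_weights('full_yolo_features.h5', by_name=True)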

PS: there's actually a meme about it, https://www.facebook.com/photo.php?fbid=1578743938842788&set=gm.1640513226012467&type=3&theater&ifg=1

Kevin-Moon commented 7 years ago

Aha! Thanks for your quick fix! :)

I will try again.

I have two small questions. 1) In other common implementations of YOLOv2, I have seen a "subdivision" parameter, which determines the number of images to process at one time. Is there an equivalent in yours?

2) Is there any reference explaining the parameters below (the role or effect of each) in detail?

Thank you very much again for sharing.

        "train_times":          10,
        "pretrained_weights":   "",
        "batch_size":           16,
        "learning_rate":        1e-4,
        "nb_epoch":             50,
        "warmup_batches":       100,

        "object_scale":         5.0 ,
        "no_object_scale":      1.0,
        "coord_scale":          1.0,
        "class_scale":          1.0
Kevin-Moon commented 7 years ago

It works! Thanks.

But there was a 'debug' variable problem ("cannot find argument 'debug'").

I just deleted the new 'debug' variable. Is that okay?

I think the problem comes from the config file (the JSON doesn't have a debug parameter), but I don't know exactly why.

It is training now, as shown below. It looks fine, but what does the "DEBUG" notification mean?

Does it just print out [loss_xy, loss_wh, loss_conf, loss_class, loss, current_recall, total_recall / seen]?

And how is the figure of 301 steps calculated? I have only 600 images and batch size = 16.

Thank you! ^^

Epoch 1/50
2017-10-07 11:37:39.229935: I tensorflow/core/kernels/logging_ops.cc:79] DEBUG[0.0023333447][1.0431395][0.12933666][1.1362211][2.3110304][0.19540229][0.19540229]
  1/301 [..............................] - ETA: 8704s - loss: 2.31102017-10-07 11:38:02.560623: I tensorflow/core/kernels/logging_ops.cc:79] DEBUG[0.0020245272][0.82074636][0.12540188][1.079408][2.0275807][0.275][0.23520115]

  2/301 [..............................] - ETA: 7925s - loss: 2.16932017-10-07 11:38:26.660540: I tensorflow/core/kernels/logging_ops.cc:79] DEBUG[0.0015974327][
experiencor commented 7 years ago

By "subdivision", do you mean a batch?

I have added the meaning of the parameters in config.json to the readme.
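
Roughly speaking, the four *_scale values weight the terms of the YOLO loss, something like this (a simplified sketch, not the exact custom_loss code):

    # Simplified sketch of how the *_scale config values weight the YOLO
    # loss terms (the real custom_loss also handles masks, warmup, etc.):
    def combine_losses(loss_xy, loss_wh, loss_conf_obj, loss_conf_noobj, loss_class,
                       coord_scale=1.0, object_scale=5.0,
                       no_object_scale=1.0, class_scale=1.0):
        return (coord_scale     * (loss_xy + loss_wh)    # box centre and size errors
              + object_scale    * loss_conf_obj          # confidence where objects exist
              + no_object_scale * loss_conf_noobj        # confidence where they don't
              + class_scale     * loss_class)            # classification error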

The missing 'debug' parameter was a bug; I've fixed it. It's just a flag to turn on/off printing of the current losses and recall.
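
The DEBUG[...] lines in your log come from a TensorFlow print op attached inside the loss; the pattern looks roughly like this (a sketch, not the exact code):

    # Sketch of how the DEBUG output is produced (TF 1.x): tf.Print attaches
    # a side effect that logs the listed tensors each time the loss is
    # evaluated. Not the repo's exact code.
    import tensorflow as tf

    def attach_debug(loss, loss_xy, loss_wh, loss_conf, loss_class,
                     current_recall, total_recall, debug=True):
        if debug:
            loss = tf.Print(loss,
                            [loss_xy, loss_wh, loss_conf, loss_class, loss,
                             current_recall, total_recall],
                            message='DEBUG', summarize=1000)
        return loss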

The formula is: 600 × 0.8 (80% for training, the remaining 20% for validation) / 16 (batch size) × 10 (train_times, the number of times to cycle through the training set, useful for small datasets) = 300.
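In code, with the numbers from this thread:

    # Steps per epoch, per the formula above:
    n_images, train_split, batch_size, train_times = 600, 0.8, 16, 10
    steps_per_epoch = int(n_images * train_split / batch_size * train_times)
    print(steps_per_epoch)  # 300 -- matching the ~301 steps in your log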

Kevin-Moon commented 7 years ago

Thanks a lot!! Forget about subdivision :)

My training images and annotations are not of size 448 or 416 (they are 640×480 instead). Is that okay? I especially worry about the annotation coordinates.

Edit: the "# fix object's position and size" part of preprocessing.py is what adjusts the coordinates from a non-standard size to the standard one (448), right? :)

Also, don't I need to update the anchors to detect things like characters? What is the role of the anchors?

Finally, how long does it take to train one class with the full YOLO backend? In my case it takes 1000 seconds per epoch on a 1080 Ti.

Thanks for your kind project.

experiencor commented 7 years ago

All images will be resized to the standard size of (416, 416). As a result, the original bounding box positions need to be adjusted accordingly. This is achieved by the "# fix object's position and size" section.
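
The coordinate fix amounts to scaling each box by the resize ratios, roughly like this (an illustrative sketch, not the exact preprocessing.py code):

    # When an image is resized to the 416x416 network input, the box
    # coordinates scale by the same width/height ratios.
    def rescale_box(xmin, ymin, xmax, ymax, orig_w, orig_h, net_w=416, net_h=416):
        sx, sy = float(net_w) / orig_w, float(net_h) / orig_h
        return xmin * sx, ymin * sy, xmax * sx, ymax * sy

    # e.g. a box on a 640x480 image mapped onto the 416x416 input
    print(rescale_box(100, 50, 300, 200, 640, 480))  # (65.0, 43.33..., 195.0, 173.33...)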

In most cases, I find the default anchors are fine. But when the objects are much smaller relative to the image than usual, the anchor sizes should be scaled down, and vice versa in the opposite case.
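
For example, scaling the whole anchor list down by a constant factor (the anchor values below are the common YOLOv2 defaults, and 0.5 is just an assumed factor):

    # Scaling the anchors down for small objects (illustrative):
    default_anchors = [0.57273, 0.677385, 1.87446, 2.06253, 3.33843,
                       5.47434, 7.88282, 3.52778, 9.77052, 9.16828]
    small_object_anchors = [round(a * 0.5, 5) for a in default_anchors]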

All things equal, training time depends only on the number of images, not the number of classes. E.g., for the COCO dataset (83K images), it takes me 3 hours per epoch. For the Raccoon dataset (200 images), it takes around 3 minutes (with train_times = 10). Note the train_times parameter, which defines the number of times to cycle through the training set within one epoch; I find it useful for small datasets like the Raccoon one.

Kevin-Moon commented 7 years ago

3 hours per epoch! I see. Thanks for all your advice. :)

zafarmah92 commented 7 years ago

Hi, I am trying to re-run the algorithm over the same images it was trained on, but I'm receiving an error. I tried to debug it but couldn't pinpoint the main cause. Is it because of an old version of Keras?

The details are below:

python predict.py -c config.json -w tiny_yolo_features.h5 -i racoon-001.jpg

ValueError: You are trying to load a weight file containing 16 layers into a model with 17 layers.

Can you kindly explain why this is happening?

Kevin-Moon commented 7 years ago

I faced the same error and just changed the initializer to 'he_normal', since it seemed to be only an initialization setting.

At 6:39 AM, "zfar Mahmood" notifications@github.com wrote:


python predict.py -c config.json -w full_yolo_features.h5 -i racoon-001.jpg

Using TensorFlow backend.

Traceback (most recent call last):
  File "predict.py", line 72, in <module>
    main(args)
  File "predict.py", line 50, in main
    anchors=config['model']['anchors'])
  File "/media/zfar/media files/CarProject/githubYoloClone/basic-yolo-keras/models.py", line 171, in __init__
    x = Conv2D(self.nb_box * (4 + 1 + self.nb_class), (1,1), strides=(1,1), padding='same', name='conv_23', kernel_initializer='lecun_normal')(x)
  File "/home/zfar/anaconda3/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/home/zfar/anaconda3/lib/python3.6/site-packages/keras/layers/convolutional.py", line 455, in __init__
    **kwargs)
  File "/home/zfar/anaconda3/lib/python3.6/site-packages/keras/layers/convolutional.py", line 110, in __init__
    self.kernel_initializer = initializers.get(kernel_initializer)
  File "/home/zfar/anaconda3/lib/python3.6/site-packages/keras/initializers.py", line 463, in get
    return deserialize(config)
  File "/home/zfar/anaconda3/lib/python3.6/site-packages/keras/initializers.py", line 455, in deserialize
    printable_module_name='initializer')
  File "/home/zfar/anaconda3/lib/python3.6/site-packages/keras/utils/generic_utils.py", line 133, in deserialize_keras_object
    ': ' + class_name)
ValueError: Unknown initializer: lecun_normal


experiencor commented 7 years ago

@zfar- each of the ***_features.h5 files contains the pre-trained weight parameters of all the layers except the last one. That's why you see one layer missing when you try to load the file directly. These files are loaded automatically by the model into all layers except the last one before training.

I do it this way because the size of the last layer depends on the number of classes you want to train on. However, it creates a lot of confusion, so I will edit the readme to include this piece of information.
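
Concretely, the width of that last layer follows from the anchor and class counts (an illustration with assumed values, matching the Conv2D call in the traceback above):

    # Why the last layer can't be shipped in ***_features.h5: its width
    # depends on the class count (illustrative values for a 1-class model):
    nb_box, nb_class = 5, 1                           # 5 anchor boxes, 1 class
    last_layer_filters = nb_box * (4 + 1 + nb_class)  # 4 coords + 1 objectness + classes
    print(last_layer_filters)                         # 30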

zafarmah92 commented 7 years ago

@experiencor Thanks a lot, this will really clear things up, not just for me but also for others who like your work. Indeed, I was getting close, but I couldn't figure out where to edit the layer to get it working.