experiencor / keras-yolo2

Easy training on custom dataset. Various backends (MobileNet and SqueezeNet) supported. A YOLO demo to detect raccoons, running entirely in the browser, is accessible at https://git.io/vF7vI (not on Windows).

The model is automatically killing itself after 3 epochs #370

Open vdsprakash opened 5 years ago

vdsprakash commented 5 years ago

I am getting a total loss of about 10.032443234 at the start, and after 3 epochs training is automatically killed.

System Information:

Processor: Intel Xeon, RAM: 8 GB, Graphics: NVIDIA GTX 1060 6 GB, Python 3.6

robertlugg commented 5 years ago

The training will end when it can no longer improve. Please include the message (4-5 lines) towards the end of your output. It will give a clue as to the problem. But bottom line is that training has failed.

danFromTelAviv commented 5 years ago

I also get something similar. I am using Python 3.6 and TensorFlow + Keras (almost the latest versions of each). I am trying to train SqueezeNet on the raccoon dataset (I split it so that images 1-40 are my validation set). I had to change the config file a bit for it to run:

{
    "model" : {
        "backend":              "SqueezeNet",
        "input_size":           416,
        "anchors":              [0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434, 7.88282, 3.52778, 9.77052, 9.16828],
        "max_box_per_image":    10,
        "labels":               ["raccoon"]
    },

    "train": {
        "train_image_folder":   "path/to/raccoon_dataset-master/raccoon_dataset-master/images/train/",
        "train_annot_folder":   "path/to/raccoon_dataset-master/raccoon_dataset-master/annotations/train/",

        "train_times":          10,
        "pretrained_weights":   "",
        "batch_size":           16,
        "learning_rate":        1e-4,
        "nb_epochs":             50,
        "warmup_epochs":        3,
        "object_scale":         5.0 ,
        "no_object_scale":      1.0,
        "coord_scale":          1.0,
        "class_scale":          1.0,
        "saved_weights_name":   "racoon_detector.h5",

        "debug":                true
    },

    "valid": {
        "valid_image_folder":   "path/to/raccoon_dataset-master/raccoon_dataset-master/images/val/",
        "valid_annot_folder":   "path/to/raccoon_dataset-master/raccoon_dataset-master/annotations/val/",

        "valid_times":          1
    }
}

1) The recall only decreases during the first 3 (warmup) iterations. A typical printout of the loss at this stage:

Loss XY [2.52920581e-05] Loss WH [0.0361119919] Loss Conf [0.12189471] Loss Class [0] Total Loss [10.1580324]

I am not sure how it gets to a total loss of 10? (A possible explanation is sketched after the error output below.) 2) On the first real epoch there are warnings for almost all batches and finally a failure:

  if all(x == 1 for x in tup) and isinstance(A, _nx.ndarray):
SystemError: error return without exception set
Exception ignored in: <generator object tile.. at 0x000001B577C54E60>
Traceback (most recent call last):
  File "C:\Users\dan\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\lib\shape_base.py", line 1140, in
    if all(x == 1 for x in tup) and isinstance(A, _nx.ndarray):
SystemError: error return without exception set
raccoon 0.0000
mAP: 0.0000
Exception ignored in: <async_generator object _ag at 0x000001B39081CAD8>
Traceback (most recent call last):
  File "C:\Users\dan\AppData\Local\Programs\Python\Python36\lib\types.py", line 27, in _ag
SystemError: error return without exception set

Process finished with exit code -1
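
Regarding point 1 above: the printed total (~10.16) is a constant ~10 higher than the sum of the four individual terms (~0.16), which looks like a fixed offset applied during the warmup phase rather than a problem with the loss terms themselves. Below is a minimal sketch of that idea, assuming a custom loss that adds a constant while the batch counter is still inside the warmup window; the function name, arguments, and the 10.0 value are illustrative assumptions, not confirmed from the repository's code.

import tensorflow as tf

def total_loss_with_warmup(loss_xy, loss_wh, loss_conf, loss_class,
                           seen, warmup_batches, warmup_offset=10.0):
    # Hypothetical helper: while still inside the warmup window, a constant
    # offset is added on top of the summed loss terms, which would explain a
    # printed total of ~10.16 when the four components only sum to ~0.16.
    base = loss_xy + loss_wh + loss_conf + loss_class
    return tf.cond(tf.less(seen, warmup_batches + 1),
                   lambda: base + warmup_offset,  # warmup phase
                   lambda: base)                  # normal training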

akhil451 commented 5 years ago

I am also getting an error along the same lines. Please share if anyone has figured out a solution to this. my config file:

{
    "model" : {
        "backend":              "Tiny Yolo",
        "input_size":           416,
        "anchors":              [0.38,0.29, 0.54,0.47, 0.78,0.58, 0.97,0.85, 1.94,1.83],
        "max_box_per_image":    10,        
        "labels":               ["Plate"]
    },

    "train": {
        "train_image_folder":   "train_img/",
        "train_annot_folder":   "annotations/",     

        "train_times":          8,
        "pretrained_weights":   "",
        "batch_size":           16,
        "learning_rate":        1e-4,
        "nb_epochs":            500,
        "warmup_epochs":        3,

        "object_scale":         5.0 ,
        "no_object_scale":      1.0,
        "coord_scale":          1.0,
        "class_scale":          1.0,

        "saved_weights_name":   "weights.h5",
        "debug":                true
    },

    "valid": {
        "valid_image_folder":   "",
        "valid_annot_folder":   "",

        "valid_times":          1
    }
}

the error:

Epoch 00001: val_loss improved from inf to 10.00044, saving model to weights.h5
Traceback (most recent call last):
  File "train.py", line 101, in <module>
    _main_(args)
  File "train.py", line 97, in _main_
    debug              = config['train']['debug'])
  File "/home/sasuke/Downloads/keras-yolo2-master/frontend.py", line 401, in train
    max_queue_size   = 8)      
  File "/home/sasuke/anaconda3/envs/yolov2/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/sasuke/anaconda3/envs/yolov2/lib/python3.6/site-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/home/sasuke/anaconda3/envs/yolov2/lib/python3.6/site-packages/keras/engine/training_generator.py", line 251, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/home/sasuke/anaconda3/envs/yolov2/lib/python3.6/site-packages/keras/callbacks.py", line 79, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/home/sasuke/Downloads/keras-yolo2-master/frontend.py", line 36, in on_epoch_end
    clear_output(wait=True)
NameError: name 'clear_output' is not defined
terminate called without an active exception
terminate called recursively
Aborted (core dumped)
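
The NameError above comes from the custom callback in frontend.py calling clear_output(wait=True) without that name being defined; clear_output is normally provided by IPython, so it is missing when train.py runs as a plain script. A minimal workaround sketch, assuming you are willing to patch frontend.py yourself (one possible fix, not an official one):

# Guard the import near the top of frontend.py: clear_output comes from
# IPython, so fall back to a no-op when running outside a notebook.
try:
    from IPython.display import clear_output
except ImportError:
    def clear_output(wait=False):
        pass  # not running inside IPython/Jupyter, nothing to clear
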
akhil451 commented 5 years ago

Seems like early stopping was the cause. You can either increase the value of patience (number of epochs) or remove that argument altogether.
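
For reference, a minimal sketch of what that change could look like where the callbacks are built; the variable names (early_stop, checkpoint, tensorboard) and the patience value are illustrative assumptions, not the repository's exact code.

from keras.callbacks import EarlyStopping

# Either raise patience well above the warmup epochs (during which val_loss
# barely improves), or leave this callback out of the callbacks list entirely.
early_stop = EarlyStopping(monitor='val_loss',
                           min_delta=0.001,
                           patience=10,
                           mode='min',
                           verbose=1)

# model.fit_generator(..., callbacks=[checkpoint, tensorboard])              # early stopping removed
# model.fit_generator(..., callbacks=[early_stop, checkpoint, tensorboard])  # or kept, with larger patience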