loss: NaN while training with my own dataset

noisewm commented 5 years ago

Hello,

I'm trying to migrate from qqwweee's YOLO implementation, but training with my own dataset results in loss: NaN (with qqwweee's code I can train just fine on the same dataset). I have already implemented basic null checks, and my dataset goes through automated checks before training: every image is present, every box is checked by its coordinates, and every class is present in the classes file.

I have already tried different batch sizes, with the same effect. The point at which the loss becomes NaN seems random, although it is almost always within the 1st epoch.

I think the problem is somehow related to the data generator... Maybe you can suggest a proper way to debug it?

image-size: [416, 416]
batch-size:
  bottlenecks: 8
  head: 48
  # the unfreeze model takes more memory
  full: 8
epochs:
  bottlenecks: 25
  head: 50
  full: 30
CB_learning-rate:
  factor: 0.01
  patience: 3
CB_stopping:
  min_delta: 0
  patience: 25
valid-split: 0.1
generator:
  augment: true
  resize_img: true
  nb_threads: 0.9
recompute-bottlenecks: false

python scripts/training.py --path_dataset train_annotations.txt --path_weights model_data\yolo_weights.h5 --path_anchors model_data/yolo_anchors.csv --path_classes model_data/custom_classes.txt --path_output logs/003 --path_config model_data/train_yolo.yaml

Borda commented 5 years ago

Hello, this is probably not enough information to reproduce your problem... Can you try running the following code, which is tested: https://circleci.com/gh/Borda/keras-yolo3/265 ?

noisewm commented 5 years ago

Thx, training works just fine with VOC2007.

It seems that the problem is in my dataset. I will try to implement a custom callback and look into the whole batch that results in the NaN loss.
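One way to look into the offending batch without changing the training loop is to wrap the data generator and check every batch before it reaches the model. The sketch below is only an illustration of that idea: `first_non_finite` and `checked_batches` are hypothetical helpers, not part of keras-yolo3, and the generator is assumed to yield nested lists, tuples, dicts, or arrays that expose `tolist()` (as NumPy arrays do).

```python
import math


def first_non_finite(values, path=()):
    """Return the index path of the first NaN/inf in a nested batch, else None."""
    # Arrays exposing tolist() (e.g. NumPy) are converted to plain nested lists.
    if hasattr(values, "tolist"):
        values = values.tolist()
    if isinstance(values, dict):
        items = values.items()
    elif isinstance(values, (list, tuple)):
        items = enumerate(values)
    else:
        # Leaf: report it only when it is a non-finite number.
        if isinstance(values, (int, float)) and not math.isfinite(values):
            return path
        return None
    for key, item in items:
        found = first_non_finite(item, path + (key,))
        if found is not None:
            return found
    return None


def checked_batches(generator):
    """Yield batches unchanged, raising as soon as one contains NaN or inf."""
    for index, batch in enumerate(generator):
        bad = first_non_finite(batch)
        if bad is not None:
            raise ValueError(f"batch {index} has a non-finite value at {bad}")
        yield batch
```

Passing `checked_batches(train_generator)` to training instead of the raw generator (where `train_generator` stands for whatever generator the script builds) would stop at the first bad batch and report where the value sits. If every batch is finite and the loss still becomes NaN, the inputs are probably not the direct cause, and the boxes themselves are worth inspecting next.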

Borda commented 5 years ago

You do not need a custom callback; you can just run a debugger and place a breakpoint at the end of the generation of an augmented image with its bounding boxes...
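Since the NaN appears at a random point, a conditional breakpoint is usually more practical than stopping on every sample. The following is a minimal sketch of that approach, not code from the repository: `find_bad_boxes` and `debug_sample` are hypothetical names, and the box layout `(x_min, y_min, x_max, y_max, class_id)` in pixels and the `(width, height)` image size are assumptions.

```python
def find_bad_boxes(boxes, image_size):
    """List (index, reason) pairs for boxes that look invalid after augmentation.

    Assumes each box is (x_min, y_min, x_max, y_max, class_id) in pixels and
    image_size is (width, height); both conventions are assumptions.
    """
    width, height = image_size
    problems = []
    for index, (x_min, y_min, x_max, y_max, _class_id) in enumerate(boxes):
        coords = (x_min, y_min, x_max, y_max)
        if any(c != c for c in coords):  # NaN is the only value unequal to itself
            problems.append((index, "NaN coordinate"))
        elif x_max <= x_min or y_max <= y_min:
            problems.append((index, "zero or negative size"))
        elif x_min < 0 or y_min < 0 or x_max > width or y_max > height:
            problems.append((index, "outside the image"))
    return problems


def debug_sample(boxes, image_size=(416, 416)):
    """Drop into the debugger only when a sample looks suspicious."""
    problems = find_bad_boxes(boxes, image_size)
    if problems:
        print("suspicious boxes:", problems)
        breakpoint()  # inspect the image, boxes and augmentation parameters here
    return problems
```

Calling `debug_sample(boxes)` at the end of the augmentation step would open the debugger only for samples with a degenerate or out-of-range box. If the generator pads the box array with all-zero rows, those rows would need to be filtered out first, otherwise they are reported as zero-size boxes.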

Borda commented 5 years ago

@noisewm I believe this is solved for now, but if you have any further questions, feel free to reopen this issue...

Borda / keras-yolo3, issue #25