david8862 / keras-YOLOv3-model-set

end-to-end YOLOv4/v3/v2 object detection pipeline, implemented on tf.keras with different technologies
MIT License

'loss: nan' error while training with standard yolo_loss #17

Open gillmac13 opened 4 years ago

gillmac13 commented 4 years ago

Hi David,

I just want to report a glitch in my experiments. I am training models (my own dataset: 27,000 annotations, 1 class) with the following command line:

python3 train.py --model_type yolo3_mobilenetv2_lite --annotation_file train.txt --val_annotation_file valid.txt --classes_path configs/my_yolo_class.txt --anchors_path=configs/yolo3_anchors.txt --save_eval_checkpoint --batch_size 16 --eval_online --eval_epoch_interval 3 --transfer_epoch 2 --freeze_level 1 --total_epoch 20

This is just one example; I tried half a dozen combinations of backbones and heads. Out of 10 trials, I only managed to reach epoch 20 twice. In the other cases, at some point (usually around epoch 4 to 9) I get a crash with this typical message:

705/1106 [==================>...........] - ETA: 7:54 - loss: 9.8939 - location_loss: 3.5176 - confidence_loss: 4.8495 - class_loss: 0.0014 Batch 705: Invalid loss, terminating training

706/1106 [==================>...........] - ETA: 7:52 - loss: nan - location_loss: nan - confidence_loss: nan - class_loss: nan
Traceback (most recent call last):
  File "train.py", line 252, in <module>

For the record, I work on Ubuntu 18.04 with TF 2.1, and I pulled the latest commits from your repo.

So I switched to 'use_diou_loss=True' and so far all is fine, with much better convergence than previously. This looks to be a very helpful addition!

Gilles

david8862 commented 4 years ago

Hi @gillmac13, thanks a lot. Actually, DIoU loss support is still experimental, and from the error log all 3 parts of the loss (loc, conf, cls) fall to NaN together, which is quite weird. One possible solution is to extend the transfer learning stage to get a more stable head first, and then move to the fine-tune stage. But anyway, it's awesome that the new loss helps training :)
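
For example, the command above could be rerun with a longer frozen transfer stage before the backbone is unfrozen (the epoch count below is only an illustration, not a tested recipe):

python3 train.py --model_type yolo3_mobilenetv2_lite --annotation_file train.txt --val_annotation_file valid.txt --classes_path configs/my_yolo_class.txt --anchors_path=configs/yolo3_anchors.txt --save_eval_checkpoint --batch_size 16 --eval_online --eval_epoch_interval 3 --transfer_epoch 5 --freeze_level 1 --total_epoch 20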

gillmac13 commented 4 years ago

OK, I tried that, and it seems that (on my small dataset) setting the number of transfer epochs to 5 is enough to get a working set of head parameters and avoid NaN loss issues. In fact the loss value kind of plateaus after 3 epochs of transfer training. However, using the new DIoU loss seems to produce a stable head as soon as the 2nd epoch and, as I was telling you, faster and better convergence in the full training phase.

david8862 commented 4 years ago

Got it. Fast convergence is one of the main highlights claimed for DIoU loss in the paper. Glad to see we can reproduce it in our implementation.

farhodbekshamsiyev commented 4 years ago

Hi, can you help me? I am also getting loss: nan at epoch 11 of 100 with the following command. How can I properly finish training without NaNs?

Error:

438/441 [============================>.] - ETA: 20s - loss: 25.9218 - location_loss: 5.7002 - confidence_loss: 12.1644 - class_loss: 2.0421 Batch 438: Invalid loss, terminating training
439/441 [============================>.] - ETA: 13s - loss: nan - location_loss: nan - confidence_loss: nan - class_loss: nan
Traceback (most recent call last):
  File "train.py", line 258, in ..............
  ..........
  ..........
  File "/home/f/Venv/py_3.7/lib/python3.7/site-packages/tensorflow_core/python/keras/callbacks.py", line 1055, in _get_file_path
    return self.filepath.format(epoch=epoch + 1, **logs)
KeyError: 'val_loss'

Command: python train.py --model_type=yolo4_mobilenet --anchors_path=configs/yolo4_anchors.txt --annotation_file=voc_train.txt --classes_path=configs/voc_classes.txt --model_image_size=416x416 --learning_rate=0.001 --transfer_epoch=10 --init_epoch=0 --total_epoch=100 --eval_online --eval_epoch_interval=5 --save_eval_checkpoint

Also, my environment is a Lenovo Y7000P (16 GB RAM, 6 GB GPU, i7 CPU) running Linux with CUDA 10.2 and TF 2.1.0.

Also, which set of loss values can we accept as appropriate (expected and normal)?

farhodbekshamsiyev commented 4 years ago

@david8862

What do you think about these NaNs?

tabmoo commented 4 years ago

I solved the problem of NaNs by inserting this into postprocess.py:

`box_xy = feats[..., :2]`
`box_wh = feats[..., 2:4]`
`box_xy = tf.where(box_xy < -10.0, -10.0, box_xy)`
`box_xy = tf.where(box_xy > 10.0, 10.0, box_xy)`
`box_wh = tf.where(box_wh < -8.0, -8.0, box_wh)`
`box_wh = tf.where(box_wh > 8.0, 8.0, box_wh)`
farhodbekshamsiyev commented 4 years ago

In which line did you put it?

tabmoo commented 4 years ago

At line 24.
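
For readers following along, here is a self-contained sketch of the same clamping idea using tf.clip_by_value. The channel layout (xy in [..., :2], wh in [..., 2:4]) follows the snippet above; the surrounding decode code in postprocess.py is not reproduced here, and the helper name is made up for illustration:

```python
import tensorflow as tf

def clamp_raw_box_feats(feats):
    """Clamp raw xy/wh predictions so the sigmoid/exp applied later in the
    decode step cannot overflow into Inf/NaN (illustrative helper only; the
    actual fix is the snippet pasted directly into postprocess.py above)."""
    box_xy = tf.clip_by_value(feats[..., :2], -10.0, 10.0)  # xy logits kept in [-10, 10]
    box_wh = tf.clip_by_value(feats[..., 2:4], -8.0, 8.0)   # wh logits kept in [-8, 8]
    rest = feats[..., 4:]                                    # objectness/class scores untouched
    return tf.concat([box_xy, box_wh, rest], axis=-1)
```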

david8862 commented 4 years ago

@farhodbekshamsiyev, generally there are 2 ways to avoid this issue. One is adding more transfer epochs to make the YOLO head stable before unfreezing the backbone. The other is trying the SGD optimizer (no momentum by default), which only follows the current gradient and is less likely to fall into NaN loss.
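
For reference, a minimal sketch of what such an optimizer looks like in tf.keras (in this repo it is normally selected via the --optimizer command-line option rather than built by hand; the learning rate here is only a placeholder):

```python
import tensorflow as tf

# Plain SGD with no momentum follows only the current gradient, so a single bad
# batch is less likely to trigger the runaway updates that adaptive optimizers
# (Adam/RMSprop) can produce right after the backbone is unfrozen.
sgd = tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.0, nesterov=False)
```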

farhodbekshamsiyev commented 4 years ago

Which set of loss values can we accept as appropriate (expected and normal)? For example: loss: 22.7161 - location_loss: 5.5168 - confidence_loss: 8.7224 - class_loss: 0.9932? I want to know the best result, and I will try to reach that point by iterating over and over; that's why I am asking.

david8862 commented 4 years ago

Hi @farhodbekshamsiyev, the loss value varies with many different factors (model type/loss type/object class number, etc.), so generally there is no universal "best loss". Actually the more important metric for detector training is mAP, which you can check with "--eval_online" and related options during training.

farhodbekshamsiyev commented 4 years ago

Thank you very much for your reply.

farhodbekshamsiyev commented 4 years ago

Hi @david8862, it's me again, @farhodbekshamsiyev.

During these epochs I got these results (evaluation outputs: mAP, Predicted_Objects_Info, Recall, Precision, Ground-Truth_Info).

Please share your ideas on how to reach epoch 100/100 without problems! Which batch size should I use to avoid the process being killed (8 maybe)? Which YOLOv4 model should I use? Should I use the --data_shuffle option?

Thanks in advance!

david8862 commented 4 years ago

@farhodbekshamsiyev, this repo now supports different backbones (MobileNetV1, MobileNetV3Large/Small, EfficientNet) for the YOLOv4 arch, and also Tiny & Lite options for lightweight model structures. Maybe you can try them for training to avoid memory exhaustion. "--data_shuffle" is just for cross-validation on the train/val samples.

farhodbekshamsiyev commented 4 years ago

Thank you very much!!! I will try all of them.

farhodbekshamsiyev commented 4 years ago

Hi @gillmac13, right now I am training a 4-class model on the 2007+2012 trainval VOC dataset, and it is the second day of training. Can you help me choose appropriate hyperparameters for training? The command I chose:

python train.py --model_type=yolo4_mobilenet --anchors_path=configs/yolo4_anchors.txt --annotation_file=voc_train.txt --classes_path=configs/voc_classes.txt --model_image_size=416x416 --learning_rate=0.001 --batch_size=32 --transfer_epoch=15 --optimizer=sgd --init_epoch=0 --total_epoch=75 --eval_online --eval_epoch_interval=5 --save_eval_checkpoint

It is at epoch 45/75 and the loss is not changing: "val_loss did not improve from 19.37943". What do you think? Should I stop training or wait a day for it to finish?

gillmac13 commented 4 years ago

Hi @farhodbekshamsiyev

It has been a while, and I haven't tried the newest YOLOv4 version, but the model type which worked best for me (1-class underwater object recognition) was clearly yolo3_spp. And since I needed speed and compactness, I found mobilenetv2_lite very effective. Out of 7 different combinations of backbones and YOLO versions, this choice is the clear winner (for my application). Since YOLOv4 also uses the SPP feature, I suppose it must be as good or better. I intend to train the yolo4 + mobilenetv3 combo very soon. Anyway, this is what I did (please note that the batch size of 16 is dictated by my GPU, which has 8 GB of memory). In what follows, my_yolo_class.txt has only 1 class, and my images are all 416x416x3.

Precautions: before launching a training command line, and to avoid crashes...
1. I switched to diou_loss (and nms_diou) by setting "use_diou_loss" to true in yolo3/loss.py, around line 230 (?).
2. In common/utils.py, around line 20 (?), I changed memory_limit to 7000 (see the sketch below).
3. I disabled mixed precision training at the beginning of train.py.
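
For item 2, this is roughly what such a per-GPU memory cap looks like with the TF 2.x API; the exact code in common/utils.py may be organized differently, and 7000 MB is simply the value that fit my 8 GB card:

```python
import tensorflow as tf

# Cap TensorFlow's GPU memory allocation (~7000 MB on an 8 GB card) so training
# does not exhaust the device and crash mid-epoch.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=7000)])
```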

Command line: $ python3 train.py --model_type yolo3_mobilenetv2_lite_spp --annotation_file train.txt --val_annotation_file valid.txt --classes_path configs/my_yolo_class.txt --anchors_path=configs/yolo3_anchors.txt --batch_size 16 --transfer_epoch 4 --freeze_level 1 --total_epoch 40 --optimizer rmsprop --decay_type cosine

Also note that I switched from Adam to RMSprop. I am not sure which of the changes from the standard training setup really helped, but since it worked for me, I am happy! With 27,000 images in my dataset, I found that after 40 epochs the total loss did not change anymore, so I stopped training at that point. The model's results on my test set were pretty good, so I suppose I did it right.

I hope it helps...

farhodbekshamsiyev commented 4 years ago

@gillmac13 Thanks for your reply. You saved me time and gave useful information about training! Thank you very much!

mvanlierBCG commented 4 years ago

@david8862 @gillmac13 Training with one class, Adam lr=0.001, decay_type=None, freeze_level=1, and the default YOLOv3 loss, I already get NaN after the 4th epoch of transfer training. Do you have any suggestions to avoid NaNs in the transfer training?

david8862 commented 4 years ago

@mvanlierBCG what's your model type? For large models (like the Xception/EfficientNet family) you can try the SGD optimizer with lr=0.01/0.005 to get quick convergence in the transfer stage and avoid NaN loss.
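
For example, a variant of the command posted earlier in this thread with those settings (the model type, dataset files, and epoch counts are just placeholders to adapt to your own setup):

python train.py --model_type=yolo4_mobilenet --anchors_path=configs/yolo4_anchors.txt --annotation_file=voc_train.txt --classes_path=configs/voc_classes.txt --model_image_size=416x416 --optimizer=sgd --learning_rate=0.01 --transfer_epoch=15 --init_epoch=0 --total_epoch=100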

yakhyo commented 3 years ago

I am having the same error almost one and a half years later, @david8862. Is there an exact reason for this kind of gradient explosion? I think this NaN comes from exploding gradients, doesn't it? It would be great if you could share your experience with it. Actually, I am using YOLOv1.