YunYang1994 / tensorflow-yolov3

🔥 TensorFlow Code for technical report: "YOLOv3: An Incremental Improvement"
https://yunyang1994.gitee.io/2018/12/28/YOLOv3-算法的一点理解/
MIT License

Train loss: nan Test loss: nan Saving #488

Open juanmanuelrq opened 4 years ago

juanmanuelrq commented 4 years ago

Hi, I was training and the loss went to nan:

```
=> Epoch: 977 Time: 2020-03-03 10:50:58 Train loss: nan Test loss: nan Saving ./checkpoint/yolov3_test_loss=nan.ckpt ...
0it [00:00, ?it/s]
=> Epoch: 978 Time: 2020-03-03 10:51:11 Train loss: nan Test loss: nan Saving ./checkpoint/yolov3_test_loss=nan.ckpt ...
0it [00:00, ?it/s]
=> Epoch: 979 Time: 2020-03-03 10:51:30 Train loss: nan Test loss: nan Saving ./checkpoint/yolov3_test_loss=nan.ckpt ...
0it [00:00, ?it/s]
=> Epoch: 980 Time: 2020-03-03 10:51:48 Train loss: nan Test loss: nan Saving ./checkpoint/yolov3_test_loss=nan.ckpt ...
0it [00:00, ?it/s]
=> Epoch: 981 Time: 2020-03-03 10:52:03 Train loss: nan Test loss: nan Saving ./checkpoint/yolov3_test_loss=nan.ckpt ...
```

My config.py file:

```python
#! /usr/bin/env python
# coding=utf-8
#================================================================
#   Copyright (C) 2019 * Ltd. All rights reserved.
#
#   Editor      : VIM
#   File name   : config.py
#   Author      : YunYang1994
#   Created date: 2019-02-28 13:06:54
#   Description :
#
#================================================================

from easydict import EasyDict as edict

__C = edict()
# Consumers can get config by: from config import cfg

cfg = __C

# YOLO options
__C.YOLO = edict()

# Set the class name
__C.YOLO.CLASSES = "./data/classes/class.names"
__C.YOLO.ANCHORS = "./data/anchors/basline_anchors.txt"
__C.YOLO.MOVING_AVE_DECAY = 0.9995
__C.YOLO.STRIDES = [8, 16, 32]
__C.YOLO.ANCHOR_PER_SCALE = 3
__C.YOLO.IOU_LOSS_THRESH = 0.5
__C.YOLO.UPSAMPLE_METHOD = "resize"
__C.YOLO.ORIGINAL_WEIGHT = "./checkpoint/yolov3_coco.ckpt"
__C.YOLO.DEMO_WEIGHT = "./checkpoint/yolov3_coco_demo.ckpt"

# Train options
__C.TRAIN = edict()

__C.TRAIN.ANNOT_PATH = "./data/dataset/visdrone_train.txt"
__C.TRAIN.BATCH_SIZE = 6
__C.TRAIN.INPUT_SIZE = [320, 352, 384, 416, 448, 480, 512, 544, 576, 608]
__C.TRAIN.DATA_AUG = True
__C.TRAIN.LEARN_RATE_INIT = 1e-4
__C.TRAIN.LEARN_RATE_END = 1e-6
__C.TRAIN.WARMUP_EPOCHS = 2
__C.TRAIN.FISRT_STAGE_EPOCHS = 20
__C.TRAIN.SECOND_STAGE_EPOCHS = 20000
__C.TRAIN.INITIAL_WEIGHT = "./checkpoint/yolov3_coco_demo.ckpt"

# TEST options
__C.TEST = edict()

__C.TEST.ANNOT_PATH = "./data/dataset/visdrone_test.txt"
__C.TEST.BATCH_SIZE = 2
__C.TEST.INPUT_SIZE = 544
__C.TEST.DATA_AUG = False
__C.TEST.WRITE_IMAGE = True
__C.TEST.WRITE_IMAGE_PATH = "./data/detection/"
__C.TEST.WRITE_IMAGE_SHOW_LABEL = True
__C.TEST.WEIGHT_FILE = "./checkpoint/yolov3_test_loss=9.2099.ckpt-5"
__C.TEST.SHOW_LABEL = True
__C.TEST.SCORE_THRESHOLD = 0.3
__C.TEST.IOU_THRESHOLD = 0.45
```

qncsn2016 commented 4 years ago

Maybe try reducing the learning rate first. If that doesn't work, look for errors in your code and dataset.
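One thing that may help when looking for errors: the log in the first post prints `0it [00:00, ?it/s]` for every epoch, which is what tqdm shows when it iterates over nothing. If the epoch loss is the mean of the per-step losses (an assumption about the training loop, not something confirmed in this thread), then an epoch with zero steps would report nan regardless of the learning rate. A minimal sketch of a guard that separates the two cases; `summarize_epoch` is a hypothetical helper, not part of the repo:

```python
import math


def summarize_epoch(step_losses):
    """Average per-step losses, failing loudly instead of returning nan.

    Hypothetical helper: `step_losses` is a plain list of Python floats
    collected during one epoch.
    """
    if not step_losses:
        # An empty list means the data loader yielded no batches at all,
        # so the problem is in the dataset, not in the optimizer.
        raise ValueError("epoch ran zero steps: the dataset produced no batches")
    for step, loss in enumerate(step_losses):
        if not math.isfinite(loss):
            # The loss diverged during training; report where it first happened.
            raise FloatingPointError(f"loss is {loss} at step {step}")
    return sum(step_losses) / len(step_losses)
```

A `ValueError` here points at the annotation files or paths, while a `FloatingPointError` points at divergence, where lowering the learning rate is the more likely fix.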

llmpass commented 4 years ago

@juanmanuelrq Have you solved this problem? I'm training on the VOC dataset and I get test loss = nan, but the train loss looks reasonable.

Theriyadh commented 4 years ago

This indicates a problem with your train txt file. What format are you using? It should be `Filepath x1,y1,x2,y2`, with no header line. @llmpass @juanmanuelrq
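A quick way to check the file is to parse every line before training. The sketch below assumes the layout from the repo's README as I understand it, `image_path x_min,y_min,x_max,y_max,class_id ...`, which carries a class index as a fifth value per box; if your files really have only four values per box, change the expected field count. `check_annotations` is a hypothetical helper, not something the repo provides.

```python
def check_annotations(lines):
    """Return (usable_line_count, problems) for annotation lines.

    Assumed format per line (an assumption, adjust if yours differs):
        image_path x_min,y_min,x_max,y_max,class_id [more boxes ...]
    `problems` is a list of (line_number, message) tuples.
    """
    usable = 0
    problems = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        parts = line.split()
        if len(parts) < 2:
            problems.append((lineno, "no boxes after the image path"))
            continue
        line_ok = True
        for box in parts[1:]:
            fields = box.split(",")
            if len(fields) != 5:
                problems.append((lineno, f"expected 5 values, got {len(fields)}: {box}"))
                line_ok = False
                continue
            try:
                x1, y1, x2, y2 = (float(v) for v in fields[:4])
                int(fields[4])  # class index must be an integer
            except ValueError:
                problems.append((lineno, f"non-numeric value in box: {box}"))
                line_ok = False
                continue
            if x2 <= x1 or y2 <= y1:
                # Zero or negative area boxes can plausibly lead to a
                # division by zero in an IoU-based loss.
                problems.append((lineno, f"degenerate box: {box}"))
                line_ok = False
        if line_ok:
            usable += 1
    return usable, problems
```

Run it with something like `check_annotations(open(path).read().splitlines())` on both the train and test files. If `usable` comes back as 0, no batches can be built from that file, which would be consistent with a nan loss on that split only.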

MC1016 commented 4 years ago

> This indicates a problem with your train txt file. What format are you using? It should be `Filepath x1,y1,x2,y2`, with no header line. @llmpass @juanmanuelrq

I also got "Train loss: nan Test loss: nan". It happened at the very beginning of training, but the format of my train.txt is the same as you described.

MuhammadAsadJaved commented 4 years ago

@juanmanuelrq Have you resolved the issue?

I have the same problem.

all_model_checkpoint_paths: "Pedestrian_yolov3_loss=6.2686-nan.ckpt-1"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=6.2071-nan.ckpt-2"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=6.1809-nan.ckpt-3"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=6.1537-nan.ckpt-4"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=6.1885-nan.ckpt-5"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=6.1779-nan.ckpt-6"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=nan-nan.ckpt-7"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=nan-nan.ckpt-8"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=nan-nan.ckpt-9"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=nan-nan.ckpt-10"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=nan-nan.ckpt-11"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=nan-nan.ckpt-12"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=nan-nan.ckpt-13"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=nan-nan.ckpt-14"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=nan-nan.ckpt-15"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=nan-nan.ckpt-16"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=nan-nan.ckpt-17"

I am training a single class, and the dataset has about 7000 images.

qncsn2016 commented 4 years ago

I read the following issues and solved the problem:

https://github.com/YunYang1994/tensorflow-yolov3/issues/294
https://github.com/YunYang1994/tensorflow-yolov3/issues/350
https://github.com/YunYang1994/tensorflow-yolov3/issues/170
https://github.com/YunYang1994/tensorflow-yolov3/issues/149

MuhammadAsadJaved commented 4 years ago

@qncsn2016 Thank you so much.

yjsdut commented 3 years ago

> @juanmanuelrq Have you solved this problem? I'm training on the VOC dataset and I get test loss = nan, but the train loss looks reasonable.

Hello, I ran into the same problem. I wanted to train on the VOC dataset from scratch instead of starting from the COCO pre-trained weights, and the test loss was nan from the first epoch of retraining. How did you solve the problem? Thank you very much.