When I train my dataset, after several epochs, the loss is NaN.

GOATmessi8 / ASFF

yolov3 with mobilenet v2 and ASFF

GNU General Public License v3.0

1.05k stars 216 forks

When I train my dataset, after several epochs, the loss is NaN. #66

Open John-Yao opened 4 years ago

John-Yao commented 4 years ago

I tried this repo to train on our dataset, which had already been trained successfully with CenterNet. However, every config I tried (e.g. removing ASFF, modifying the learning rate, loading COCO weights...) got NaN loss after several epochs. So far the only successful experiment uses the COCO weights without changing num_class to match our dataset, which indicates that the pretrained guided anchors or other parts of YoloHead are very important. Our dataset is very small (12 iterations per epoch), so I also tried raising the warmup to 20 epochs, but that got NaN loss as well. Could you provide some suggestions?

John-Yao commented 4 years ago

I changed the warmup to 50 epochs and training succeeded. I have some questions about how mixup is handled:

The box weight is only applied to obj_loss. Why is it not also applied to cls_loss and reg_loss?
How should mixup be used in Faster R-CNN? Should the weight be applied only to the RPN cls_loss?

I cannot figure out these details from the paper “Bag of Freebies for Training Object Detection Neural Networks” and would appreciate your reply!
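To make the question concrete, here is a minimal sketch of what a per-box mixup weight on the objectness loss could look like. It is an illustration only, not the code from this repo or from the paper: the function names, the flat-list image layout, and the Beta(1.5, 1.5) mixing ratio are all assumptions.

```python
import random


def mixup_detection(img_a, boxes_a, img_b, boxes_b, lam=None, rng=random):
    """Blend two same-sized images and merge their box lists.

    img_*:   flat lists of pixel values of equal length (hypothetical layout).
    boxes_*: lists of (x1, y1, x2, y2, class_id) tuples.
    Returns the mixed image and the boxes, each extended with a mixup weight.
    """
    if len(img_a) != len(img_b):
        raise ValueError("this sketch assumes equal-sized images")
    if lam is None:
        # Assumed mixing-ratio distribution; treat it as a tunable choice.
        lam = rng.betavariate(1.5, 1.5)
    mixed = [lam * pa + (1.0 - lam) * pb for pa, pb in zip(img_a, img_b)]
    # Boxes from image A carry weight lam, boxes from image B carry 1 - lam.
    weighted = [box + (lam,) for box in boxes_a]
    weighted += [box + (1.0 - lam,) for box in boxes_b]
    return mixed, weighted


def weighted_obj_loss(per_box_losses, weights):
    """Scale each box's objectness loss by its mixup weight and sum."""
    return sum(w * loss for w, loss in zip(weights, per_box_losses))
```

Applying the same weights to cls_loss and reg_loss would only require passing those per-box losses through the same kind of weighted sum; whether that helps is exactly the open question above.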

hhaAndroid commented 4 years ago

If the learning rate is set too large, the loss will become NaN. You can consider reducing the learning rate or using gradient clipping. I also had this problem.
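As a rough illustration of the gradient clipping idea, the sketch below rescales a list of gradients so their global L2 norm stays under a threshold, and fails early if the norm is already NaN or infinite. It is plain Python with hypothetical function names, not code from this repo; in a PyTorch training loop, `torch.nn.utils.clip_grad_norm_` does the same kind of rescaling on the model parameters.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their global L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total = math.sqrt(sum(g * g for g in grads))
    if not math.isfinite(total):
        # Stop here instead of letting NaN/inf reach the weights.
        raise FloatingPointError("non-finite gradient norm")
    # The small epsilon avoids division by zero for all-zero gradients.
    scale = min(1.0, max_norm / (total + 1e-6))
    return [g * scale for g in grads], total


def sgd_step(params, grads, lr, max_norm):
    """One plain SGD update using the clipped gradients."""
    clipped, _ = clip_by_global_norm(grads, max_norm)
    return [p - lr * g for p, g in zip(params, clipped)]
```

Logging the pre-clipping norm returned by `clip_by_global_norm` is a cheap way to see whether gradients blow up shortly before the loss turns NaN.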

LUOBO123LUOBO123 commented 4 years ago

Can you open your source?

John-Yao commented 4 years ago

@LUOBO123LUOBO123 I just modified the lines mentioned above! In detail, setting TRAIN.BURN_IN (in the cfg file) to 50 worked for me.
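For readers wondering why a longer burn-in matters on a tiny dataset, here is a small sketch of a warmup schedule. It assumes the burn-in is counted in epochs and that the learning rate ramps up polynomially, as in Darknet-style YOLO training; the function name, the `power=4.0` default, and the base learning rate are assumptions, not values read from this repo.

```python
def warmup_lr(iteration, base_lr, burn_in_epochs, iters_per_epoch, power=4.0):
    """Learning rate at a given iteration under polynomial warmup.

    Ramps from 0 to base_lr over burn_in_epochs * iters_per_epoch iterations,
    then stays at base_lr (any later decay is left out of this sketch).
    """
    burn_in_iters = burn_in_epochs * iters_per_epoch
    if burn_in_iters <= 0 or iteration >= burn_in_iters:
        return base_lr
    return base_lr * (iteration / burn_in_iters) ** power


# With 12 iterations per epoch, a 20-epoch burn-in ends after 240 iterations,
# while a 50-epoch burn-in stretches the ramp to 600 iterations.
for burn_in in (20, 50):
    print(burn_in, warmup_lr(240, 0.001, burn_in, 12))
```

Under these assumptions, at iteration 240 the 20-epoch schedule is already at the full learning rate, while the 50-epoch schedule is still far below it, which gives the freshly initialized head more small updates before large steps begin.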

LUOBO123LUOBO123 commented 4 years ago

Hi, I want to train on my own dataset. Could you put your source code on your GitHub? Thank you for your help.