experiencor / keras-yolo2

Easy training on custom dataset. Various backends (MobileNet and SqueezeNet) supported. A YOLO demo to detect raccoons, running entirely in the browser, is accessible at https://git.io/vF7vI (not on Windows).
MIT License

NaN loss in the middle of training #149

Open anvenkat09 opened 6 years ago

anvenkat09 commented 6 years ago

Hi Experiencor,

Wonderful implementation! I just had a couple of questions:

I have written my own feature-extractor CNN similar to ResNet and have been using it in place of the Full-Yolo classifier in your step-by-step notebook. After about 140 epochs of training, the loss becomes NaN. I came up with a couple of options; could you please recommend what I should do?
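
One stopgap I'm considering (just a sketch using standard Keras callbacks, nothing from your repo) is to bail out as soon as the loss goes NaN and keep the last finite-loss checkpoint around:

```python
# Sketch with standard Keras callbacks (not from this repo): stop as soon as
# the loss becomes NaN, and keep the last finite-loss weights for inspection.
from keras.callbacks import ModelCheckpoint, TerminateOnNaN

nan_guard_callbacks = [
    TerminateOnNaN(),                     # aborts training the moment loss turns NaN/inf
    ModelCheckpoint('backup_weights.h5',  # example file name
                    monitor='loss',
                    save_best_only=True), # keeps the weights with the best finite loss so far
]
# passed alongside the existing callbacks, e.g.
# model.fit_generator(train_gen, ..., callbacks=existing_callbacks + nan_guard_callbacks)
```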

Thanks so much!

experiencor commented 6 years ago

@anvenkat09 still got this problem?

11mhg commented 6 years ago

Hey Experiencor,

I'm actually getting a similar problem, but with the recall. On certain images the recall is simply 0, which drags down the average recall until, after a while, it becomes 0 as well. I read somewhere that I should look into clipnorm or clipvalue to keep the gradients from exploding, and run a grid search to find good values for them.
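
Concretely, something like this is what I had in mind (just a sketch assuming the Keras Adam optimizer; the 1.0 values are placeholders I'd tune with the grid search):

```python
# Sketch: gradient clipping via the standard Keras optimizer arguments.
from keras.optimizers import Adam

optimizer = Adam(lr=1e-4, clipnorm=1.0)     # rescale gradients whose L2 norm exceeds 1.0
# optimizer = Adam(lr=1e-4, clipvalue=1.0)  # or clamp each gradient element to [-1.0, 1.0]
# model.compile(loss=custom_loss, optimizer=optimizer)
```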

What do you think? Am I doing something wrong?

Thanks :)

anvenkat09 commented 6 years ago

@experiencor No, I fixed it. I don't remember exactly how, but it's not an issue anymore. It had something to do with my pretrained weights.

experiencor commented 6 years ago

@11mhg Recall approaching 0 is another problem. It normally happens during warm-up, but not actual training. Please refer to the readme for tips.

11mhg commented 6 years ago

Hey @experiencor, I followed the readme, but I'm still seeing this problem. I let the net do its warm-up and then ran the actual training, and the recall never moved from 0.

experiencor commented 6 years ago

@11mhg What dataset did you use? You can try the whole thing again, as you may have missed something in the first run; e.g., you need to load the warmed-up weights before the actual training process.
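
For example, something along these lines (a sketch; the file name is just an example, and the plain Keras load_weights call works if you're driving the model directly):

```python
import os

# Sketch (file name is an example): explicitly load the warmed-up weights
# before kicking off the real, non-warmup training run.
def load_warmup_weights(model, weight_path='warmup_weights.h5'):
    if not os.path.exists(weight_path):
        raise IOError('warmed-up weights not found: ' + weight_path)
    model.load_weights(weight_path)  # plain Keras call on the underlying model

# load_warmup_weights(yolo_model)  # then train with warm-up disabled (0 warm-up epochs)
```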

11mhg commented 6 years ago

Hey! Thanks for the quick replies.

I'm using the COCO dataset with just a small number of labels chosen. When I warm it up, it starts at a decent recall and goes down to zero. I then load the warmed-up weights and do the actual training, and the recall starts at zero and never goes up. I'm currently training again with some modified parameters, so I'll let you know how that goes, but if you happen to have insight on COCO, that would be great!

Thanks again!

experiencor commented 6 years ago

@11mhg The warmup training looks right to me, but the actual training is odd. You may try to train the detector for just one class to see how it goes.

11mhg commented 6 years ago

So what I'm finding is that the recall converges quickly to zero during warm-up, but when I do the actual training it increases very slowly. For the one label "person", after 56 or so epochs I get an average recall of 0.0085. What average recall should I expect?

EDIT: For the record, I'm using the 2017 dataset and adapted the annotation parsing to use the cocoapi.
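
Roughly, my adapted parser does something like this (a sketch using pycocotools; the per-image dict layout with filename/width/height and an 'object' list of xmin/ymin/xmax/ymax boxes is my assumption about what the repo's VOC parser returns):

```python
# Sketch of a COCO-based annotation parser using pycocotools.
from pycocotools.coco import COCO

def parse_coco_annotations(ann_file, img_dir, labels=('person',)):
    coco = COCO(ann_file)
    cat_ids = coco.getCatIds(catNms=list(labels))
    id_to_name = {c['id']: c['name'] for c in coco.loadCats(cat_ids)}

    all_imgs = []
    # note: with several labels, getImgIds(catIds=...) returns only images that
    # contain *all* of them; union the per-label queries if that matters
    for img in coco.loadImgs(coco.getImgIds(catIds=cat_ids)):
        ann_ids = coco.getAnnIds(imgIds=img['id'], catIds=cat_ids, iscrowd=None)
        objects = []
        for ann in coco.loadAnns(ann_ids):
            x, y, w, h = ann['bbox']  # COCO boxes are [x, y, width, height]
            objects.append({'name': id_to_name[ann['category_id']],
                            'xmin': int(x), 'ymin': int(y),
                            'xmax': int(x + w), 'ymax': int(y + h)})
        all_imgs.append({'filename': img_dir + img['file_name'],
                         'width': img['width'], 'height': img['height'],
                         'object': objects})
    return all_imgs
```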

experiencor commented 6 years ago

@11mhg Current recall (an estimate of mAP) should be more than 0.3 for good detections in my experience. I assume that you have carefully checked the labels of the images.

11mhg commented 6 years ago

Yes, I've verified the labels and run a battery of tests on the annotation parsing to make sure it is correct. I'll try an older COCO dataset to see what happens.

11mhg commented 6 years ago

Okay, what I've found is that by simply starting from the darknet weights trained on COCO 2014, I managed to fine-tune on the COCO 2017 dataset and reach a recall of around 0.34!

ghost commented 6 years ago

@anvenkat09 Can you tell me what you did? I have the same problem with ResNet.

tamersalama commented 6 years ago

Trying to update the NaN-related issues: what worked for me was adding images and annotations to "valid_image_folder" (I previously relied on having the training set split 80/20 as per the readme, but I got NaN losses). I also changed the training nb_epochs from 1 to 10, and will likely need more. Actually, it might have to do with the model anchors: generating new ones (other than the ones given in the README example) led to the NaN values.