Training started with nan values

bennycheung / Food100_YOLO_Tools

Python tools and configuration files for Food100 dataset DarkNet YOLO training


Training started with nan values #2

Closed by getsanjeev 5 years ago

getsanjeev commented 5 years ago

mask_scale: Using default '1.000000'
Loading weights from darknet19_448.conv.23...Done!
Learning Rate: 0.001, Momentum: 0.9, Decay: 0.0005
Resizing 544
Loaded: 0.228742 seconds
Region Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.434100, Avg Recall: -nan, count: 0
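A note on reading the last line of this log: in darknet's region layer the averaged figures are divided by count, the number of ground-truth boxes matched in that pass, so count: 0 gives 0/0 = nan for each of them. That can point to label files not being found or read, rather than to the network itself, although whether that applies to this run is an assumption. The sketch below parses such a line and flags that case; the function names are hypothetical helpers and not part of darknet or this repository.

```python
import math
import re

# Line copied from the log above.
LINE = ("Region Avg IOU: -nan, Class: -nan, Obj: -nan, "
        "No Obj: 0.434100, Avg Recall: -nan, count: 0")


def parse_region_line(line):
    """Parse one 'Region Avg IOU' log line into a {name: float} dict."""
    metrics = {}
    # Each field looks like "Name: value", where value is a number or (-)nan.
    for key, value in re.findall(r"([A-Za-z ]+):\s*(-?nan|-?[0-9.]+)", line):
        metrics[key.strip()] = float(value)
    return metrics


def nan_fields(metrics):
    """Return the names of all fields whose value is nan."""
    return [name for name, value in metrics.items() if math.isnan(value)]


def no_ground_truth(metrics):
    """True when the pass saw no ground-truth boxes (count == 0)."""
    return metrics.get("count") == 0


metrics = parse_region_line(LINE)
if no_ground_truth(metrics):
    print("count is 0; nan in:", ", ".join(nan_fields(metrics)))
```

If count stays at 0 on every line, checking that the label files exist and are non-empty is a reasonable first step before changing learning parameters.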

getsanjeev commented 5 years ago

@bennycheung Could you please spare some time to look at this?

bennycheung commented 5 years ago

Unfortunately, with so little information I cannot guess the cause of your problem. Are you training with your own dataset and your own configuration files? This could be the exploding gradient problem: https://machinelearningmastery.com/exploding-gradients-in-neural-networks/

getsanjeev commented 5 years ago

I am working with your data and have followed the README:

downloaded the WECFOOD100 dataset only
generated the labels for each image using the provided script (food100_generate_bbox_file.py)
split the data using the provided script (food100_split_for_yolo.py)
downloaded the weights from the given link
started the training

Does your script number the classes from 0 (not 1), as darknet expects?

I am running it on a Linux machine with a GTX 1060 (6 GB of graphics card RAM). The training starts with nan values.

getsanjeev commented 5 years ago

@bennycheung I see you have already taken care of the class numbering: for class 1, the label contains 0. So that should not be the issue.
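The class numbering question can be checked mechanically. Below is a minimal sketch, assuming the usual darknet label format of one object per line as "class x_center y_center width height", with a zero-based class index and coordinates normalized to the range 0 to 1. The function name and the default of 100 classes (taken from the Food100 dataset name) are assumptions for illustration.

```python
def check_label_line(line, num_classes=100):
    """Return a list of problems found in one darknet label line."""
    parts = line.split()
    if len(parts) != 5:
        return ["expected 5 fields, got %d" % len(parts)]
    try:
        class_id = int(parts[0])
        coords = [float(p) for p in parts[1:]]
    except ValueError:
        return ["non-numeric field"]

    problems = []
    # darknet expects class ids in 0 .. num_classes - 1.
    if not 0 <= class_id < num_classes:
        problems.append("class id %d out of range" % class_id)
    # Box centre and size are fractions of the image width and height.
    for name, value in zip(("x", "y", "w", "h"), coords):
        if not 0.0 <= value <= 1.0:
            problems.append("%s=%r outside [0, 1]" % (name, value))
    return problems


def check_label_file(path, num_classes=100):
    """Check every non-empty line of a label file; return (line_no, problems) pairs."""
    report = []
    with open(path) as handle:
        for line_no, line in enumerate(handle, start=1):
            if line.strip():
                problems = check_label_line(line, num_classes)
                if problems:
                    report.append((line_no, problems))
    return report


print(check_label_line("0 0.5 0.5 0.2 0.3"))    # a well-formed line
print(check_label_line("100 0.5 0.5 0.2 0.3"))  # class id one past the end
```

Running check_label_file over every generated label file would also reveal empty or missing files, which is worth ruling out when the log reports count: 0.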

bennycheung commented 5 years ago

Thanks for the additional info! Did you let it run a little longer, and do the nan values go away? The other possibility is that your graphics card has too little RAM. You may need to turn down the batch size so that training does not exhaust the GPU memory.

getsanjeev commented 5 years ago

Yes, I tried a batch size of 16 with 4 subdivisions and got the same result. Should I let it run for a long time? There might also be an issue with the GPU: I don't think it is using GPU memory. Let me check. Thanks!
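For reference on the batch and subdivisions settings: darknet divides batch by subdivisions, so batch 16 with 4 subdivisions means 4 images per forward pass on the GPU. The sketch below reads the two values from the [net] section of a cfg file and computes that figure. The sample cfg text and the function name are hypothetical, not taken from this repository.

```python
# Hypothetical cfg excerpt using the values mentioned in this thread.
SAMPLE_CFG = """
[net]
# training settings
batch=16
subdivisions=4
momentum=0.9

[convolutional]
batch_normalize=1
"""


def images_per_pass(cfg_text):
    """Return batch // subdivisions as set in the [net] section of a darknet cfg."""
    batch, subdivisions = 1, 1  # assumed defaults when a key is absent
    in_net = False
    for raw in cfg_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.startswith("["):
            in_net = line in ("[net]", "[network]")
            continue
        if in_net and "=" in line:
            key, value = (s.strip() for s in line.split("=", 1))
            if key == "batch":
                batch = int(value)
            elif key == "subdivisions":
                subdivisions = int(value)
    return batch // subdivisions


print(images_per_pass(SAMPLE_CFG))
```

Since GPU memory use follows the images per pass, raising subdivisions lowers memory use without changing the effective batch size. With very few images per pass, an occasional pass with count: 0 is more likely, so isolated nan lines are less alarming than nan on every line.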