AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

RE: Average Loss accelerates to infinity #44

Closed gurmeetsidhu closed 7 years ago

gurmeetsidhu commented 7 years ago

I don't know why, but the training doesn't seem to be working well.

The loss is gradually descending, approaching a low value of about 1.4, but then it accelerates up to very large numbers, 100,000-1,000,000. And mind you, I am only at 400 iterations, nowhere close to the suggested approx. 5000 for training 5 classes.

Thank you so much for any advice.

PS: I also set random=1, with the same result.

AlexeyAB commented 7 years ago
  1. Did you use this fork of Darknet for Windows with the latest commits?
  2. Check your dataset using Yolo_mark - are all bounding boxes correct? https://github.com/AlexeyAB/Yolo_mark
  3. Show your changes in the cfg-file, and the command line that you used for training.
gurmeetsidhu commented 7 years ago

Thanks for the prompt reply,

  1. Yes, I used this fork and had the latest commits as of March 4th, 2017.
  2. I pulled ImageNet downloads for oranges, bananas, apples, water bottles, and mugs, along with bounding boxes. I went in and labelled a few bananas, but everything seems labelled. I heard you mention that every time an object appears in an image it has to be labelled; ImageNet doesn't do a good job of that, so do I have to go in and check all 4000 images myself?
  3. Here is the command I used to train (I just reorganized my cfgs and weights because it was getting cluttered):

```
darknet.exe detector train data/obj.data cfgs/yolo-obj.cfg weights/darknet19_448.conv.23
```

And finally, my edits to the cfg:

```
[convolutional]
size=1
stride=1
pad=1
filters=50
activation=linear

[region]
anchors = 1.08,1.19, 3.42,4.41, 6.63,11.38, 9.42,5.11, 16.62,10.52
bias_match=1
classes=5
coords=4
num=5
softmax=1
jitter=.2
rescore=1
object_scale=5
noobject_scale=1
class_scale=1
coord_scale=1
absolute=1
thresh = .6
random=1
```

Note the changes to random, classes, and filters. That's all I really changed.
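
For reference, filters=50 here matches the YOLOv2 region-layer rule from the repo README: the convolutional layer just before [region] must output num * (classes + coords + 1) filters. A quick arithmetic check in Python:

```python
# YOLOv2 region-layer rule: filters = num * (classes + coords + 1)
num, classes, coords = 5, 5, 4       # values from the cfg above
print(num * (classes + coords + 1))  # -> 50, matching filters=50
```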

Thanks

AlexeyAB commented 7 years ago
  1. Commits from Mar 14 or Mar 4?
  2. Yes, you should go in and check all 4000 images by pressing SPACE in Yolo_mark.
  3. The .cfg looks all right.
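
For anyone re-checking labels by hand: each image has a matching .txt file, one line per object, in the format `<object-class> <x_center> <y_center> <width> <height>`, where all four coordinates are relative to the image width/height (0.0-1.0). An illustrative example (the numbers are made up):

```
0 0.512 0.433 0.210 0.370
3 0.250 0.700 0.120 0.150
```

Every visible object needs a line like this; an unlabeled object gets trained as background.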
gurmeetsidhu commented 7 years ago
  1. Mar 4th. What's the best way to re-fork?
  2. Okay, will do; I'm going to get my buddies to help me.
  3. Okay, thanks. Also, is it a good idea to perhaps decrease the learning rate? I'm worried it's kind of zigzagging and throwing itself off or something (see the snippet below) ...
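
(If you do experiment with that, the knob is learning_rate in the [net] section of the cfg; the value below is illustrative, not a recommendation:)

```
[net]
learning_rate=0.0001   # illustrative; try halving it to test the zigzag theory
```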
AlexeyAB commented 7 years ago
  1. Update your Yolo v2 from this fork. There was a critical bug fix, "Fixed training with rand_s()", on 14 March.
gurmeetsidhu commented 7 years ago

Okay, I will re-fork and relabel the images, and get back to you on whether it works.

Also, I tried running the imagenet1k dataset and it reported an out-of-memory error. How can I avoid this issue?

AlexeyAB commented 7 years ago

Try increasing subdivisions to 64 in your cfg-file.
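
For context: darknet loads batch/subdivisions images onto the GPU at a time, so a larger subdivisions value reduces VRAM usage without changing the effective batch size. In the [net] section of the cfg (the batch value here is illustrative):

```
[net]
batch=64
subdivisions=64   # 64/64 = 1 image per GPU step -> minimal VRAM use
```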

gurmeetsidhu commented 7 years ago

Thank you so much, that fixed the issue of the loss heading off to infinity. But how long should it take to reach an average loss of 0.06? Right now it just dips down to 0.12, goes back up to 0.16, and shuffles around there at 3000 iterations.

AlexeyAB commented 7 years ago

It need not necessarily reach 0.06. avg_loss just has to stop noticeably decreasing; then you can stop training.

gurmeetsidhu commented 7 years ago

Okay, so it trained down to an average loss of approx. 0.14, but when I run the detector, even at -thresh 0.01, it outputs nothing...

AlexeyAB commented 7 years ago

```
darknet.exe detector train data/obj.data tiny-yolo-obj.cfg darknet19_448.conv.23
```

tiny-yolo-obj.cfg diff:

image

```
darknet.exe detector test data/obj.data tiny-yolo-obj.cfg tiny-yolo-obj_3000.weights -thresh 0.02
```

image

AlexeyAB commented 7 years ago

@gurmeetsidhu

  1. What GPU do you use?
  2. Did you compile Yolo with the cuDNN library? (See the note below.)
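
(For reference, cuDNN support is a compile-time option: the Linux Makefile exposes it as build flags, while the Windows build sets a CUDNN preprocessor define in Visual Studio, per the repo README. Makefile sketch:)

```
GPU=1     # build with CUDA
CUDNN=1   # build against the cuDNN library
```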
gurmeetsidhu commented 7 years ago
  1. I am using a GTX-980M
  2. Yes I did compile with cuDNN libraries

Also, if I run it with a threshold of 0, I get results back, which makes me believe that there is something gravely wrong with how it was trained. I don't understand what's going on. Here is a sample input image:

image

Also here is how I have my train.txt and folder setup:

image

All images are returned with a confidence of 0%, compared to your Camaro's 68%:

image

And here is the end of the training cycle:

image

Hopefully that gives you some clarity as to what's going on here ...

gurmeetsidhu commented 7 years ago

Okay, I will attempt to retrain it with tiny-yolo and see if the issue persists.

AlexeyAB commented 7 years ago

> And here is the end of the training cycle:

Something went wrong there, so you got a CUDA error at the end of training.

gurmeetsidhu commented 7 years ago

I have no idea why that error pops up. Is there any way I can look at a log and perhaps share the results? It doesn't seem to hinder training beforehand, and I've sometimes gotten it on a demo; restarting seems to fix it. My guess is that some other usage of the graphics card is leading to a chain reaction.

gurmeetsidhu commented 7 years ago

Have you changed your yolo.c file, by any chance, to something like this before/after you train, to help identify the Camaro?

```c
char *voc_names[] = {"stopsign", "yeildsign"};
image voc_labels[CLASSNUM];
```

Here is the original line:

```c
char *voc_names[] = {"aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"};
```

AlexeyAB commented 7 years ago

@gurmeetsidhu

No, that was required for the old version, Yolo v1: https://github.com/AlexeyAB/yolo-windows

Now, in Yolo v2, all object names are described in obj.names, which is referenced from obj.data: https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects
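
For example, following that README, an obj.data for a 5-class setup would look roughly like this (paths are illustrative):

```
classes = 5
train = data/train.txt
valid = data/test.txt
names = data/obj.names
backup = backup/
```

and data/obj.names lists one class name per line, in the same order as the class IDs used in the label files (names here are just a guess at the five classes in question):

```
orange
banana
apple
waterbottle
mug
```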

gurmeetsidhu commented 7 years ago

ughhh ... I don't know why these CUDA errors are occurring, and I can't seem to find a log to debug them ... image Have you managed to get Yolo-9000 to work? Perhaps I can figure out how to label only the items I want. At this point, I just don't know how to train my dataset effectively. Could you send me your Camaro/Lambo dataset so I can try training on that and see if the error persists?

AlexeyAB commented 7 years ago

@gurmeetsidhu

Unfortunately, I cannot share my dataset - it's private.

But you can use the standard Pascal Voc dataset: https://github.com/AlexeyAB/darknet#how-to-train-pascal-voc-data

MyVanitar commented 7 years ago

Let me give you my two cents.

There is definitely something wrong with your dataset or the file paths. Try to use the folder names as Alexey described in the instructions.

You should also avoid too-big and too-small images in your dataset. If you are training at 416x416, then slightly bigger than this size is good - something around 450x450 to 600x600.

Does your system become choppy and very slow during training? Then make sure you have at least 8 GB of RAM, or check whether your images are quite big; try to close other software during training.

Also, do not install any GPU driver except the one that comes with CUDA. I mean, just install the latest CUDA and leave it at that.
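
A quick way to audit image sizes before training - a minimal Python sketch, assuming Pillow is installed and the images sit in data/obj (both assumptions; adjust to your layout):

```python
from pathlib import Path

from PIL import Image  # pip install pillow

# Flag images outside the suggested ~450x450 to ~600x600 range.
for path in sorted(Path("data/obj").glob("*.jpg")):
    with Image.open(path) as im:
        w, h = im.size
    if min(w, h) < 450 or max(w, h) > 600:
        print(f"out of range: {path} ({w}x{h})")
```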

gurmeetsidhu commented 7 years ago

Okay, Vanitar,

In the back of my head I had a feeling it was the ImageNet dataset. I've eliminated images that fall below 450x450 pixels or above 600x600. I had some images that were 1080p, so it's quite likely they led to some RAM issues.

I have closed other apps when I run it, and some crashes were associated with me opening Chrome to check a video, so perhaps that is again the issue. I do have 8 GB of RAM, and my folder structure is the same as Alexey's.

Thanks for your advice. I'm currently trying it with a cleansed dataset that is approx. half the size - 100 images now per category.

gurmeetsidhu commented 7 years ago

All right, so it settled at 0.24 average loss. Quite high ... and when I ran detection it found nothing.

When I lower the threshold to 0.05, I get a few pickups, and they're all oranges; it looks like this ...

image

I think, because of eliminating all those images, I was working with a dataset of only 200 images to differentiate 5 classes. It doesn't seem like I'm going to be able to get this to work without spending significant time finding another 900 or so images per class and labeling them.

MyVanitar commented 7 years ago

@gurmeetsidhu

Yes, in my case I had one class, with 200 images for training and 30 for validation. The tips I gave were to solve your errors; if you want high accuracy, that's another topic.

gurmeetsidhu commented 7 years ago

Yes, thank you very much, Hesam. I guess this issue is resolved; if I have time for my project, I will try to gather a larger dataset and see if that works.