Loss avg seem to not decrease any more

bao-O commented 5 years ago

I'm training a yolov3-tiny model for face detection. I followed the steps described in this repo, but after 9000 iterations I realized that the avg loss was abnormal: it stayed around 4. This is what I got when running ./darknet detector map with yolov3-tiny_9000.weights:

detections_count = 295396, unique_truth_count = 119121  
class_id = 0, name = face, ap = 8.59%        (TP = 11753, FP = 19020) 

 for conf_thresh = 0.25, precision = 0.38, recall = 0.10, F1-score = 0.16 
 for conf_thresh = 0.25, TP = 11753, FP = 19020, FN = 107368, average IoU = 25.94 % 

 IoU threshold = 50 %, used Area-Under-Curve for each unique Recall 
 mean average precision (mAP@0.50) = 0.085882, or 8.59 %
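
For reference, the precision, recall, F1 and FN figures in this output follow directly from the TP, FP and unique_truth_count values. A minimal Python sketch that reproduces them (the function name `detection_metrics` is illustrative and not part of darknet):

```python
def detection_metrics(tp, fp, truth_count):
    """Precision, recall and F1 from raw detection counts at one confidence threshold."""
    fn = truth_count - tp  # ground-truth boxes that were never matched
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / truth_count if truth_count else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"fn": fn, "precision": precision, "recall": recall, "f1": f1}


# Counts copied from the map output above (conf_thresh = 0.25).
m = detection_metrics(tp=11753, fp=19020, truth_count=119121)
print(m["fn"], round(m["precision"], 2), round(m["recall"], 2), round(m["f1"], 2))
```

A recall of about 0.10 means that roughly nine in ten labelled faces are missed at this threshold, so the low AP comes mostly from missed objects rather than from false alarms.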

My config file - yolov3-tiny.cfg:

[net]
# Testing
#batch=1
#subdivisions=1
# Training
batch=32
subdivisions=2
width=416
height=416
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.001
burn_in=1000
max_batches = 500200
policy=steps
steps=400000,450000
scales=.1,.1

[convolutional]
batch_normalize=1
filters=16
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=1

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

###########

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=18
activation=linear

[yolo]
mask = 3,4,5
anchors =  3,  5,   6, 11,  11, 21,  22, 37,  46, 75, 122,175
classes=1
num=6
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

[route]
layers = -4

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=2

[route]
layers = -1, 8

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=18
activation=linear

[yolo]
mask = 0,1,2
anchors =  3,  5,   6, 11,  11, 21,  22, 37,  46, 75, 122,175
classes=1
num=6
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

Those anchors are what I got from running ./darknet detector calc_anchors data/voc.data -num_of_clusters 6 -width 416 -height 416

Is this underfitting? Should I train for more iterations, modify my .cfg file, or choose another pretrained model and config? Thanks in advance.
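
One quick sanity check on a config like the one above is that the filters value in the [convolutional] layer before each [yolo] layer equals (classes + 5) times the number of mask entries, and that each mask picks the intended anchors. A rough Python sketch of that check (the helper names are made up for illustration; the values are copied from the posted .cfg):

```python
def expected_yolo_filters(classes, mask):
    # Each anchor in the mask predicts 4 box coordinates, 1 objectness score
    # and one score per class.
    return (classes + 5) * len(mask)


def parse_anchors(text):
    # Turn "w0,h0, w1,h1, ..." into a list of (width, height) pairs.
    values = [int(v) for v in text.split(",")]
    return list(zip(values[0::2], values[1::2]))


anchors = parse_anchors("3,  5,   6, 11,  11, 21,  22, 37,  46, 75, 122,175")

# Masks copied from the two [yolo] layers in the posted config.
for mask in ([3, 4, 5], [0, 1, 2]):
    selected = [anchors[i] for i in mask]
    print(mask, selected, "filters =", expected_yolo_filters(1, mask))
```

Both layers come out at 18 filters, which matches the posted config, so the filters setting itself does not look like the cause of the stalled loss.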

bao-O commented 5 years ago

Here's my loss chart. I'm very confused by it.

LukeAI commented 5 years ago

how large is your dataset? I've found that with very large, very challenging datasets, the loss just doesn't really converge beyond a certain point.

bao-O commented 5 years ago

I used the WIDER FACE dataset, with a training set of 12880 images and a validation set of 3200 images. Is the problem in the dataset?
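
To put the iteration counts in this thread in perspective, here is a small Python sketch that converts iterations into passes over the training set. It assumes one iteration consumes `batch` images (batch=32 in the posted .cfg); the function name is illustrative only:

```python
def epochs_seen(iterations, batch, train_images):
    # Total images processed divided by the size of the training set.
    return iterations * batch / train_images


print(round(epochs_seen(9000, 32, 12880), 1))   # passes after 9000 iterations
print(round(epochs_seen(50000, 32, 12880), 1))  # passes after 50k iterations
```

Under that assumption, 9000 iterations is a little over 22 passes through the 12880 training images.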

LukeAI commented 5 years ago

not a problem - just a similar result to me - you have a challenging dataset.

bao-O commented 5 years ago

Was your final model good enough? I guess I'll train up to 50k iterations; if there is no improvement, I'll switch to yolov3.

LukeAI commented 5 years ago

yeah it was ok - how come ap isn't being plotted on your chart? what command did you run to train with? It's a lot easier to see what's going on when you can see the progress of ap as it trains.

LukeAI commented 5 years ago

btw yolov3-spp is better value than yolov3

bao-O commented 5 years ago

Oh, I didn't use the -map flag in my training command. When I train with the -map flag, I get this error: "CUDA error: out of memory". I'll try to use it next time.

bao-O commented 5 years ago

Hey @LukeAI, thanks for your responses to my questions. Would you mind if I ask another one? I already have a trained yolov3 model, plus the yolov3-tiny model that I'm training now. I know yolov3 is slow on CPU, and on my computer it is too slow: the FPS when detecting faces in a video is always very low (< 10 fps). Can we speed that up, or must we use yolov3-tiny instead?

LukeAI commented 5 years ago

If you are getting "CUDA error: out of memory" - you can decrease memory use by increasing subdivisions in the .cfg
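
The reason this helps: darknet splits each batch into `subdivisions` chunks and sends one chunk through the GPU at a time, so a larger subdivisions value means fewer images held in memory per pass. A small Python sketch of that arithmetic (the function name is illustrative; batch=32 is taken from the .cfg posted above):

```python
def images_per_pass(batch, subdivisions):
    # Images pushed through the GPU at once; batch must divide evenly.
    if batch % subdivisions != 0:
        raise ValueError("batch must be divisible by subdivisions")
    return batch // subdivisions


for subdivisions in (2, 4, 8, 16):
    print(f"subdivisions={subdivisions}: {images_per_pass(32, subdivisions)} images per pass")
```

With the posted subdivisions=2, 16 images are processed per pass; raising it to 8 would bring that down to 4 while leaving the effective batch size at 32.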

You can speed up yolov3 by decreasing the width and height in the .cfg (multiple of 32) - that's easy so try that first, maybe you will get acceptable accuracy and FPS. You may also find that some other implementation is faster - on the cpu, opencv-dnn is faster than darknet, for example. You could also try one of these alternate network structures https://github.com/AlexeyAB/darknet/issues/3114#issuecomment-494148968

I've run a lot of experiments with them and found tiny-yolov3-pan2 to be a good mid-way point between tiny-yolo and yolov3-spp in terms of AP and FPS
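
As a rough guide for the width/height suggestion above, this Python sketch snaps a target size to a multiple of 32 and estimates the relative compute cost. It assumes cost scales with the number of input pixels, which is only an approximation; both helper names are made up for illustration:

```python
def snap_to_32(size):
    # Round to the nearest multiple of 32, never going below 32.
    return max(32, int(round(size / 32)) * 32)


def relative_cost(new_size, old_size=416):
    # Approximate cost ratio for a square input, assuming cost ~ pixel count.
    return (new_size / old_size) ** 2


for target in (416, 352, 300, 256):
    size = snap_to_32(target)
    print(f"width=height={size}: ~{relative_cost(size):.2f}x the cost of 416")
```

Actual FPS gains depend on the hardware and the implementation, so it is worth re-running ./darknet detector map after each change to see how much accuracy is lost.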

AlexeyAB / darknet

Loss avg seem to not decrease any more #3794