AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

The average loss is large even at iteration 13700 or above, why? #6113

Open huizhang2017 opened 4 years ago

huizhang2017 commented 4 years ago

Hi everyone, I have a problem when training on my own data. Even though I have trained for about 13708 iterations, the average loss is still around 30 to 40. At about iteration 2000 the average loss was already at this level, so it has not decreased since iteration 2000. Why? Has anyone else run into the same problem?

According to the training output below, the total loss is sometimes small (around 1.4) and sometimes large (e.g. 1484 or 3282), driven by a large class_loss or iou_loss. Could anyone explain this strange behavior during training? How should I fix it? Thanks.

I have another question about the line "13708: 39.763336, 37.974380 avg loss, 0.002000 rate, 25.004000 seconds, 1096640 images". Why are there two average losses? Is the first (39.763336) the validation loss and the second (37.974380) the training loss? Do I understand that right?


v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 150 Avg (IOU: 0.688547, GIOU: 0.677843), Class: 0.913514, Obj: 0.506623, No Obj: 0.007988, .5R: 0.872727, .75R: 0.345455, count: 55, class_loss = 28.654789, iou_loss = 431.445282, total_loss = 460.100067
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 161 Avg (IOU: 0.845779, GIOU: 0.839122), Class: 0.999525, Obj: 0.326064, No Obj: 0.000727, .5R: 1.000000, .75R: 1.000000, count: 1, class_loss = 0.454901, iou_loss = 0.940173, total_loss = 1.395074
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 139 Avg (IOU: 0.671898, GIOU: 0.641226), Class: 0.934388, Obj: 0.261634, No Obj: 0.003929, .5R: 0.926667, .75R: 0.300000, count: 150, class_loss = 115.049507, iou_loss = 3167.384033, total_loss = 3282.433594
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 150 Avg (IOU: 0.663388, GIOU: 0.646405), Class: 0.876559, Obj: 0.495784, No Obj: 0.006104, .5R: 0.903846, .75R: 0.269231, count: 52, class_loss = 29.064081, iou_loss = 534.841248, total_loss = 563.905334
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 161 Avg (IOU: 0.896829, GIOU: 0.894691), Class: 0.999342, Obj: 0.922417, No Obj: 0.001088, .5R: 1.000000, .75R: 1.000000, count: 1, class_loss = 0.006023, iou_loss = 3.362076, total_loss = 3.368100
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 139 Avg (IOU: 0.780473, GIOU: 0.766881), Class: 0.975726, Obj: 0.237571, No Obj: 0.002936, .5R: 1.000000, .75R: 0.690000, count: 100, class_loss = 69.861595, iou_loss = 1414.521118, total_loss = 1484.382690
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 150 Avg (IOU: 0.696069, GIOU: 0.682558), Class: 0.960978, Obj: 0.476977, No Obj: 0.011986, .5R: 0.942857, .75R: 0.342857, count: 105, class_loss = 47.378819, iou_loss = 542.144653, total_loss = 589.523499
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 161 Avg (IOU: 0.862177, GIOU: 0.862177), Class: 0.999735, Obj: 0.973714, No Obj: 0.001010, .5R: 1.000000, .75R: 1.000000, count: 1, class_loss = 0.293477, iou_loss = 1.136457, total_loss = 1.429934
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 139 Avg (IOU: 0.791763, GIOU: 0.782755), Class: 0.964215, Obj: 0.198258, No Obj: 0.002607, .5R: 0.976191, .75R: 0.761905, count: 84, class_loss = 63.966724, iou_loss = 1470.874146, total_loss = 1534.840820
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 150 Avg (IOU: 0.682726, GIOU: 0.657329), Class: 0.928080, Obj: 0.336688, No Obj: 0.010962, .5R: 0.925000, .75R: 0.300000, count: 80, class_loss = 58.216625, iou_loss = 370.827454, total_loss = 429.044067
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 161 Avg (IOU: -nan(ind), GIOU: -nan(ind)), Class: -nan(ind), Obj: -nan(ind), No Obj: 0.001361, .5R: -nan(ind), .75R: -nan(ind), count: 0, class_loss = 1.490448, iou_loss = 0.000000, total_loss = 1.490448
Syncing... Done!

13708: 39.763336, 37.974380 avg loss, 0.002000 rate, 25.004000 seconds, 1096640 images
Loaded: 0.000000 seconds
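Aside: in darknet's output the first number after the iteration is the loss of the current batch, and the second is a smoothed running average (the source updates it as `avg_loss = avg_loss*0.9 + loss*0.1`); neither is a validation loss. A small, hypothetical helper (not part of darknet) for pulling the fields out of such a status line:

```python
import re

# Hypothetical helper (not part of darknet) that extracts the fields of a
# darknet training status line such as:
#   "13708: 39.763336, 37.974380 avg loss, 0.002000 rate, ..."
STATUS_RE = re.compile(
    r"(?P<iteration>\d+):\s+(?P<loss>[\d.]+),\s+(?P<avg_loss>[\d.]+) avg loss,"
    r"\s+(?P<rate>[\d.]+) rate"
)

def parse_status(line):
    """Return iteration, current-batch loss, running-average loss, and LR."""
    m = STATUS_RE.search(line)
    if m is None:
        raise ValueError("not a darknet status line: %r" % line)
    return {k: float(v) for k, v in m.groupdict().items()}

fields = parse_status(
    "13708: 39.763336, 37.974380 avg loss, 0.002000 rate, "
    "25.004000 seconds, 1096640 images"
)
print(fields["loss"], fields["avg_loss"])
```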


Hwijune commented 4 years ago

How many images are in your dataset?

I think you need more iterations.

WongKinYiu commented 4 years ago

Region 139, Region 150, and Region 161 are three different YOLO layers. Usually only one object is assigned to Region 161, so its loss is small. Almost all of your objects are assigned to Region 139, which means the objects in your dataset are small.

First, check the mAP performance. Then recalculate the anchors for your dataset to check whether they are the reason for the high loss of Region 139.
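The anchor recalculation suggested here is built into this repo (`./darknet detector calc_anchors data/obj.data -num_of_clusters 9 -width 608 -height 608`). As a rough illustration of the idea only, here is a toy, stdlib-only k-means sketch over (w, h) pairs using 1 - IoU as the distance; the real implementation differs in details:

```python
def iou_wh(a, b):
    """IoU of two (w, h) boxes assumed to share the same center."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(boxes, k=9, iters=20):
    """Toy k-means over (w, h) pairs with 1 - IoU distance, seeded by area."""
    boxes = sorted(boxes, key=lambda b: b[0] * b[1])
    centers = [boxes[i * len(boxes) // k] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda i: iou_wh(b, centers[i]))
            clusters[best].append(b)
        centers = [
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return sorted(centers, key=lambda c: c[0] * c[1])

# Synthetic example: many small ~24 px boxes plus some ~200 px boxes.
boxes = [(24 + i % 3, 24 + i % 2) for i in range(20)] + \
        [(200 + i % 3, 200) for i in range(20)]
print(kmeans_anchors(boxes, k=2))
```

If the anchors computed for your dataset come out much smaller than the defaults in the cfg (12,16 ... 459,401), that would support the small-objects explanation for the high loss on Region 139.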

huizhang2017 commented 4 years ago

> How many images are in your dataset?
>
> I think you need more iterations.

The total number of training images is about 2340. I will keep training for another 10000 iterations. Thanks.
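For context, a back-of-the-envelope check with the numbers quoted in this thread (1096640 images seen at iteration 13708, ~2340 training images) suggests the model has already made several hundred passes over the data, so more iterations alone may not change much:

```python
# Numbers taken from this thread; treat them as approximate.
images_seen = 1_096_640   # "images" counter on the status line at iter 13708
dataset_size = 2340
epochs = images_seen / dataset_size
print(f"~{epochs:.0f} epochs over the dataset so far")

# AlexeyAB's README suggests max_batches = classes * 2000, but not less than
# 6000 and not less than the number of training images. With 52 classes:
classes = 52
suggested_max_batches = max(classes * 2000, 6000, dataset_size)
print(suggested_max_batches)  # 104000
```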

huizhang2017 commented 4 years ago

> Region 139, Region 150, and Region 161 are three different YOLO layers. Usually only one object is assigned to Region 161, so its loss is small. Almost all of your objects are assigned to Region 139, which means the objects in your dataset are small.
>
> First, check the mAP performance. Then recalculate the anchors for your dataset to check whether they are the reason for the high loss of Region 139.

Thank you for the good advice, I will try it. All the object boxes are set to 96x96 on the original images, which are around 2000x2400.
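A quick scaling check (assuming the 608x608 network input from the cfg posted later in this thread) shows why those objects end up on the small-object head, consistent with the observation above:

```python
# Numbers from this thread: 96x96 boxes on ~2000x2400 images, 608x608 net input.
net_w = net_h = 608
img_w, img_h = 2000, 2400
box = 96

scaled_w = box * net_w / img_w   # box width after resize to the net input
scaled_h = box * net_h / img_h
print(f"~{scaled_w:.0f}x{scaled_h:.0f} px at the network input")

# That is close to the smallest default anchors (12,16 / 19,36 / 40,28),
# i.e. the objects land on the finest-resolution YOLO head (Region 139).
```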

flowzen1337 commented 4 years ago

@huizhang2017 can you post the yolov4.cfg you use for training, please? It would help to tackle your problem :-)

How many classes are you training? Do all classes have roughly the same number of annotations, or are there big differences?

huizhang2017 commented 4 years ago

> @huizhang2017 can you post the yolov4.cfg you use for training, please? It would help to tackle your problem :-)
>
> How many classes are you training? Do all classes have roughly the same number of annotations, or are there big differences?

There are 52 classes and the data is balanced. The object boxes are set to 96x96, and the image sizes are around 2000x2400. I copied yolov4-custom.cfg and renamed it to yolo-obj.cfg.

The content of the cfg file is as follows:

[net]
# Testing
# batch=1
# subdivisions=1
# Training
batch=64
subdivisions=40
width=608
height=608
channels=3
momentum=0.949
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.001
burn_in=1000
max_batches = 104000
policy=steps
steps=83200,93600
scales=.1,.1

cutmix=1
mosaic=0

# :104x104 54:52x52 85:26x26 104:13x13 for 416

[convolutional] batch_normalize=1 filters=32 size=3 stride=1 pad=1 activation=mish

# Downsample

[convolutional] batch_normalize=1 filters=64 size=3 stride=2 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=64 size=1 stride=1 pad=1 activation=mish

[route] layers = -2

[convolutional] batch_normalize=1 filters=64 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=32 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=64 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=64 size=1 stride=1 pad=1 activation=mish

[route] layers = -1,-7

[convolutional] batch_normalize=1 filters=64 size=1 stride=1 pad=1 activation=mish

# Downsample

[convolutional] batch_normalize=1 filters=128 size=3 stride=2 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=64 size=1 stride=1 pad=1 activation=mish

[route] layers = -2

[convolutional] batch_normalize=1 filters=64 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=64 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=64 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=64 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=64 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=64 size=1 stride=1 pad=1 activation=mish

[route] layers = -1,-10

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=mish

# Downsample

[convolutional] batch_normalize=1 filters=256 size=3 stride=2 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=mish

[route] layers = -2

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=128 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=128 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=128 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=128 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=128 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=128 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=128 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=128 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=mish

[route] layers = -1,-28

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=mish

# Downsample

[convolutional] batch_normalize=1 filters=512 size=3 stride=2 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=mish

[route] layers = -2

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=mish

[route] layers = -1,-28

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=mish

# Downsample

[convolutional] batch_normalize=1 filters=1024 size=3 stride=2 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=mish

[route] layers = -2

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=512 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=512 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=512 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=512 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=mish

[route] layers = -1,-16

[convolutional] batch_normalize=1 filters=1024 size=1 stride=1 pad=1 activation=mish stopbackward=800

##########################

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=1024 activation=leaky

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=leaky

# SPP

[maxpool] stride=1 size=5

[route] layers=-2

[maxpool] stride=1 size=9

[route] layers=-4

[maxpool] stride=1 size=13

[route] layers=-1,-3,-5,-6

# End SPP

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=1024 activation=leaky

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky

[upsample] stride=2

[route] layers = 85

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky

[route] layers = -1, -3

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=512 activation=leaky

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=512 activation=leaky

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=leaky

[upsample] stride=2

[route] layers = 54

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=leaky

[route] layers = -1, -3

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=256 activation=leaky

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=256 activation=leaky

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=leaky

##########################

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=256 activation=leaky

[convolutional] size=1 stride=1 pad=1 filters=171 activation=linear

[yolo] mask = 0,1,2 anchors = 12, 16, 19, 36, 40, 28, 36, 75, 76, 55, 72, 146, 142, 110, 192, 243, 459, 401 classes=52 num=9 jitter=.3 ignore_thresh = .7 truth_thresh = 1 scale_x_y = 1.2 iou_thresh=0.213 cls_normalizer=1.0 iou_normalizer=0.07 iou_loss=ciou nms_kind=greedynms beta_nms=0.6 max_delta=5

[route] layers = -4

[convolutional] batch_normalize=1 size=3 stride=2 pad=1 filters=256 activation=leaky

[route] layers = -1, -16

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=512 activation=leaky

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=512 activation=leaky

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=512 activation=leaky

[convolutional] size=1 stride=1 pad=1 filters=171 activation=linear

[yolo] mask = 3,4,5 anchors = 12, 16, 19, 36, 40, 28, 36, 75, 76, 55, 72, 146, 142, 110, 192, 243, 459, 401 classes=52 num=9 jitter=.3 ignore_thresh = .7 truth_thresh = 1 scale_x_y = 1.1 iou_thresh=0.213 cls_normalizer=1.0 iou_normalizer=0.07 iou_loss=ciou nms_kind=greedynms beta_nms=0.6 max_delta=5

[route] layers = -4

[convolutional] batch_normalize=1 size=3 stride=2 pad=1 filters=512 activation=leaky

[route] layers = -1, -37

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=1024 activation=leaky

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=1024 activation=leaky

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=1024 activation=leaky

[convolutional] size=1 stride=1 pad=1 filters=171 activation=linear

[yolo] mask = 6,7,8 anchors = 12, 16, 19, 36, 40, 28, 36, 75, 76, 55, 72, 146, 142, 110, 192, 243, 459, 401 classes=52 num=9 jitter=.3 ignore_thresh = .7 truth_thresh = 1 random=1 scale_x_y = 1.05 iou_thresh=0.213 cls_normalizer=1.0 iou_normalizer=0.07 iou_loss=ciou nms_kind=greedynms beta_nms=0.6 max_delta=5
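One consistency check worth noting on the cfg above: the [convolutional] layer directly before each [yolo] layer must have filters = (classes + 5) * number of masks, and the posted config satisfies this:

```python
# Check that filters=171 before each [yolo] layer matches classes=52.
classes = 52
masks_per_head = 3   # mask = 0,1,2 / 3,4,5 / 6,7,8
filters = (classes + 5) * masks_per_head
print(filters)  # 171
```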

caikw0602 commented 4 years ago

I have the same training problem and don't know how to address it. Looking forward to news from you.

huizhang2017 commented 4 years ago

> I have the same training problem and don't know how to address it. Looking forward to news from you.

I haven't solved this problem yet. I suspect it is caused by the data augmentation, so I am experimenting with the "jitter" parameter. But if I set jitter to a small value, mAP drops to zero, which is very strange. You can check issue #6164 for details; I am also waiting for an answer there.

NitinDatta8 commented 4 years ago

Your subdivisions cannot be 40; it should be a value like 16, 32, or 64 that divides the batch size evenly. I guess this is the mistake. I didn't check the whole cfg, just the top part.
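The point about subdivisions can be checked mechanically: darknet splits each batch into `subdivisions` mini-batches, so the batch size should divide evenly. A minimal sketch (hypothetical helper, not darknet code):

```python
def valid_subdivisions(batch, subdivisions):
    """True if the batch splits evenly into mini-batches (hypothetical check)."""
    return batch % subdivisions == 0

print(valid_subdivisions(64, 40))  # False -- 64/40 is not an integer
print(valid_subdivisions(64, 16))  # True
print(valid_subdivisions(64, 32))  # True
```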