AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Test models with good hyperparameters - +4.8% mAP@0.5 on MS COCO test-dev #4430

Closed: AlexeyAB closed this issue 3 years ago

AlexeyAB commented 4 years ago

Test models with good hyperparameters: https://github.com/AlexeyAB/darknet/issues/3114#issuecomment-557499650 and https://github.com/AlexeyAB/darknet/issues/4147#issuecomment-560165394

batch=64
subdivisions=8
width=608
height=608
channels=3
momentum=0.949
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.00261 # or 0.122 (so iou~=3.29 and cls & obj ~= 47) as in @glenn-jocher yolov3
burn_in=1000
max_batches = 500500
policy=steps
steps=400000,450000
scales=.1,.1

mosaic=1

[yolo] # or [Gaussian_yolo]
...

jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1
scale_x_y = 1.05   # 1.05, 1.10, 1.20
iou_thresh=0.213
cls_normalizer=1.0
iou_normalizer=0.07
uc_normalizer=0.07
iou_loss=ciou
nms_kind=greedynms
beta_nms=0.6

sctrueew commented 4 years ago

Hi @AlexeyAB

So we have to change uc_normalizer from 1.0 to 0.1 in Gaussian_yolov3_BDD.cfg, is that right?

AlexeyAB commented 4 years ago

@zpmmehrdad Yes.
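
For reference, the relevant lines of the [Gaussian_yolo] section would then look roughly like this (a sketch only; the rest of Gaussian_yolov3_BDD.cfg is unchanged, and note that the cfg at the top of this issue uses 0.07 rather than 0.1):

[Gaussian_yolo]
...
iou_loss=ciou
iou_normalizer=0.07
cls_normalizer=1.0
uc_normalizer=0.07   # lowered from the default 1.0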

AlexeyAB commented 4 years ago

@glenn-jocher Hi,

Do you use a static learning rate = 0.00261? Or do you use SGDR (cosine), i.e. a continuously decreasing learning rate?

learning_rate=0.00261
momentum=0.949

glenn-jocher commented 4 years ago

@AlexeyAB I use the original darknet LR scheduler, with drops of *=0.1 at 80% and 90% of total epochs. It's true that a smoother drop may have a slight benefit (I think the BoF paper showed this), but it's likely a very minimal effect. See https://github.com/ultralytics/yolov3/issues/238
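
For reference, the cfg at the top of this issue encodes the same kind of schedule with the steps policy: steps=400000,450000 are about 80% and 90% of max_batches=500500, and scales=.1,.1 multiplies the learning rate by 0.1 at each step:

learning_rate=0.00261
burn_in=1000
max_batches=500500
policy=steps
steps=400000,450000   # ~80% and ~90% of max_batches
scales=.1,.1          # LR *= 0.1 at each step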

AlexeyAB commented 4 years ago

At least for some datasets, it seems that iou_normalizer=0.07 is too low a value for GIoU, and iou_normalizer=0.5 is much better: https://github.com/AlexeyAB/darknet/issues/3874#issuecomment-561064425

glenn-jocher commented 4 years ago

@AlexeyAB yes, you should definitely examine the value of each loss component to ensure that the balancing parameters produce roughly equal loss across the 3 components. In ultralytics/yolov3 they produce magnitudes of about 5, 5, 5 for GIoU, obj and cls on COCO at epoch 0. If they produce different magnitudes here, you should adjust accordingly.

BTW, one thing that has always bothered me about the ultralytics/yolov3 loss function is that each of the yolo layers is treated equally (because we take the mean of all the elements in each layer), and I think here you sum all the elements in each layer instead. Is this correct?

In all of the papers I always see mAP_small underperform mAP_medium and mAP_large, and the small-object output grid points far outnumber the large-object output grid points, so it makes sense to me that the small-object layer should generate more loss (yet this is not currently the case at ultralytics). Unfortunately, I experimented with this change in the past without success. What do you think?

AlexeyAB commented 4 years ago

@glenn-jocher

BTW, one thing that has always bothered me about the ultralytics/yolov3 loss function is that each of the yolo layers is treated equally (because we take the mean of all the elements in each layer), and I think here you sum all the elements in each layer instead. Is this correct?

What do you mean? Each final activation just produces a separate delta, which is backpropagated without changes.

In all of the papers I always see mAP_small underperform mAP_medium and mAP_large, and the small-object output grid points far outnumber the large-object output grid points, so it makes sense to me that the small-object layer should generate more loss (yet this is not currently the case at ultralytics). Unfortunately, I experimented with this change in the past without success. What do you think?

It is just because smaller objects have fewer pixels, especially after resizing to the network size of 416x416. I think that for small objects we should use more anchors, routes to the lower layers, and special blocks (which have many layers but don't lose detailed information).

AlexeyAB commented 4 years ago

@glenn-jocher Also did you think about rotation/scale-invariant features like SIFT/SURF (rotation/scale-invariant conv-layers or something else)?

glenn-jocher commented 4 years ago

@AlexeyAB yes I've worked a lot with SURF and SIFT, but don't confuse these with object detection. SURF is a faster version of SIFT; they are not AI algorithms. Their purpose is to match points in one image to points in a second image by comparing feature vectors between possible point pairs. This is useful in Structure From Motion (SFM) applications like AR, where it's necessary to know the camera motion between frames to reconstruct a 3D scene, or simply to find, in a second image, an object that exists in a first image.

But it does not generalize at all; for example, SURF points from a blue car will never match SURF points on a red car, so in this sense it is completely separate from object detection.

Yes, a more targeted strategy for the lower layers is a good idea. But the point I was making is that I think the darknet loss function (if there are no balancers) treats each element the same, whereas the ultralytics loss treats each layer the same (i.e. for 416 there would be 507 + 2028 + 8112 = 10647 loss elements in the 3 layers).

The current ultralytics loss reduces the value of the lower layer elements because it takes a mean() of each layer for the total loss: loss = mean(layer0_loss) + mean(layer1_loss) + mean(layer2_loss)
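
A minimal PyTorch-style sketch of the two reductions (the random tensors stand in for per-element losses; grid sizes assume a 416x416 input with 3 anchors per cell):

import torch

# hypothetical per-element losses for the 3 output layers at 416x416:
# 13*13*3 = 507, 26*26*3 = 2028, 52*52*3 = 8112 elements (10647 total)
per_layer_losses = [torch.rand(507), torch.rand(2028), torch.rand(8112)]

# per-layer mean: each layer contributes equally, so a single element of the
# 52x52 (small-object) layer gets ~16x less weight than a 13x13 element
loss_mean_per_layer = sum(l.mean() for l in per_layer_losses)

# global mean over all 10647 elements: the small-object layer dominates
# in proportion to its element count
loss_global_mean = torch.cat(per_layer_losses).mean()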

I'm thinking that if I take the mean over the entire 10647 anchors instead, this would result in more effective training of the small-object layers. I tried this before with poor results, but maybe I should try again. The current COCO results are:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.243 <--
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.450
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.514

 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.422 <--
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.640
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.707

AlexeyAB commented 4 years ago

@glenn-jocher

The current ultralytics loss reduces the value of the lower layer elements because it takes a mean() of each layer for the total loss: loss = mean(layer0_loss) + mean(layer1_loss) + mean(layer2_loss)

Does this calculation only affect the display of the total loss on the screen? Or does it somehow affect the value of each delta that will be backpropagated?

F.e. (example images omitted): both SIFT and SURF are not color-invariant.

A DNN can achieve color/scale/rotation invariance only due to a large number of filters (millions of parameters). SURF is very demanding on resources, so we will not be able to build the network entirely out of millions of SURFs. But perhaps we can apply a certain number of them in some layers, for example with subsampling. Or something else; there are many such algorithms: SIFT, SURF, BRIEF, ORB...

But it does not generalize at all; for example, SURF points from a blue car will never match SURF points on a red car, so in this sense it is completely separate from object detection.

It depends on whether we want to detect only red cars or cars of any color. Otherwise, we will either have to add SURF descriptors for the blue car etc., or use color-invariant SURF descriptors: https://link.springer.com/chapter/10.1007/978-3-642-35740-4_6


SURF is not an object detection method; it is a method of matching areas in an image that can be used for detecting/tracking objects with rotation/scale invariance.

In all SURF tutorials, SURF is demonstrated as a method for comparing whole images rather than individual objects. The reason is just that SURF has different efficiency for different keypoints in the image: a separate object may or may not contain good points, while the whole image is much more likely to contain them.

Many years ago I successfully used the SURF extractor (Ptr<SurfDescriptorExtractor> extractor = new SurfDescriptorExtractor(); in the old OpenCV C++ API) to track an object with rotation and scale invariance, with occlusion and long disappearances, and with instant training, because we can calculate SURF descriptors for any point in any area of the image (including where the object is) and save them to a file.

After detection by using the SURF extractor, the area with the object was rotated and scaled, and then other refinement algorithms for object recognition/detection/comparison were applied asynchronously: similarity checks (PSNR and SSIM) on the GPU, Haar cascades / Viola–Jones object detection, ...
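
For reference, a minimal Python/OpenCV sketch of this kind of SURF matching with the newer API (SURF lives in opencv-contrib and requires a build with OPENCV_ENABLE_NONFREE; the filenames and threshold are placeholders):

import cv2

img1 = cv2.imread("object.png", cv2.IMREAD_GRAYSCALE)  # saved object template
img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)   # frame to search in

surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
kp1, des1 = surf.detectAndCompute(img1, None)  # keypoints + 64-d descriptors
kp2, des2 = surf.detectAndCompute(img2, None)

# match descriptors and keep pairs that pass Lowe's ratio test
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.7 * n.distance]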

nyj-ocean commented 4 years ago

In my case, iou_normalizer=0.07 seems better than iou_normalizer=0.5 in yolov3+Gaussian+CIoU

iou_normalizer=0.07: (loss chart image omitted)

iou_normalizer=0.5: (loss chart image omitted)

AlexeyAB commented 4 years ago

@nyj-ocean

glenn-jocher commented 4 years ago

@nyj-ocean oh that's an impressive difference! @AlexeyAB I don't think we should get too hung up on exactly the best normalizer for every situation, because I think they will all be different depending on many factors, including the custom data, number of classes, class frequency, etc.

I think a robust balancing method would probably sacrifice epoch 0 simply to see what the default balancers produce (i.e. 1, 1, 1), and then restart training using those results to balance the loss components. The steps would roughly be (a sketch follows the list):

  1. Set balancers to 1, 1, 1 for box, obj, cls
  2. Train up to 1 epoch / 10 minutes / 1000 iterations, saving loss component means.
  3. Set balancers to inverse loss component means.
  4. Train normally.
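
A rough sketch of steps 2-3 (the component names and numbers are illustrative, not an existing API):

# per-component loss means recorded during the short probe run (step 2);
# the numbers here are made up for illustration
probe_means = {"box": 5.0, "obj": 0.5, "cls": 2.0}

# step 3: set each balancer to the inverse of its component's mean,
# so all components start the real run with comparable magnitudes
balancers = {k: 1.0 / v for k, v in probe_means.items()}
# -> {'box': 0.2, 'obj': 2.0, 'cls': 0.5}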

AlexeyAB commented 4 years ago

@glenn-jocher Yes, it's just that I think we should keep 1, 0.1, 0.1 for box, obj, cls, at least for high AP@75 and maybe for AP@50 too.

tuteming commented 4 years ago

From your cfg.zip, both cfg files contain [yolo] layers rather than [Gaussian_yolo] layers. Please confirm. Thanks.

AlexeyAB commented 4 years ago

@nyj-ocean Thanks. In your cfg-file there are [yolo] layers instead of [Gaussian_yolo].

AlexeyAB commented 4 years ago

Tested: https://github.com/AlexeyAB/darknet/blob/master/cfg/csresnext50-panet-spp-original-optimal.cfg

+4.8% mAP@0.5 on MS COCO test-dev

(results screenshot omitted)

nyj-ocean commented 4 years ago

@AlexeyAB

(images omitted)

(attachment: Squeeze-and-Excitation Networks.pdf)

AlexeyAB commented 4 years ago

@nyj-ocean Squeeze-and-Excitation blocks were already implemented in enet-coco.cfg (EfficientNetB0-Yolov3) 4 months ago, but it is very slow: https://github.com/AlexeyAB/darknet#pre-trained-models

Open a new issue. Maybe I will benchmark the SE-module and check whether I can improve its speed.
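
For reference, the SE pattern in darknet cfg terms looks roughly like the blocks in enet-coco.cfg (a sketch only; the filter counts and activations are illustrative and must match the surrounding feature map):

[avgpool]            # squeeze: global average pooling over H x W

[convolutional]      # excitation: channel reduction
filters=16
size=1
activation=relu

[convolutional]      # excitation: expansion back, gates in [0,1]
filters=64
size=1
activation=logistic

[scale_channels]     # re-weight the original feature map channel-wise
from=-4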

nyj-ocean commented 4 years ago

@AlexeyAB

Squeeze-and-Excitation blocks were already implemented in enet-coco.cfg (EfficientNetB0-Yolov3) 4 months ago

I added Squeeze-and-Excitation blocks to yolov3.cfg and then trained on my dataset:

Model          mAP
yolov3         86.03
yolov3+senet   85.78

The mAP of yolov3+senet is lower than that of yolov3. The result is strange.

AlexeyAB commented 4 years ago

@nyj-ocean If you add an SE-block to the darknet53 backbone, then you should retrain the classifier to produce a new pre-trained weights file.
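
For example, something along these lines, assuming an ImageNet-style classifier setup (the .data/.cfg names and the cut-layer index are placeholders):

./darknet classifier train cfg/imagenet1k.data cfg/darknet53_se.cfg
# then cut the retrained backbone for detector pre-training:
./darknet partial cfg/darknet53_se.cfg backup/darknet53_se.weights darknet53_se.conv.74 74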

becauseofAI commented 4 years ago

In my case, iou_normalizer=0.07 seems better than iou_normalizer=0.5 in yolov3+Gaussian+CIoU

iou_normalizer=0.07: (loss chart image omitted)

iou_normalizer=0.5: (loss chart image omitted)

@AlexeyAB @nyj-ocean The "good hyperparameters" are effective. But why does the loss function not converge normally?