AlexeyAB closed this issue 3 years ago.
Hi @AlexeyAB
So we have to change uc_normalizer from 1.0 to 0.1 in Gaussian_yolov3_BDD.cfg, is that right?
@zpmmehrdad Yes.
@glenn-jocher Hi,
Do you use a static learning rate of 0.00261, or do you use SGDR (cosine), i.e. a constantly decreasing learning rate?
learning_rate=0.00261
momentum=0.949
@AlexeyAB I use the original darknet LR scheduler, with drops of *=0.1 at 80% and 90% of total epochs. It's true that a smoother drop may have a slight benefit (I think the BoF paper showed this), but it's likely a very minimal effect. See https://github.com/ultralytics/yolov3/issues/238
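That step schedule can be sketched in a few lines (a hedged illustration: the 80%/90% milestones and learning_rate=0.00261 come from the messages above, while the 100-iteration horizon is made up):

```python
def step_lr(base_lr, iteration, max_iterations):
    """Darknet-style step schedule: LR is multiplied by 0.1 when training
    passes 80% of the total iterations, and again at 90%."""
    lr = base_lr
    for milestone in (0.8, 0.9):
        if iteration >= milestone * max_iterations:
            lr *= 0.1
    return lr

print(step_lr(0.00261, 0, 100))    # base LR until the first drop
print(step_lr(0.00261, 80, 100))   # x0.1 after 80% of training
print(step_lr(0.00261, 95, 100))   # x0.01 after 90% of training
```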
At least for some datasets, it seems that iou_normalizer=0.07 is too low a value for GIoU, and iou_normalizer=0.5 is much better: https://github.com/AlexeyAB/darknet/issues/3874#issuecomment-561064425
@AlexeyAB yes, you should definitely examine the value of each loss component to ensure that the balancing parameters produce roughly equal loss between the 3 components. In ultralytics/yolov3 they produce magnitudes of about 5, 5, 5 for GIoU, obj and cls on COCO at epoch 0. If they produce different magnitudes here, you should adjust accordingly.
BTW, one thing that has always bothered me about the ultralytics/yolov3 loss function is that each of the yolo layers is treated equally (because we take the mean of all the elements in each layer), and I think here you sum all the elements in each layer instead. Is this correct?
In all of the papers I always see mAP_small underperform mAP_medium and mAP_large, and the small-object output grid points far outnumber the large-object output grid points, so it makes sense to me that the small-object layer should generate more loss (yet this is not currently the case at ultralytics). Unfortunately I experimented with this change in the past without success. What do you think?
@glenn-jocher
> BTW, one thing that has always bothered me about the ultralytics/yolov3 loss function is that each of the yolo layers is treated equally (because we mean all the elements in each layer), and I think here you sum all the elements in each layer instead. Is this correct?
What do you mean? Each final activation just produces a separate delta, which is backpropagated without changes.
> In all of the papers I always see mAP_small underperform mAP_large and medium, and the smaller object output grid points far outnumber the large object output grid points, so it makes sense to me that the small object layer should generate more loss (yet this is not currently the case at ultralytics). I experimented with this change in the past unsuccessfully unfortunately. What do you think?
It is just because smaller objects have fewer pixels, especially after resizing to the network size of 416x416. I think we should use more anchors, routes to lower layers, and special blocks (which have many layers but don't lose detailed information) for small objects.
@glenn-jocher Also did you think about rotation/scale-invariant features like SIFT/SURF (rotation/scale-invariant conv-layers or something else)?
@AlexeyAB yes, I've worked a lot with SURF and SIFT, but don't confuse these with object detection. SURF is a faster version of SIFT; they are not AI algorithms. Their purpose is to match points in one image to points in a second image by comparing feature vectors between possible point pairs. This is useful in Structure From Motion (SFM) applications like AR, where it's necessary to know the camera motion between frames to reconstruct a 3D scene, or simply to find an object in a second image that exists in a first image.
But it does not generalize at all, so for example SURF points from a blue car will never match to SURF points on a red car, so in this sense it is completely separate from object detection.
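The point-matching idea described above (comparing feature vectors between candidate point pairs, as SURF/SIFT matchers do) can be sketched in plain NumPy. The descriptors below are made-up 2-D stand-ins for real 64-D SURF vectors, and the ratio test is Lowe's standard heuristic:

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.75):
    """Brute-force nearest-neighbour matching with Lowe's ratio test.
    desc_a, desc_b: (N, D) arrays of feature descriptors."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        j1, j2 = np.argsort(dists)[:2]
        # Accept only if the best match is clearly better than the 2nd best.
        if dists[j1] < ratio * dists[j2]:
            matches.append((i, int(j1)))
    return matches

# Toy example: point 0 in image A matches point 1 in image B, and vice versa.
a = np.array([[1.0, 0.0], [0.0, 5.0]])
b = np.array([[0.0, 4.9], [1.05, 0.0], [10.0, 10.0]])
print(match_descriptors(a, b))  # -> [(0, 1), (1, 0)]
```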
Yes, a more targeted strategy for lower layers is a good idea. But the point I was making is that I think the darknet loss function (if there are no balancers) treats each element the same, whereas the ultralytics loss treats each layer the same (i.e. for 416 there would be 507 + 2028 + 8112 = 10647 loss elements across the 3 layers).
The current ultralytics loss reduces the value of the lower layer elements because it takes a mean() of each layer for the total loss:
loss = mean(layer0_loss) + mean(layer1_loss) + mean(layer2_loss)
I'm thinking that if I take the mean of the entire 10647 anchors instead, this would result in more effective training of the smaller-object layers. I tried this before with poor effect, but maybe I should try again. The current COCO results are:
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.243 <--
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.450
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.514
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.422 <--
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.640
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.707
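The per-layer mean() vs. whole-tensor mean distinction can be made concrete with NumPy (the uniform per-element losses are made up; the 507/2028/8112 element counts are for the 13x13, 26x26 and 52x52 grids at 416x416 with 3 anchors each):

```python
import numpy as np

# Made-up per-element losses for the three YOLO output layers.
layer_losses = [np.full(507, 1.0), np.full(2028, 1.0), np.full(8112, 1.0)]

# Per-layer mean: each *layer* contributes equally (1/3 of the total here),
# so the dense small-object layer is down-weighted per element.
per_layer_mean_loss = sum(l.mean() for l in layer_losses)

# Global mean over all 10647 elements: each *element* contributes equally,
# so the 52x52 small-object layer carries 8112/10647 of the total.
global_mean_loss = np.concatenate(layer_losses).mean()

print(per_layer_mean_loss, global_mean_loss)  # 3.0 1.0
```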
@glenn-jocher
> The current ultralytics loss reduces the value of the lower layer elements because it takes a mean() of each layer for the total loss:
> loss = mean(layer0_loss) + mean(layer1_loss) + mean(layer2_loss)
Does this calculation only affect the display of the total loss on the screen, or does it somehow affect the value of each delta that will be backpropagated?
For example:
output[i] = 0.2 (for x=2, y=3, anchors=1, yolo-layer-3)
delta_class[i] = 1 - p = 1 - 0.2 = 0.8 (for the same x=2, y=3, anchors=1, yolo-layer-3)
Then, after this loss calculation loss = mean(layer0_loss) + mean(layer1_loss) + mean(layer2_loss), what value will be back-propagated in ultralytics-yolo: is it 0.8, or something else?

Both are not color invariant.
A DNN can achieve color/scale/rotation invariance only due to a large number of filters (millions of parameters). SURF is very demanding on resources, so we will not be able to build the network entirely out of millions of SURFs. But perhaps we can apply a certain number of them in some layers, for example with subsampling. Or something else; there are many algorithms: SIFT, SURF, BRIEF, ORB...
> But it does not generalize at all, so for example SURF points from a blue car will never match to SURF points on a red car, so in this sense it is completely separate from object detection.
It depends on whether we want to detect only red cars or cars of any color. Otherwise, we will either have to add SURF descriptors for the blue car, etc., or use color-invariant SURF descriptors: https://link.springer.com/chapter/10.1007/978-3-642-35740-4_6
SURF is not an object detection method; it is a method of matching areas in an image that can be used for detecting/tracking objects with rotation/scale invariance.
In all SURF tutorials, SURF is demonstrated as a method for comparing whole images rather than individual objects. The reason is just that SURF has different efficiencies for different key points in the image: a separate object may or may not have good points, but the whole image is much more likely to have them.
Many years ago I successfully used the SURF extractor `Ptr<SurfDescriptorExtractor> extractor = new SurfDescriptorExtractor();` to track an object with rotation and scale invariance, with occlusion and long disappearances, and with instant training, because we can calculate SURF descriptors for any point in any area of the image (including where the object is) and save them to a file.
After detection using the SURF extractor, the area with the object was rotated and scaled, and then other refinement algorithms for object recognition/detection/comparison were applied asynchronously, like similarity checks (PSNR and SSIM) on the GPU, Haar cascades / Viola–Jones object detection, ...
In my case, iou_normalizer=0.07 seems better than iou_normalizer=0.5 in yolov3+Gaussian+CIoU (charts attached for iou_normalizer=0.07 and iou_normalizer=0.5).
@nyj-ocean oh that's an impressive difference! @AlexeyAB I don't think we should get too hung up on exactly the best normalizer for every situation, because I think they will all be different depending on many factors, including the custom data, number of classes, class frequency, etc.
I think a robust balancing method would probably sacrifice epoch 0 simply to see what the default balancers produce (i.e. 1, 1, 1), and then restart training using those results to balance the loss components. The steps would roughly be:
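Presumably the rebalancing amounts to something like this sketch (all of the measured magnitudes below are hypothetical, not measured values):

```python
# Hypothetical loss-component magnitudes observed after an epoch trained
# with all normalizers set to 1 (box/GIoU, objectness, classification).
measured = {"box": 12.0, "obj": 4.0, "cls": 2.0}

# Rescale each normalizer so the three components would contribute equally,
# targeting the mean of the observed magnitudes.
target = sum(measured.values()) / len(measured)
normalizers = {k: target / v for k, v in measured.items()}
print(normalizers)  # the dominant component is down-weighted, the weak one boosted
```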
@glenn-jocher Yes, I just think we should keep 1, 0.1, 0.1 for box, obj, cls, at least for high AP@75 and maybe for AP@50 too.
From your cfg.zip, both of those cfg files contain [yolo] layers rather than [Gaussian_yolo] layers. Please confirm, thanks.
@nyj-ocean Thanks. In your cfg-file there are [yolo] layers instead of [Gaussian_yolo].
Tested: https://github.com/AlexeyAB/darknet/blob/master/cfg/csresnext50-panet-spp-original-optimal.cfg
+4.8% mAP@0.5 on MS COCO test-dev
@nyj-ocean
Squeeze-and-Excitation blocks were already implemented in enet-coco.cfg (EfficientNetB0-Yolov3) 4 months ago, but it is very slow: https://github.com/AlexeyAB/darknet#pre-trained-models
Open a new issue. Maybe I will benchmark the SE-module and check whether I can improve SE speed.
@AlexeyAB
> Squeeze-and-Excitation blocks are already implemented in enet-coco.cfg (EfficientNetB0-Yolov3) 4 months ago
I added Squeeze-and-Excitation blocks to yolov3.cfg, then trained on my dataset:

| model | mAP |
|---|---|
| yolov3 | 86.03 |
| yolov3+senet | 85.78 |

The mAP of yolov3+senet is lower than that of yolov3. The result is strange.
@nyj-ocean If you add an SE-block to the darknet53 backbone, then you should retrain the classifier to obtain a new pre-trained weights file.
> In my case, iou_normalizer=0.07 seems better than iou_normalizer=0.5 in yolov3+Gaussian+CIoU.
@AlexeyAB @nyj-ocean The "good hyperparameters" are effective. But why does the loss function not converge normally?
Test models with good hyperparameters: https://github.com/AlexeyAB/darknet/issues/3114#issuecomment-557499650 and https://github.com/AlexeyAB/darknet/issues/4147#issuecomment-560165394
- iou_normalizer=1 for [yolo]
- iou_normalizer=0.07 for [yolo] + C/D/GIoU
- iou_normalizer=0.1 and uc_normalizer=0.1 for [Gaussian_yolo]
- iou_normalizer=0.07 and uc_normalizer=0.07 for [Gaussian_yolo] + C/D/GIoU
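As a cfg fragment (a sketch only; mask/anchors/classes and layer placement follow the usual darknet cfg conventions and are omitted here), the [Gaussian_yolo] + CIoU case from the list above would look roughly like:

```ini
[Gaussian_yolo]
# ... usual keys: mask, anchors, classes, num, jitter ...
iou_loss=ciou
iou_normalizer=0.07
uc_normalizer=0.07
```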