AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

Removing principal issues by POLY-YOLO approach #5923

Open Mastemace opened 4 years ago

Mastemace commented 4 years ago

Hi, thank you for your work, it's great. I'm working with a team of mine on YOLO too. We have published a paper where we show that YOLO contains two principal issues that lead to a situation where labels are overwritten and the network's capacity is not used efficiently. Removing the two issues is easy, and it is described in the paper. It leads to a significant increase in precision while the processing speed is preserved. Right now, we are adding DIoU and Mish into our project, and it seems the precision increases even further.

Because you are using the same neck and head in YOLOv4 as YOLOv3 has, even your approach includes the principal issues. So I would ask whether you would consider improving your YOLO according to our proposal. If you want, we can connect through mail or webcam, and I can explain the theory and principles if something is unclear.

Furthermore, we propose a way YOLO can easily be extended with instance segmentation functionality without a big negative impact on processing speed. Again, I can give you a deeper explanation if you want.

If you are interested, see: https://arxiv.org/abs/2005.13243

My email address for faster communication can be found here: https://ifm.osu.eu/petr-hurtik/27481/

Petr

WongKinYiu commented 4 years ago

Hello,

@AlexeyAB added rewritten-label counting in https://github.com/AlexeyAB/darknet/commit/6c6f04a9b3960edda232c3edd847a6704b946ee3, and the rewritten rate of YOLOv4 is about 1.7%.

Mastemace commented 4 years ago

Hello,

@AlexeyAB added rewritten-label counting in 6c6f04a, and the rewritten rate of YOLOv4 is about 1.7%.

Yeah, that depends on the dataset, anchors, and resolution. It is evident that datasets with a small number of boxes per image (COCO belongs here) will not suffer from it. We observed that datasets with many boxes per image (tens or hundreds) have huge problems with label rewriting; see the table:

[table image: rewritten-label rates for several datasets]

deep-practice commented 4 years ago

Yes, when I train Objects365, the rewritten rate of YOLOv4 is about 4.3%+.

AlexeyAB commented 4 years ago

@Mastemace Hi,

Thanks, yes, I have already read your article - a very good idea! Almost everything is clear.

  1. Did you compare Poly-YOLO vs YOLACT++ https://arxiv.org/abs/1912.06218 on the same dataset?

Because you are using the same neck and head in YOLOv4 as YOLOv3 has, even your approach includes the principal issues.

  2. We use the same 3 heads with low, medium, and high resolutions, but we don't use the same neck in YOLOv3 vs YOLOv4:

    • YOLOv3: FPN
    • YOLOv4: SPP + PAN
    • new YOLOv4: SPP + CSP-PAN + SAM
  3. Can you show the cfg-file, and what anchors and masks did you use for YOLOv3 on Cityscapes?

  4. What rewritten-labels %, AP, AP50, and FPS can you get for YOLOv3 with recalculated anchors and such a mask:


https://arxiv.org/pdf/2005.13243.pdf

The first drawback is low precision of the detection of big boxes [7] caused by inappropriate handling of anchors in output layers. The second one is rewriting of labels by each-other due to the coarse resolution. To solve these issues, we design a new approach, dubbed Poly-YOLO, that significantly pushes forward original YOLOv3 abilities.

The paper states 2 problems:

  1. inappropriate handling of anchors
  2. rewriting of labels by each-other due to the coarse resolution

It seems both problems can be solved just by using the correct mask= values in each [yolo] layer, without changing the network structure or source code, depending on the rewritten counter for a specific dataset.


The main idea is very good - segmentation using polygons with the scaling parameter, and apparently there are 2 very good advantages compared to mask-based approaches:

  1. Poly-YOLO learns shapes and relative sizes independently; this can increase accuracy and facilitate re-identification using embeddings for objects

  2. dynamic number of vertices per polygon

We also mention that independently of our instance segmentation, PolarMask [25] introduces instance segmentation using polygons, which are also predicted in polar coordinates. In comparison with PolarMask, Poly-YOLO learns itself in general size-independent shapes due to the use of the relative size of a bounding polygon according to the particular bounding box. The second difference is that Poly-YOLO produces a dynamic number of vertices per polygon, according to the shape-complexity of various objects.
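
To make that representation concrete, here is a minimal Python sketch - my own illustration, not the paper's code; in particular, normalizing by half the box diagonal is an assumption - of encoding polygon vertices in polar coordinates relative to their bounding box, so the same shape encodes identically at any scale:

import math

def to_relative_polar(vertices, box):
    # box = (cx, cy, w, h); vertices = [(x, y), ...] in image pixels
    cx, cy, w, h = box
    half_diag = math.hypot(w, h) / 2  # assumed normalizer: half the box diagonal
    polar = []
    for x, y in vertices:
        dx, dy = x - cx, y - cy
        # (relative distance, angle): size-independent shape description
        polar.append((math.hypot(dx, dy) / half_diag, math.atan2(dy, dx)))
    return polar

# a triangle inside a 40x30 box centered at (100, 100)
print(to_relative_polar([(110, 100), (100, 110), (90, 100)], (100, 100, 40, 30)))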


Mastemace commented 4 years ago

Hi, Thank you for your reply @AlexeyAB .

  1. No, we have not so far. We made the comparison only with Mask R-CNN.
  2. The problem of label rewriting lies in the resolution of the output layers in the heads. We have a single output scale with a side resolution of 1/4 of the original input, while YOLO has 1/8, 1/16, and 1/32, regardless of the neck used. It means that in our case, all the anchors have the high-res output available. So compared to the extreme case, the output layer for big objects, the high-res output scale includes (x/4)^2 - (x/32)^2 more cells, where x is the side resolution (see the sketch after this list). And that is a big difference, which leads to a large reduction of the label-rewriting problem; see the table in my previous comment. Of course, to reduce the computational complexity, we use only 6/8 of the filters in the backbone. The result is the same speed but highly increased precision. And the model is smaller, which is a plus. Our model size is 143 MB without the polygon part and 144 MB with it.
  3. Our Poly-YOLO is written in Keras/TensorFlow, so we cannot give you a .cfg, which (if I take it correctly) is a Darknet format.
  4. I'll put them into our code and write you the number of rewritten labels in a few days (right now our HW is fully occupied).
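
A quick numeric check of the cell-count argument in point 2 (Python; the 416 input side is just an example, not a value from the paper):

x = 416  # example input side
cells = {s: (x // s) ** 2 for s in (4, 8, 16, 32)}
print(cells)                 # {4: 10816, 8: 2704, 16: 676, 32: 169}
print(cells[4] - cells[32])  # 10647 extra cells at 1/4 vs 1/32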

I'm not convinced you can solve the two problems just by different anchors/masks. The higher resolution solves the label rewriting, and for that it is necessary to change the architecture. A proper mask can only postpone the problem slightly. The anchor distribution cannot, in general, be solved correctly by a mask at all. In cases like COCO, there is no such problem, which is why YOLOv4 is so great there. But when you have a custom dataset with a lot of boxes (and the current/future trend is to detect hundreds of objects per image) where the objects have similar sizes, so there is no logical split into big/medium/small, the single high-res output scale that contains all the anchors and is created by the hypercolumn (HC) is highly beneficial.

I'm offering you the theory because we have no experience with Darknet, so we are unable to implement it there ourselves. Also, we know that our TF implementation is not perfect, and we are not even reaching maximum performance. It would be nice to see someone implement all the things correctly and create the really best detector for the speed-precision tradeoff.

AlexeyAB commented 4 years ago

@Mastemace

I'll put them into our code and write you the number of rewritten labels in a few days (right now our HW is fully occupied).

Ok. Maybe in that case there will already be a fairly low % of rewritten labels.

I'm not convinced you can solve the two problems just by different anchors/masks. The higher resolution solves the label rewriting, and for that it is necessary to change the architecture. ... But when you have a custom dataset with a lot of boxes (and the current/future trend is to detect hundreds of objects per image) where the objects have similar sizes, so there is no logical split into big/medium/small, the single high-res output scale that contains all the anchors and is created by the hypercolumn (HC) is highly beneficial.

Yes, to use 1/4 resolution, we should change the architecture - add one more block with conv-layers plus one [yolo] layer and assign all masks to it.

  1. But why do you think that 1/4 resolution is the best way for most cases and datasets, and not 1/2 or 1/8?

  2. If we have several heads at 1/2 - 1/64, then we can change the output resolution just by moving all masks to the one [yolo] layer that is suitable for this custom dataset, without changing the architecture - it can be more flexible.

  3. Can you show what anchors you used for each of Poly-YOLO and YOLOv3?

  4. What IoU-NMS-threshold did you use in Poly-YOLO?


  5. Also, maybe you overestimate the problem with rewritten labels and underestimate the problem of NMS. For example, let's assume:

    • we have only 1 [yolo] layer

    • scale of the [yolo] layer = S (i.e. output resolution 1/S, 1/S), i.e. objects in neighboring cells are located at a distance of S pixels from each other

    • network resolution = net_w, net_h

    • bbox-size = box_w, box_h

    • iou-nms-thresh = 0.5

    • then if one bbox is shifted relative to another box (of the same size) by less than box_w/3 pixels, then IoU > 0.5 and the 2 objects will be fused by NMS into 1, so the impact of the NMS problem will be greater than that of the rewritten-labels problem

    • if 2 objects of the same size are at a distance shift from each other only along the x-axis, then IoU = ((box_w - shift)*box_h) / ((box_w + shift)*box_h) = (box_w - shift) / (box_w + shift)

    • if IoU < iou-nms-thresh, then there is no problem with NMS, so it must hold that (box_w - shift) / (box_w + shift) < iou-nms-thresh

    • if the output resolution is 1/S, 1/S, then objects can be located at a distance of S pixels from each other, so it should hold that (box_w - S) / (box_w + S) < iou-nms-thresh to avoid NMS problems (this is a simplified theory where objects overlap only along the x-axis and have the same size)

    • From (box_w - S) / (box_w + S) < iou-nms-thresh we get S > (box_w*(1 - iou-nms-thresh)) / (1 + iou-nms-thresh)

    • So if nms-iou-thresh = 0.5, then it should be S > box_w*0.5/1.5, i.e. S > box_w * 1/3; for example, if box_w = 300, S should be higher than 100, i.e. for a 608x608 network the output resolution should be 608/100 x 608/100, about 6x6; otherwise the impact of the NMS problem will be greater than that of the rewritten-labels problem. I.e. if S = 50 and the final resolution is 12x12, then 2 objects of the same size (300x300) in neighboring cells will be fused into 1 object by NMS

    • So if nms-iou-thresh = 0.5 and S = 4 (output resolution 1/4 x 1/4), then in this layer only anchors with anchor_x < 3*S can be used, i.e. it should be anchor_x < 12 (see the numeric check below)
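
A minimal Python sketch of this simplified theory (the function names are mine), reproducing the numbers above:

def iou_shifted(box_w, shift):
    # IoU of two equal-size boxes shifted by `shift` pixels along the x-axis
    return (box_w - shift) / (box_w + shift)

def min_stride(box_w, nms_thresh):
    # smallest stride S such that objects in neighboring cells are NOT fused by NMS
    return box_w * (1 - nms_thresh) / (1 + nms_thresh)

print(iou_shifted(300, 50))  # ~0.714 > 0.5 -> two 300-px boxes 50 px apart are fused
print(min_stride(300, 0.5))  # 100.0 -> S must exceed box_w / 3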

Conclusion: So it makes no sense to use a small S if you have a small nms-iou-threshold. The higher the nms-iou-threshold, the smaller the S (i.e. the finer the output resolution) you can use.

What do you think about this simplified theory, where objects overlap only along the x-axis and have the same size?


  6. Have you measured FPS including postprocessing (zero-detection removal) and NMS, or without it? Because the final resolution can greatly affect this. When the task of detecting objects is completely solved and runs on embedded devices in real time at 30 FPS, then yes, it will be much more convenient to use a single output layer. Up to that point, there is only a compromise between speed and accuracy, which usually requires several output layers.

  7. At the moment, we can try to solve the problem of a single high-resolution layer - I use postprocessing on the GPU. In this case, the Poly-YOLO approach may be more advantageous. But more tests are needed.

Mastemace commented 4 years ago

Hi, @AlexeyAB

a) The results in our paper are for IoU 0.4, and the FPS includes the NMS functionality. Btw, we are now trying to convert our model to TensorRT, which should increase the computation speed even more, but then the comparison will not be fair.

b) The scale 1/4 is a tradeoff between the number of rewritten labels and the size of a symbolic tensor. A bigger scale (such as 1/2) increases the size of the symbolic tensor too much, so it is necessary to reduce the batch size. We do not have a Titan or V100, etc., so the size of the model in RAM is crucial for us.

c) I'm not sure we are talking about the same thing. In our case, we are talking about the number of rewritten labels during training, so NMS is not involved here. What we are speaking about: a label is taken; the cell where the box has its center is found; the best anchor is found; and the label is assigned to that combination of cell and anchor. If a second object has its center inside the same cell and fits the same anchor, it rewrites the original label, so the first object is not trained to be detected. That confuses the network and reduces its performance. The problem gets worse when you also use augmentation that resizes the image to a smaller version: the objects are shrunk, and there is a higher probability they will be rewritten. That is the issue we want to fix.
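
A minimal Python sketch of that training-time collision (the anchor values and helper names are mine, purely illustrative, not darknet's code):

from collections import defaultdict

ANCHORS = [(16, 13), (28, 22), (54, 41)]  # example (w, h) anchors for one head

def best_anchor(w, h):
    # pick the anchor with the highest IoU against a (w, h) box at the origin
    def iou(a):
        inter = min(w, a[0]) * min(h, a[1])
        return inter / (w * h + a[0] * a[1] - inter)
    return max(range(len(ANCHORS)), key=lambda i: iou(ANCHORS[i]))

def count_rewritten(boxes, stride=32):
    # boxes: (cx, cy, w, h) in pixels; counts labels lost to cell+anchor collisions
    slots = defaultdict(int)
    for cx, cy, w, h in boxes:
        slots[(int(cx // stride), int(cy // stride), best_anchor(w, h))] += 1
    return sum(n - 1 for n in slots.values() if n > 1)

# two similar boxes 10 px apart share cell (3, 3) and the same best anchor:
print(count_rewritten([(100, 100, 30, 25), (110, 105, 28, 24)]))  # -> 1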

d) The problem you are describing happens during inference. Did I understand it correctly? We did not think about it, but it seems that your description is correct and makes sense.

Conclusion: it seems there are two places where labels can be rewritten: during training and during inference.

AlexeyAB commented 4 years ago

@Mastemace Hi,

  1. Ok, IoU=0.4

  2. It would be better if you added a figure with FPS vs. scale and rewritten labels vs. scale on the same chart, for different scales of the output layer: 1/2 - 1/32.

3-4. Yes, you are talking about training, and I am talking about detection. Even having solved the problem during training, the problem during detection remains, and it does disrupt your improvements, especially for large objects: starting from a certain density of objects, large objects will overlap strongly, and close large objects will be fused by NMS.

AlexeyAB commented 4 years ago

@Mastemace

Can you show what anchors you used for each of Poly-YOLO and YOLOv3?

Mastemace commented 4 years ago

@AlexeyAB Hi:)

a) We use the same anchors for YOLOv3 as for Poly-YOLO. They are generated by k-means. For a model with an input size of (416, 832) (h, w), we have the anchors: 4,6, 8,10, 16,13, 8,19, 28,22, 14,31, 54,41, 26,60, 108,108. Here, it would be nice to have more anchors and to better cover the big objects, which are in the minority in the dataset and therefore not so interesting for the k-means algorithm.
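
For reference, a minimal k-means sketch in Python (not the authors' script; the 1 - IoU style distance and the synthetic data are illustrative assumptions):

import random

def iou_wh(a, b):
    # IoU of two (w, h) boxes aligned at the origin
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(whs, k=9, iters=50, seed=0):
    random.seed(seed)
    centers = random.sample(whs, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for wh in whs:
            clusters[max(range(k), key=lambda i: iou_wh(wh, centers[i]))].append(wh)
        centers = [(sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c))
                   if c else centers[i] for i, c in enumerate(clusters)]
    return sorted(centers, key=lambda wh: wh[0] * wh[1])

# real usage feeds all ground-truth (w, h) pairs scaled to the network resolution
labels = [(random.uniform(4, 120), random.uniform(4, 120)) for _ in range(1000)]
print(kmeans_anchors(labels))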

The situation with rewritten labels is totally different for inference and for training. Let me consider objects A and B that occupy the same cell and anchor.

During inference, A rewrites B. So B is not detected. Conclusion: the network detects fewer objects. The situation can be fully avoided by TTA. So no big deal.

During training, A rewrites B. So when confidence loss is computed, the loss is decreased when B is not detected. Conclusion: it may happen that B will not be detected during inference notwithstanding B is not rewritten. Or at least, the confidence of B will be lower.

So, that is the reason why I weigh the problem of label rewriting so heavily. It is a pity to improve the network with Mish, a better backbone, etc., and then feed it with not fully correct data. The data are the most important. In our comparison of YOLOv3 and Poly-YOLO, the impact on mAP is huge (a relative 40%), notwithstanding Poly-YOLO has fewer parameters. Respectively, the increase comes from reduced label rewriting and better-distributed anchors; but the distribution of anchors is a topic for the next talk, we should make this one clear first.

AlexeyAB commented 4 years ago

@Mastemace

a) We use the same anchors for YOLOv3 as for Poly-YOLO. They are generated by k-means. For a model with an input size of (416, 832) (h, w), we have the anchors: 4,6, 8,10, 16,13, 8,19, 28,22, 14,31, 54,41, 26,60, 108,108

Are these anchors for the Cityscapes dataset? For what network resolution were they calculated?

There is no mandatory rule that each layer should have exactly 3 anchors. Have you tried to follow the manual, so that each anchor matches its [yolo]-layer? https://github.com/AlexeyAB/darknet#how-to-improve-object-detection

Only if you are an expert in neural detection networks - recalculate the anchors for your dataset for the width and height from the cfg-file: darknet.exe detector calc_anchors data/obj.data -num_of_clusters 9 -width 416 -height 416, then set the same 9 anchors in each of the 3 [yolo]-layers in your cfg-file. But you should change the indexes of the anchors in masks= for each [yolo]-layer, so that for YOLOv4 the 1st [yolo]-layer has anchors smaller than 30x30, the 2nd smaller than 60x60, and the 3rd the remaining ones, and vice versa for YOLOv3. Also you should change the filters=(classes + 5)*<number of mask> before each [yolo]-layer. If many of the calculated anchors do not fit under the appropriate layers, then just try using all the default anchors.


Try to train yolov3.cfg with masks:

[yolo]
# coarsest head (stride 32 in yolov3.cfg): only the biggest anchor, 108x108
mask = 8
anchors = 4,6, 8,10, 16,13, 8,19, 28,22, 14,31, 54,41, 26,60, 108,108

[yolo]
# middle head (stride 16): the medium anchors 54,41 and 26,60
mask = 6,7
anchors = 4,6, 8,10, 16,13, 8,19, 28,22, 14,31, 54,41, 26,60, 108,108

[yolo]
# finest head (stride 8): the six smallest anchors
mask = 0,1,2,3,4,5
anchors = 4,6, 8,10, 16,13, 8,19, 28,22, 14,31, 54,41, 26,60, 108,108

108x108 = 11664 > 3600 = 60x60, so the anchor 108,108 goes only to the coarsest head (mask = 8)

54x41 = 2214 > 900 = 30x30 (and below 3600), so 54,41 goes to the middle head (mask = 6,7)
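
A tiny Python sketch (my own illustration, using the area thresholds from the comparison above) that reproduces exactly this mask split:

anchors = [(4, 6), (8, 10), (16, 13), (8, 19), (28, 22), (14, 31),
           (54, 41), (26, 60), (108, 108)]

masks = {"small": [], "medium": [], "large": []}
for i, (w, h) in enumerate(anchors):
    if w * h < 30 * 30:
        masks["small"].append(i)   # finest head, mask = 0,1,2,3,4,5
    elif w * h < 60 * 60:
        masks["medium"].append(i)  # middle head, mask = 6,7
    else:
        masks["large"].append(i)   # coarsest head, mask = 8

print(masks)  # {'small': [0, 1, 2, 3, 4, 5], 'medium': [6, 7], 'large': [8]}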


During inference, A rewrites B. So B is not detected. Conclusion: the network detects fewer objects. The situation can be fully avoided by TTA. So no big deal.

What is TTA?

During training, A rewrites B. So when confidence loss is computed, the loss is decreased when B is not detected. Conclusion: it may happen that B will not be detected during inference notwithstanding B is not rewritten. Or at least, the confidence of B will be lower.

Since the receptive field of YOLO is very high, then:

In general, I agree that these two problems overlap but are not the same. A solution to at least one of them will give an improvement. But it is not known:

  1. which solution exactly gives the greatest improvement?
  2. and what is the price in performance for every improvement?

The data are the most important. In our comparison of YOLOv3 and Poly-YOLO, the impact on mAP is huge (a relative 40%), notwithstanding Poly-YOLO has fewer parameters. Respectively, the increase comes from reduced label rewriting and better-distributed anchors; but the distribution of anchors is a topic for the next talk, we should make this one clear first.

I think such a big difference is due to an incorrect setting of masks (improper use of anchors), as I described above.