AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

Custom anchors and small objects with yolov4-tiny-3l #6548

Open marvision-ai opened 3 years ago

marvision-ai commented 3 years ago

Hello @AlexeyAB thank you for such a great repo. I have a quick question:

I am in the process of detecting 4 types of small objects and have been going through all the extra steps to increase performance.

I calculated these custom anchors: anchors = 9, 11, 17, 17, 15, 65, 31, 34, 41, 61, 44,121, 88, 74, 99,123, 180,144

Custom anchors

Only if you are an expert in neural detection networks - recalculate anchors for your dataset for width and height from cfg-file: darknet.exe detector calc_anchors data/obj.data -num_of_clusters 9 -width 416 -height 416 then set the same 9 anchors in each of 3 [yolo]-layers in your cfg-file. But you should change indexes of anchors masks= for each [yolo]-layer, so for YOLOv4 the 1st-[yolo]-layer has anchors smaller than 30x30, 2nd smaller than 60x60, 3rd remaining, and vice versa for YOLOv3. Also you should change the filters=(classes + 5)*<number of mask> before each [yolo]-layer. If many of the calculated anchors do not fit under the appropriate layers - then just try using all the default anchors.

I took what you said and applied it to my .cfg, but I am only getting about a 1% increase in performance compared to the original anchors.

Here is the relevant portion of my .cfg. I changed filters=(classes + 5)*<number of mask> before each [yolo] layer, and I made sure to put the largest anchors in the first layer and the smallest anchors in the last.

[convolutional]
size=1
stride=1
pad=1
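# filters = (classes + 5) * <number of mask> = (4 + 5) * 4 = 36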
filters=36
activation=linear

[yolo]
mask = 5,6,7,8
anchors = 9, 11,  17, 17,  15, 65,  31, 34,  41, 61,  44,121,  88, 74,  99,123, 180,144
classes=4
num=9
jitter=.3
scale_x_y = 1.05
cls_normalizer=1.0
iou_normalizer=0.07
iou_loss=ciou
ignore_thresh = .7
truth_thresh = 1
random=1
resize=1.5
nms_kind=greedynms
beta_nms=0.6

[route]
layers = -4

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=2

[route]
layers = -1, 23

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[convolutional]
size=1
stride=1
pad=1
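# filters = (classes + 5) * <number of mask> = (4 + 5) * 3 = 27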
filters=27
activation=linear

[yolo]
mask = 2,3,4
anchors = 9, 11,  17, 17,  15, 65,  31, 34,  41, 61,  44,121,  88, 74,  99,123, 180,144
classes=4
num=9
jitter=.3
scale_x_y = 1.05
cls_normalizer=1.0
iou_normalizer=0.07
iou_loss=ciou
ignore_thresh = .7
truth_thresh = 1
random=1
resize=1.5
nms_kind=greedynms
beta_nms=0.6

[route]
layers = -3

[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=2

[route]
layers = -1, 15

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[convolutional]
size=1
stride=1
pad=1
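# filters = (classes + 5) * <number of mask> = (4 + 5) * 2 = 18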
filters=18
activation=linear

[yolo]
mask = 0,1
anchors = 9, 11,  17, 17,  15, 65,  31, 34,  41, 61,  44,121,  88, 74,  99,123, 180,144
classes=4
num=9
jitter=.3
scale_x_y = 1.05
cls_normalizer=1.0
iou_normalizer=0.07
iou_loss=ciou
ignore_thresh = .7
truth_thresh = 1
random=1
resize=1.5
nms_kind=greedynms
beta_nms=0.6

3 Questions:

1. The mAP barely improves. Is there something I did not implement correctly?

2. Is there a reason we detect the largest anchors first, i.e. (>60x60) -> (>30x30) -> (<30x30)? I read somewhere that this order does not matter.

3. In the case of the (9, 11) anchor, should I just ignore it (too small) and have the last layer use mask = 1?
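
To make question 3 concrete, this is roughly what I mean (just a sketch, not something I have trained; the other [yolo] parameters would stay the same, and filters follows the (classes + 5) * <number of mask> rule):

[convolutional]
size=1
stride=1
pad=1
# (4 + 5) * 1 remaining anchor in this head's mask = 9
filters=9
activation=linear

[yolo]
mask = 1
anchors = 9, 11,  17, 17,  15, 65,  31, 34,  41, 61,  44,121,  88, 74,  99,123, 180,144
classes=4
num=9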

I also want to implement the following suggestions:

for training for small objects (smaller than 16x16 after the image is resized to 416x416) - set layers = 23 instead of https://github.com/AlexeyAB/darknet/blob/6f718c257815a984253346bba8fb7aa756c55090/cfg/yolov4.cfg#L895

set stride=4 instead of https://github.com/AlexeyAB/darknet/blob/6f718c257815a984253346bba8fb7aa756c55090/cfg/yolov4.cfg#L892

set stride=4 instead of https://github.com/AlexeyAB/darknet/blob/6f718c257815a984253346bba8fb7aa756c55090/cfg/yolov4.cfg#L989
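
If I understand those two edits correctly, the idea is to upsample more aggressively and route from an earlier, higher-resolution layer, roughly like this (only a sketch; the line numbers refer to yolov4.cfg and I have not verified the exact surrounding layers):

[upsample]
# was stride=2 at the linked lines
stride=4

[route]
# was a deeper layer index at the linked line
layers = 23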

Is there a way to do this on the yolov4-tiny models? Or is this specific to yolo-v4 only?

stephanecharette commented 3 years ago

Is there a way to do this on the yolov4-tiny models? Or is this specific to yolo-v4 only?

Note I asked this exact same question a while back in issue #6274. Curious to know what the answer is.

Meanwhile, if you are looking for small objects, you may also want to look at yolov4-tiny-3l, which is similar to yolov4-tiny but has 3 YOLO layers instead of 2.

marvision-ai commented 3 years ago

@stephanecharette yes, this is using yolov4-tiny-3l. 😊

I'm just surprised the custom anchors don't help as much as I expected.

stephanecharette commented 3 years ago

How small are your objects, what sizes are your images, and what sizes are you using for the network?

marvision-ai commented 3 years ago

Images: 1120 x 960
Network size: 1120 x 960
Object sizes: range from 8x10 to 25x25 and everything in between.

stephanecharette commented 3 years ago

In the project I just finished last week, I was detecting objects that were between 13x13 and 30x30. At the low end I was worried about the tiny bounding boxes, but using YOLOv4-tiny-3l they turned out great. Some images had ~150 of those tiny objects. Example crop:
[image: example crop]

Now, I wasn't using the whole image. From the original 1280x960 image I crop a specific 832x352 RoI, and the neural network dimensions are 832x352. I don't know if that makes a difference, as I've never tried to change the anchors in any network I've trained.

marvision-ai commented 3 years ago

@stephanecharette looks awesome! I agree, my tiny model works great, but I'm trying to push its limits and see what the network is truly capable of, you know?

I pose this question to understand why custom anchors don't really make a big difference...

stephanecharette commented 3 years ago

Question for you: how do you deal with overlapping boxes in the crop regions?

What do you mean "deal with"? Darknet should handle it just fine. Here is an example image:

[image: example crop region]

And this is what it looks like after detection and annotation:

[image: the same crop after detection and annotation]

marvision-ai commented 3 years ago

@stephanecharette very cool! It's good to see how it can detect the two objects that have partial occlusion.

I guess we will wait to see if @AlexeyAB can shed some light on pushing the detection accuracy further.

EvgeniiTitov commented 3 years ago

Hi @stephanecharette ,

Great results, well done!

Could I ask you a couple of questions, please?

I can see you used the image size of 832 x 352. Am I right in thinking that if you pick a size like this, Darknet automatically pads the image so that it becomes a square? So far I've been picking equal width and height sizes; probably I was wrong.

I am currently working on a project at work where we want to detect company logos on TV. They tend to be quite small and occupy only 1-5% of the total frame area. I am currently training the basic v4-tiny for the job (not the upgraded one with 3 yolo layers, because the logos are not too small; I might try it later), with the image size of 608 x 608. The resolution we are working with is 960 x 536. Do you reckon I could train the net for the image size we are actually working with (960 x 356)?

So far I've been thinking that if I pick an image size of, say, 608 x 608 and train on rectangular images, Darknet will pad them, keeping the aspect ratio intact. Now I am curious whether your approach is better and, if so, why. I would be interested to know your opinion.

Thanks.

P.S. @AlexeyAB thanks for your hard work. You can't even imagine how many people use your work. Spasibo :)

stephanecharette commented 3 years ago

I can see you used the image size of 832 x 352. Am I right in thinking that if you pick a size like this, Darknet automatically pads the image so that it becomes a square?

No, you can define your neural network to be whatever size you want, as long as the width and height are divisible by 32. So I define the network to be 832x352, and my images are also 832x352, so there is no resizing required. See the [net] section of the .cfg where the width and height are defined.
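
For example, the relevant lines of my [net] section look like this (the rest of [net] is omitted):

[net]
# both values must be divisible by 32
width=832
height=352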

If your images do not match the network size, then Darknet resizes them. Aspect ratio is NOT kept when resizing the images, unless you have enabled the old "letterbox" option. But that option isn't really used by many people anymore; most often the images are simply resized regardless of the aspect ratio. I've seen some issues raised recently where certain new features don't work with the "letterbox" option because no one has tested it.
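
If I remember correctly, the old letterbox behaviour is toggled by a flag in [net], something like the line below, but treat the exact name as my recollection and double-check against the cfg files in the repo:

[net]
# pad to keep aspect ratio instead of stretching; rarely used these days
letter_box=1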

So far I've been picking equal width and height sizes; probably I was wrong.

For the longest time I also thought the images had to be square. This is not explained very well (at all?) in the readme.

with the image size of 608 x 608. The resolution we are working with is 960 x 536. Do you reckon I could train the net for the image size we are actually working with (960 x 356)?

You cannot use 960x356, but you could use 960x352. Remember, both values have to be a multiple of 32 (960 = 30 x 32 and 352 = 11 x 32, whereas 356 is not divisible by 32).

So far I've been thinking that if I pick an image size of, say, 608 x 608 and train on rectangular images, Darknet will pad them, keeping the aspect ratio intact. Now I am curious whether your approach is better and, if so, why. I would be interested to know your opinion.

I was particularly worried because so many of the items this customer needed me to find/identify were very small, around 13x13 pixels, which I knew was going to stretch the limits of Darknet/YOLO. So I wanted as little resizing as possible, and I was certain I didn't want the objects to be stretched in either direction, which is why I chose to crop to the RoI I mention above.

EvgeniiTitov commented 3 years ago

Hi @stephanecharette ,

Thanks for your reply!

So, are you saying that when we use YOLO for inference, we should not be resizing images during preprocessing by padding them to keep the aspect ratio? Should we instead just resize them to the size the network was trained on?

Thanks a lot!

E

stephanecharette commented 3 years ago

What I'm saying is that if the image size doesn't match the network size, then Darknet will automatically resize each image/frame as it processes it. And when it resizes, Darknet ignores the aspect ratio and stretches the image in whatever way it must to make it fit.

So if your images are very consistent, for example from a controlled environment on a factory floor (like what I was doing), then you may as well match the image size to the network size. This ensures the most accurate and fastest processing, as no time is wasted resizing each frame.

If you are releasing general-purpose weight files and configurations which are then used by people with their webcams, dashcams, DSLRs, etc, and all of them have different sizes and aspect ratios, then pick some reasonable values and live with the fact that Darknet will be resizing images. Personally, I find it strange that the Darknet default is a perfectly square image, as no consumer-grade camera of any sort that I know of has a 1:1 aspect ratio. (Some high-end commercial cameras are 1:1.)

I suspect the current Darknet defaults may be due to some standard image datasets which use square images.

marvision-ai commented 3 years ago

I agree with what @stephanecharette mentioned. I use it for the same purposes as him, and always train my networks to match the size I'm inferring at. This has always given me the best accuracy and most robust inference at test time.

EvgeniiTitov commented 3 years ago

Thanks for your replies @stephanecharette and @marvision-ai

WANGCHAO1996 commented 3 years ago

(Quoting @stephanecharette's example above.)

Hello, I have also been doing small-object detection recently, but the training results are not very good. Could it be a problem with my dataset?