Open marvision-ai opened 3 years ago
Is there a way to do this on the yolov4-tiny models? Or is this specific to yolo-v4 only?
Note I asked this exact same question a while back in issue #6274. Curious to know what the answer is.
Meanwhile, if you are looking for small objects, you may also want to look at yolov4-tiny-3l, which is similar to yolov4-tiny but has 3 YOLO layers instead of 2.
@stephanecharette yes, this is using yolov4-tiny-3l. 😊
I'm just surprised that custom anchors don't help as much as I expected.
How small are your objects, what sizes are your images, and what sizes are you using for the network?
Images: 1120x960
Network size: 1120x960
Object sizes: from 8x10 to 25x25, and everything in between.
In the project I just finished last week, I was detecting objects that were between 13x13 and 30x30. At the low end I was worried about the images with the tiny bounding boxes, but using YOLOv4-tiny-3l they turned out great. Some images had ~150 of those tiny objects per image. Example crop:
Now I wasn't using the whole image. From the original 1280x960 image I crop a specific 832x352 RoI, and the neural network dimensions are 832x352. I don't know if that makes a difference, as I've never tried to change the anchors in any network I've trained.
@stephanecharette looks awesome! I agree, my tiny model works great, but I'm trying to push the bounds and limits to see what the network is truly capable of, you know?
I pose this question to understand why custom anchors don't really make a big difference...
Question for you: how do you deal with overlapping boxes in the crop regions?
What do you mean "deal with"? Darknet should handle it just fine. Here is an example image:
And this is what it looks like after detection and annotation:
@stephanecharette very cool! It's good to see how it can detect the two objects that have partial occlusion.
I guess we will wait to see if @AlexeyAB can shed some light on pushing the detection accuracy further.
Hi @stephanecharette ,
Great results, well done!
Could I ask you a couple of questions, please? I can see you used an image size of 832x352. Am I right in thinking that if you pick a size like this, Darknet automatically pads the image so that it becomes a square? So far I've been picking equal width and height, so perhaps I was wrong.

I am currently working on a project where we want to detect company logos on TV. They tend to be quite small and occupy only 1-5% of the total frame area. I am currently training the basic v4-tiny for the job (not the upgraded one with 3 YOLO layers, because the logos are not too small; I might try it later), with a network size of 608x608. The resolution we are working with is 960x536. Do you reckon I could train the net for the image size we are actually working with (960x356)?

So far I've been thinking that if I pick a network size of, say, 608x608 and train on rectangular images, Darknet will pad them, keeping the aspect ratio intact. Now I am curious whether your approach is better, and if so, why. I would be interested to know your opinion.
Thanks.
P.S. @AlexeyAB thanks for your hard work. You can't even imagine how many people use your work. Spasibo :)
> I can see you used the image size of 832x352. Am I right in thinking that if you pick the size like this, Darknet automatically pads the image so that it becomes a square?
No, you can define your neural network to be whatever size you want, as long as the width and height are divisible by 32. So I define the network to be 832x352, and my images are also 832x352, meaning no resizing is required. See the [net] section of the .cfg where the width and height are defined.
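For reference, here is a minimal sketch of the relevant [net] settings for the 832x352 example above (all other settings in the section are omitted):

```
[net]
# Both dimensions must be divisible by 32.
width=832
height=352
channels=3
```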
If your images do not match the network size, then Darknet resizes them. Aspect ratio is NOT kept when resizing, unless you have enabled the old "letterbox" option. That option isn't really used by many people anymore; most often the images are simply resized regardless of the aspect ratio. I've seen some issues raised recently where certain new features don't work with the "letterbox" option, because no one has tested it.
> So far I've been picking equal width and height sizes; probably I was wrong.
For the longest time I also thought the images had to be square. This is not explained very well (at all?) in the readme.
> with a network size of 608x608. The resolution we are working with is 960x536. Do you reckon I could train the net for the image size we are actually working with (960x356)?
You cannot use 960x356, but you could use 960x352. Remember, both values have to be a multiple of 32.
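To sanity-check candidate dimensions, you can snap any desired size to the nearest multiple of 32. This is my own little helper, not part of Darknet:

```python
def snap_to_32(value: int) -> int:
    """Round a dimension to the nearest multiple of 32 (Darknet requirement)."""
    return max(32, round(value / 32) * 32)

# The 960x356 size discussed above:
print(snap_to_32(960), snap_to_32(356))  # 960 352
```

Note that the snapped value can round up as well as down, so always double-check the result against your actual frame dimensions.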
> So far I've been thinking that if I pick an image size of, say, 608x608 and train on rectangular images, Darknet will pad them, keeping the aspect ratio intact. Now I am curious if your approach is better and if yes, why. Would be interested to know your opinion.
I was particularly worried because so many of the items this customer needed me to find/identify were very small, around 13x13 pixels, which I knew was going to stretch the limits of Darknet/YOLO. So I wanted as little resizing as possible, and I was certain I didn't want the objects to be stretched in either direction, which is why I chose to crop to the RoI I mention above.
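To make that concrete with the numbers above, here is my own back-of-the-envelope sketch of what would happen to a 13x13 object if, instead of cropping, a full 1280x960 frame were stretched straight down to an 832x352 network:

```python
# Per-axis scale factors if a full 1280x960 frame is stretched to 832x352.
sx = 832 / 1280   # 0.65
sy = 352 / 960    # ~0.367

obj_w, obj_h = 13, 13  # a typical tiny object from the project described above
print(f"{obj_w * sx:.2f} x {obj_h * sy:.2f}")  # roughly 8.45 x 4.77 pixels
```

The object would shrink unevenly to under 9x5 pixels, which is why cropping an RoI that matches the network size exactly avoids both the shrink and the distortion.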
Hi @stephanecharette ,
Thanks for your reply!
So, are you saying that when we use YOLO for inference, during image/frame preprocessing we should not be resizing images keeping the aspect ratio by padding them? Should we just resize them to the size the network was trained on?
Thanks a lot!
What I'm saying is if the image size doesn't match the network size, then Darknet will automatically resize each image/frame as it processes it. And when it resizes it, Darknet ignores the aspect ratio and stretches it whatever way it must to make it fit.
So if you have very precise images you are using, for example in a controlled environment on a factory floor (like what I was doing) then you may as well attempt to match the image sizes and network size. This will ensure the most accurate and fastest processing, as no time is wasted resizing each frame.
If you are releasing general-purpose weight files and configurations which are then used by people with their webcams, dashcams, DSLRs, etc, and all of them have different sizes and aspect ratios, then pick some reasonable values and live with the fact that Darknet will be resizing images. Personally, I find it strange that the Darknet default is a perfectly square image, as no consumer-grade camera of any sort that I know of has a 1:1 aspect ratio. (Some high-end commercial cameras are 1:1.)
I suspect the current Darknet defaults may be due to some standard image datasets which use square images.
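The two resize behaviours described above can be sketched as follows. This is my own illustration (the helper names are not Darknet API): the default path scales each axis independently, while the old "letterbox" path uses one uniform scale and pads the remainder.

```python
def stretch_scales(img_w, img_h, net_w, net_h):
    """Default Darknet behaviour: independent per-axis scaling, aspect ratio ignored."""
    return net_w / img_w, net_h / img_h

def letterbox_fit(img_w, img_h, net_w, net_h):
    """Old 'letterbox' behaviour: one uniform scale, remainder filled with padding."""
    scale = min(net_w / img_w, net_h / img_h)
    new_w, new_h = round(img_w * scale), round(img_h * scale)
    return scale, net_w - new_w, net_h - new_h

# A 960x536 frame into a 608x608 network:
print(stretch_scales(960, 536, 608, 608))  # x is squeezed while y is stretched
print(letterbox_fit(960, 536, 608, 608))   # uniform scale, vertical padding
```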
I agree with what @stephanecharette mentioned. I use it for the same purposes as him, and always train my networks to match the size I'm inferring at. This has always given me the best accuracy and most robust inference at test time.
Thanks for your replies @stephanecharette and @marvision-ai
Hello, I have also been working on small-object detection recently, but the training results are not very good. Could it be a problem with my dataset?
Hello @AlexeyAB, thank you for such a great repo. I have a quick question:

I am in the process of detecting 4 types of small objects. I have been going through all the extra steps to increase performance. I calculated these custom anchors:

anchors = 9,11, 17,17, 15,65, 31,34, 41,61, 44,121, 88,74, 99,123, 180,144

I took what you said about custom anchors and applied it to my .cfg, but I am not getting much of an increase in performance (1%) compared to the original anchors. In my .cfg I changed `filters=(classes + 5)*<number of mask>`, and I made sure to assign the largest anchors to the first YOLO layer and the smallest anchors to the last.

Three questions:

1. The mAP barely improves. Is there something I did not implement correctly?
2. Is there a reason we detect the largest anchors first (>60x60 → >30x30 → <30x30)? I read somewhere that this order does not matter.
3. In the case of the (9,11) anchor, should I just ignore it (too small) and have the last layer show `mask = 1`?

I also want to implement the following suggestions:

Is there a way to do this on the yolov4-tiny models? Or is this specific to yolo-v4 only?
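Assuming a 3-YOLO-layer config such as yolov4-tiny-3l, the bookkeeping for custom anchors can be sketched like this. The parsing helper is my own; it splits the 9 anchor pairs listed above into 3 mask groups (Darknet lists anchors smallest-to-largest, with the first [yolo] layer taking the largest) and computes the matching `filters=` value for 4 classes:

```python
ANCHORS = "9,11, 17,17, 15,65, 31,34, 41,61, 44,121, 88,74, 99,123, 180,144"
NUM_CLASSES = 4

# Parse "w,h" pairs from the anchors string.
values = [int(v) for v in ANCHORS.replace(" ", "").split(",")]
pairs = list(zip(values[0::2], values[1::2]))   # 9 (w, h) anchor pairs

# Each [yolo] layer's mask= selects which anchors it uses. In a 3-layer
# config, the first (coarsest-grid) layer gets the largest anchors.
masks_per_layer = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]

# The [convolutional] layer before each [yolo] layer needs
# filters = (classes + 5) * <number of masks in that layer>.
filters = (NUM_CLASSES + 5) * len(masks_per_layer[0])
print(pairs)
print("filters =", filters)   # filters = 27
```

Checking the computed value against what you hand-edited into the .cfg is a quick way to rule out the most common cause of "custom anchors made no difference" (a mismatched `filters=` line).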