WongKinYiu / PartialResidualNetworks

partial residual networks

About yolo mask #2

Open Libero-zz opened 5 years ago

Libero-zz commented 5 years ago

Why are you using

[yolo]
mask = 3,4,5
...
[yolo]
mask = 1,2,3

instead of

[yolo]
mask = 3,4,5
...
[yolo]
mask = 0,1,2

Besides, is there any plan to release Pelee-PRN? Thanks for the great work.

WongKinYiu commented 5 years ago

The author of YOLOv3 also uses

[yolo]
mask = 3,4,5
...
[yolo]
mask = 1,2,3

I just follow the setting of the original code.

Currently, we have no plan to release Pelee-PRN. For Pelee-YOLOv3, you can try our partner's https://github.com/eric612/Yolo-Model-Zoo, which is trained with the Caffe framework: https://github.com/eric612/MobileNet-YOLO

Libero-zz commented 5 years ago

Well, I can't see any sign of the original author using

[yolo]
mask = 1,2,3

I checked all the .cfg files by pjreddie and AlexeyAB and couldn't find them using duplicate anchors. IMHO, you could get some gains on detecting small objects by using

[yolo]
mask = 0,1,2

because you would finally be using the (10,14) anchor; the drawback may be a small drop on the (81,82) one, but it's worth a try.

WongKinYiu commented 5 years ago

@Libero-zz Hello,

You can see the first commit of yolov3-tiny.cfg https://github.com/pjreddie/darknet/commits/master/cfg/yolov3-tiny.cfg

The yolov3-tiny weight file provided by the author was trained using mask = 1,2,3. https://pjreddie.com/darknet/yolo/

Libero-zz commented 5 years ago

Yeah, the author might have made a mistake (https://github.com/pjreddie/darknet/commit/481d9d98abc8ef1225feac45d04a9514935832bf#r28887402) and changed it to 0,1,2 later in this commit: https://github.com/pjreddie/darknet/commit/f86901f6177dfc6116360a13cc06ab680e0c86b0#diff-2b0e16f442a744897f1606ff1a0f99d3

WongKinYiu commented 5 years ago

For a fair comparison, I should use mask = 1,2,3 & mask = 3,4,5 in yolov3-tiny-PRN, since the yolov3-tiny results (33.1% mAP@0.5) were obtained with this setting.

For the YOLOv3-PRN experiments, I use mask = 0,1,2 & mask = 3,4,5 & mask = 6,7,8, also because the author uses mask = 0,1,2 & mask = 3,4,5 & mask = 6,7,8 in YOLOv3.

Arcitec commented 5 years ago

@WongKinYiu Okay, but mask = 1,2,3 & mask = 3,4,5 is a bug. It tells darknet to use anchor 3 twice. The original author fixed it in the original config, as @Libero-zz showed.

And I am not sure that the author's original results were with the buggy mask.

Anyway thanks a lot for your research about PRN and design of this tiny yolo config! :-)

WongKinYiu commented 5 years ago

Well... as I see it, it was a feature, not a bug.

You can simply calculate the grid size of the second pyramid: 416/26 = 16. This implies that this scale can only handle anchors down to a minimum size of about 16x16, and the anchors are 10,14, 23,27, 37,58, 81,82, 135,169, 344,319. (However, the author calculated the anchors for a 448x448 input, as I remember.)
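As a minimal sketch of the arithmetic above (plain C, not darknet code; the 416x416 input size and anchor list are the ones from this comment), the following checks which anchors are at least as large as the stride of the 26x26 level:

```c
#include <stdio.h>

int main(void) {
    const int input = 416;
    const int grid = 26;                 /* second pyramid level of yolov3-tiny */
    const int stride = input / grid;     /* 416 / 26 = 16 */
    const int anchors[6][2] = {
        {10, 14}, {23, 27}, {37, 58}, {81, 82}, {135, 169}, {344, 319}
    };

    printf("stride of the %dx%d grid: %d\n", grid, grid, stride);
    for (int i = 0; i < 6; i++) {
        int ok = anchors[i][0] >= stride && anchors[i][1] >= stride;
        printf("anchor %d (%d,%d): %s\n", i, anchors[i][0], anchors[i][1],
               ok ? "fits this scale" : "smaller than the stride");
    }
    return 0;
}
```

Anchor 0 (10,14) is the only one flagged as smaller than the 16-pixel stride, which is the reason given above for preferring anchors of at least 16x16 at that scale.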

In my experiments, I have trained yolov3-tiny more than five times on COCO using both mask = 1,2,3 & mask = 3,4,5 and mask = 0,1,2 & mask = 3,4,5, and mask = 1,2,3 & mask = 3,4,5 always gets better results.

Arcitec commented 5 years ago

In my experiments, I have trained yolov3-tiny more than five times on COCO using both mask = 1,2,3 & mask = 3,4,5 and mask = 0,1,2 & mask = 3,4,5, and mask = 1,2,3 & mask = 3,4,5 always gets better results.

Wow, really? Very interesting. How are the results better? (Better mAP%? Faster network?)

Did you do full training from empty weights when you compared mask 0,1,2 vs mask 1,2,3? (This matters because if you reuse weights trained with 1,2,3, the comparison would obviously be biased toward 1,2,3.)

PS: This is weird because mask = 0,1,2 was clearly what the author @pjreddie wanted to type, since he "fixed" that in a commit... :-S

WongKinYiu commented 5 years ago

Better mAP.

Also, you can find the results of yolov3-tiny at the link you posted: https://www.youtube.com/watch?v=j5WstN4VWVU The bounding boxes of small objects are quite a bit smaller than the real objects. That is because the weight file was trained using 1,2,3 but the test cfg file uses 0,1,2.

I trained the models from ImageNet pre-trained weights.

Yes, the paper says "our system only assigns one bounding box prior for each ground truth object", so mask = 0,1,2 is more reasonable.

Arcitec commented 5 years ago

Ohh okay!

For any future readers, we're talking in two tickets at the same time now... and here's the thing WongKinYiu is referring to: https://github.com/AlexeyAB/darknet/issues/3380#issuecomment-542217703

So the official yolov3-tiny results are bad because the weights were trained with mask = 1,2,3 and the config file was later "corrected" to mask = 0,1,2, which means the network config and the weights no longer match, so the bounding boxes come out wrong... Wow, big oops by @pjreddie... If he's still active on GitHub, I would love for him to reply here. The official "tiny" imagenet weights file should be re-trained with a fixed 0,1,2 config...

Yes, the paper says "our system only assigns one bounding box prior for each ground truth object", so mask = 0,1,2 is more reasonable.

Ah yes the YOLO paper confirms that mask = 1,2,3 was a typo/bug.

Better mAP. I trained the models from ImageNet pre-trained weights.

Okay, so you used transfer learning. My worry is that if you have pretrained weights made with 1,2,3 and you transfer-learn them into a "fixed" 0,1,2 net, there is still some bias towards 1,2,3, which would mean 1,2,3 gets better mAP...


So the final question I have is: Which imagenet pretrained weight did you use? The @pjreddie file with mask=123, or did you train your own "prn" imagenet weights from empty weights first?

Because I am super interested in knowing what YOLOv3-Tiny-PRN's mAP performance is with mask=0,1,2 with imagenet weights trained with mask=0,1,2 too. Maybe the accuracy goes up even more when the "mask bug" is "fixed"?! ;-)

WongKinYiu commented 5 years ago

We do not need to re-train ImageNet for a different mask setting.

You can follow https://github.com/AlexeyAB/darknet#how-to-train-tiny-yolo-to-detect-your-custom-objects OR use the weight file of the Darknet Reference model instead of yolov3-tiny.weights

Arcitec commented 5 years ago

@WongKinYiu Yeah we don't "need" to retrain it and we can still load the same weights with different mask settings...

But doesn't the mAP go down if we re-use old weights that were trained with a different mask? Because those weights are tuned to provide neural network outputs that give good mAP results with the originally trained mask, and very bad results if you change the cfg.

Exactly as you showed me... the yolov3-tiny bounding boxes became wrong in that video we looked at when a different mask (0,1,2) was used instead of the mask the weights were trained with (1,2,3)... at least that's what I thought you said...

So all I am saying is: if we change the mask to 0,1,2 to "fix" the cfg file, I think we also need to fully re-train imagenet so that the core weights are tuned for that "fixed" mask setting.

Here's my theory/understanding so far:

Right now we only have weight files trained with mask=1,2,3... so we can't change the mask value in the config. I think we need to re-train the imagenet weights with mask=0,1,2... This should fix the mAP issues, and it may even give better mAP since an extra anchor (the 0th one) becomes available once the config is fixed...

What do you think? Or is my understanding here wrong?

WongKinYiu commented 5 years ago

These are correct, but we need to fully re-train COCO, not ImageNet. We didn't consider the masks when training ImageNet.

Arcitec commented 5 years ago

@WongKinYiu Ohhhh, okay, really sorry for mixing up the words ImageNet and COCO, I forgot that your model weights were trained on COCO (with transfer learning from the 123 ImageNet weights)! My bad. :-/

But yeah, all of this makes me wonder if fully re-training the COCO weights with 0,1,2 will make your PRN network perform even better... it depends on what the extra anchor (0) gives us! More mAP, perhaps? Are you curious to do this testing? I have an RTX 2070 laptop, which is not really good enough to fully retrain a net quickly, so I always do transfer learning myself.

Arcitec commented 5 years ago

Okay according to @pjreddie the mask controls which bounding boxes the layer is responsible for detecting.

The lower mask numbers are the smallest objects.

So the answer is: if we fix mask = 0,1,2, we should get better detection of small objects (because we enable anchor 0, the smallest box, which currently isn't used at all in the broken config), and overall we should get higher mAP because two layers would no longer write to the same box mask (box 3 is written to by two layers in the broken config).

https://github.com/pjreddie/darknet/issues/558

Arcitec commented 5 years ago

Okay, we just had a conversation in a different thread, and it emerged that you've already trained on COCO with both 0,1,2 and 1,2,3: https://github.com/WongKinYiu/PartialResidualNetworks/issues/7#issuecomment-544954638

I would have to check the darknet source code to know for sure, but the 1,2,3 mask has two problems: the smallest anchor (0) is never used at all, and anchor 3 is reused by both [yolo] layers.

Alright so with that technical stuff explained... if you're getting better mAP by losing the 0th anchor (smallest items), perhaps that means that YOLO is IMPROVED by removing that anchor...

Arcitec commented 5 years ago

Parses mask=... cfg text into an array of integers:

https://github.com/AlexeyAB/darknet/blob/master/src/parser.c#L318-L337

Creates a yolo layer with mask param set to those masks:

https://github.com/AlexeyAB/darknet/blob/master/src/yolo_layer.c#L13

The rest of yolo_layer.c seems to use the mask value to load the biases box for the anchor at position N, where N is the mask number.

So my earlier guess was wrong. The layers do not overwrite each other's mask result. They simply load the same anchor (bias). No overwriting, as far as I can see. Each YOLO layer has its own independent array copy of the anchor integers and mask integers.

So, what a cfg with "mask = 1,2,3 and mask = 3,4,5" does is simply this: one [yolo] layer loads anchors 1, 2, 3 and the other loads anchors 3, 4, 5; anchor 0 is never loaded, and anchor 3 is loaded by both layers.

So if my incomplete analysis of the code is correct, there is no problem with reusing a mask index, except that you lose detections at the smallest object size. (You lose the smallest anchor completely.)

And instead, you reuse anchor 3, a "medium size object" anchor, twice, in both YOLO layers.

It's very possible that this improves accuracy because it lets both layers detect medium-size objects. So perhaps this actually optimizes the network for average-sized objects (it loses small objects, but maybe they don't matter)...
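A minimal sketch of that reading of the code (plain C, not the actual yolo_layer.c implementation; the anchor list is the yolov3-tiny one): each [yolo] layer keeps the full anchor list, and its mask entries are simply indices into it.

```c
#include <stdio.h>

/* Full anchor list shared by every [yolo] layer in yolov3-tiny. */
static const int anchors[6][2] = {
    {10, 14}, {23, 27}, {37, 58}, {81, 82}, {135, 169}, {344, 319}
};

/* Print the anchors a layer actually uses: mask entries are just indices. */
static void print_layer_anchors(const char *name, const int *mask, int n) {
    printf("%s uses:", name);
    for (int i = 0; i < n; i++)
        printf(" (%d,%d)", anchors[mask[i]][0], anchors[mask[i]][1]);
    printf("\n");
}

int main(void) {
    const int coarse[3] = {3, 4, 5};  /* [yolo] layer with mask = 3,4,5 */
    const int fine[3]   = {1, 2, 3};  /* [yolo] layer with mask = 1,2,3 */

    print_layer_anchors("mask = 3,4,5", coarse, 3);
    print_layer_anchors("mask = 1,2,3", fine, 3);
    /* Anchor 0 (10,14) is never selected; anchor 3 (81,82) is selected twice. */
    return 0;
}
```

Running it shows that anchor 0 (10,14) is never selected and anchor 3 (81,82) is selected by both layers, matching the description above.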

Arcitec commented 5 years ago

If this code analysis is correct, both of these configs are identical:

A:

[yolo]
anchors = 10,10,  20,20,  30,30,  40,40,  50,50,  60,60
mask = 1,2,3 # uses 20x20, 30x30, 40x40

[yolo]
anchors = 10,10,  20,20,  30,30,  40,40,  50,50,  60,60
mask = 3,4,5 # uses 40x40, 50x50, 60x60

B:

[yolo]
anchors = 20,20,  30,30,  40,40,  40,40,  50,50,  60,60
mask = 0,1,2 # uses 20x20, 30x30, 40x40

[yolo]
anchors = 20,20,  30,30,  40,40,  40,40,  50,50,  60,60
mask = 3,4,5 # uses 40x40, 50x50, 60x60

Arcitec commented 5 years ago

Yes I found the missing puzzle piece now:

https://github.com/AlexeyAB/darknet/blob/master/src/parser.c#L376-L389

parser.c translates the anchors= cfg entry into the layer's l.biases property array.

And it translates the mask= entry into an array of offsets into that biases array.
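A rough sketch of that idea (plain C, not the actual parser.c code): split the mask= string on commas into an array of integer offsets into the biases/anchors array.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Split a "1,2,3"-style string into integer offsets into the biases array. */
static int parse_mask(const char *text, int *mask, int max) {
    char buf[128];
    int count = 0;
    strncpy(buf, text, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    for (char *tok = strtok(buf, ","); tok && count < max; tok = strtok(NULL, ","))
        mask[count++] = atoi(tok);
    return count;
}

int main(void) {
    int mask[16];
    int n = parse_mask("1,2,3", mask, 16);
    for (int i = 0; i < n; i++)
        printf("mask[%d] = %d  (offset into l.biases)\n", i, mask[i]);
    return 0;
}
```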

So yes these configs are identical:

https://github.com/WongKinYiu/PartialResidualNetworks/issues/2#issuecomment-544971502

What do you think about that @WongKinYiu? You've discovered a way to increase mAP by ignoring the smallest anchor and reusing the medium-size anchor. So perhaps this should be proposed to @AlexeyAB as a better way to calculate anchors (in darknet's "calculate anchors" mode), but probably with a slightly better medium box (instead of reusing the same box twice). The first yolo layer should probably have a box between the sizes of box 2 and box 3.

PS: The process can probably be automated by telling darknet's "calculate anchors" mode to generate 7 anchors, then deleting the smallest anchor so you have 6 anchors, and then setting one [yolo] layer to mask = 0,1,2 and the other to mask = 3,4,5. This should give the best anchors, by throwing away the smallest (most useless) anchor and giving the first layer a good final "medium-size" anchor instead of reusing the second layer's medium anchor size.

It will probably improve mAP even more, because the layers will have more detailed anchor boxes to use.
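A minimal sketch of that proposed post-processing step (plain C; it assumes seven anchors have already been produced by darknet's calc_anchors, and the anchor values below are made up for illustration):

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct { int w, h; } anchor;

/* Sort anchors by area, smallest first. */
static int by_area(const void *a, const void *b) {
    const anchor *x = (const anchor *)a, *y = (const anchor *)b;
    return (x->w * x->h) - (y->w * y->h);
}

int main(void) {
    /* Pretend these seven pairs came out of darknet's calc_anchors. */
    anchor a[7] = {{8,11}, {14,18}, {27,31}, {45,62}, {85,80}, {140,165}, {340,320}};
    qsort(a, 7, sizeof(anchor), by_area);

    /* Drop index 0 (the smallest anchor) and keep the remaining six. */
    printf("anchors =");
    for (int i = 1; i < 7; i++)
        printf(" %d,%d%s", a[i].w, a[i].h, i < 6 ? "," : "");
    printf("\nmask = 0,1,2   (finer-grid [yolo] layer)\n");
    printf("mask = 3,4,5   (coarser-grid [yolo] layer)\n");
    return 0;
}
```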

Arcitec commented 5 years ago

@WongKinYiu Here's a .patch to fix the cfg file:

$ diff -u orig.txt  yolov3-tiny-prn.cfg
--- orig.txt    2019-10-22 17:03:11.646135000 +0200
+++ yolov3-tiny-prn.cfg 2019-10-22 16:58:00.947191300 +0200
@@ -139,7 +139,7 @@

 [yolo]
 mask = 3,4,5
-anchors = 10,14,  23,27,  37,58,  81,82,  135,169,  344,319
+anchors = 23,27,  37,58,  81,82,  81,82,  135,169,  344,319
 classes=80
 num=6
 jitter=.3
@@ -189,8 +189,8 @@
 activation=linear

 [yolo]
-mask = 1,2,3
-anchors = 10,14,  23,27,  37,58,  81,82,  135,169,  344,319
+mask = 0,1,2
+anchors = 23,27,  37,58,  81,82,  81,82,  135,169,  344,319
 classes=80
 num=6
 jitter=.3

Behavior is identical, anchors are identical, but now the mask values are proper.

AlexeyAB commented 5 years ago

@VideoPlayerCode

What do you think about that @WongKinYiu? You've discovered a way to increase mAP by ignoring the smallest anchor and reusing the medium-size anchor. So perhaps this should be proposed to @AlexeyAB as a better way to calculate anchors (in darknet's "calculate anchors" mode), but probably with a slightly better medium box (instead of reusing the same box twice). The first yolo layer should probably have a box between the sizes of box 2 and box 3.

It is just a bug in the pjreddie repo. It doesn't increase accuracy. @WongKinYiu used it just for a fair comparison of yolov3-prn with the yolov3-tiny model. https://github.com/WongKinYiu/PartialResidualNetworks/issues/2#issuecomment-529834587 mask = 1,2,3 is a bug and it must be fixed to mask = 0,1,2 in any case. What should be used is:

cfg mask=0,1,2 + weights mask=0,1,2 = high mAP <---

WongKinYiu commented 5 years ago

@VideoPlayerCode Hello,

We should not ~~remove~~ adjust the smallest anchor without reason. @AlexeyAB is correct; the reason I use mask = 1,2,3 is mainly for a fair comparison.

For yolov3-tiny, there are two feature pyramid levels (13x13 and 26x26 grids for a 416x416 input). That is to say, for anchors = 10,14, 23,27, 37,58, 81,82, 135,169, 344,319, the 10x14 anchor is smaller than 16x16 (416/26 = 16). Changing 10,14 to an anchor larger than 16x16 is better for this case.

For yolov3, there are three feature pyramid levels (13x13, 26x26, and 52x52), and the anchors are 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326. Here you can see that we do not need to adjust the smallest anchor for this case.

The conclusion is that we should adjust and reassign anchors to the proper grid scale. Different grid scales can be assigned different numbers of anchors. The grid scale also corresponds to the object density you can detect. For example, if you assign a 32x32 anchor to the 52x52 grid of a 416x416 input, you can handle objects with 75% overlap well (416/52 = 8, and a 32x32 anchor with stride 8 gives 75% overlap). However, if you assign a 10x14 anchor to the 26x26 grid, a single grid cell may contain multiple objects even though the objects do not overlap at all. In my opinion, for a grid cell that responds to k x k pixels of the input, a k x k anchor is the smallest acceptable size, and 2k x 2k to 4k x 4k are better choices.

In the yolov3 case, 156,198 and 373,326 are larger than 128x128, so grid scales of 7x7 and 3x3 would be better for them. That is what FPN does, or you can simply make the receptive field larger, as the author does in yolov3-spp.
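A minimal sketch of the stride/overlap arithmetic above, assuming a 416x416 input:

```c
#include <stdio.h>

/* Overlap ratio along one axis between an anchor of size `a` placed at two
 * neighbouring grid cells that are `stride` pixels apart. */
static double overlap_ratio(int a, int stride) {
    if (a <= stride) return 0.0;  /* neighbouring boxes don't overlap at all */
    return (double)(a - stride) / a;
}

int main(void) {
    const int input = 416;

    /* 32x32 anchor on the 52x52 grid: stride 8, so 75% overlap. */
    printf("52x52 grid: stride %d, 32px anchor overlap %.0f%%\n",
           input / 52, 100.0 * overlap_ratio(32, input / 52));

    /* 10px-wide anchor on the 26x26 grid: stride 16, so 0%% overlap, and a
     * single cell may have to cover several small objects. */
    printf("26x26 grid: stride %d, 10px anchor overlap %.0f%%\n",
           input / 26, 100.0 * overlap_ratio(10, input / 26));
    return 0;
}
```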

Arcitec commented 5 years ago

Hello @AlexeyAB, thanks for answering in this topic.

It is just a bug in the pjreddie repo. It doesn't increase accuracy.

That's what @WongKinYiu said though. Here: https://github.com/WongKinYiu/PartialResidualNetworks/issues/7#issuecomment-544785382

"however, for coco dataset, i use same imagenet pre-trained model, mask = 1,2,3 get better mAP than mask = 0,1,2."

He said he trained TWO models, mask=0,1,2 and mask=1,2,3, both fully trained on COCO from scratch (with pjreddie's ImageNet weights as the base), meaning two independent trainings. And that deleting mask 0 gave better mAP.

@WongKinYiu used it just for a fair comparison of yolov3-prn with the yolov3-tiny model. #2 (comment)

That's what I thought too, but in the link above he claims it gives better mAP period, even when he tried a mask=0,1,2 model properly trained with 012 weights.

mask = 1,2,3 is a bug and it must be fixed to mask = 0,1,2 in any case. What should be used is:

cfg mask=0,1,2 + weights mask=0,1,2 = high mAP <---

Yes, it is a bug. mask=1,2,3 means the 3rd anchor size is used in two layers, and the 0th anchor is not used at all.

@WongKinYiu Hello again. :-) You said "we should not remove the smallest anchor without reason", but the patch (https://github.com/WongKinYiu/PartialResidualNetworks/issues/2#issuecomment-545007129) was just meant to show you that by setting mask=1,2,3 your config already removes the smallest anchor. The anchors in my patch are 100% the same as in the cfg you released, because your config does not use anchor 0 at all. So I just edited the anchors so the masks could be fixed to 0,1,2 (so that other people don't train their own model from the buggy value), and at the same time showed what the anchors truly are. ;-)

Btw, what you're showing above sounds interesting. Are those new ideas for optimizing anchors that could be put into darknet's detector calc_anchors command/algorithm?

AlexeyAB commented 5 years ago

I tried to use overlapping anchors in this model, yolo_v3_tiny_pan3.cfg.txt: https://github.com/AlexeyAB/darknet/files/3580764/yolo_v3_tiny_pan3_aa_ae_mixup_scale_giou.cfg.txt

mask = 0,1,2,3,4
mask = 4,5,6,7,8
mask = 8,9,10,11

More: https://github.com/AlexeyAB/darknet/issues/3114#issuecomment-494148968

Arcitec commented 5 years ago

@AlexeyAB Yep, overlapping anchors are valid, since the mask just controls which anchor sizes the layer will try.

But the problem here is different: @WongKinYiu said that by deleting (not using) the smallest anchor, mAP increases.