ASFF - Learning Spatial Fusion for Single-Shot Object Detection - 63% mAP@0.5 with 45.5FPS

Kyuuki93 commented 4 years ago

Learning Spatial Fusion for Single-Shot Object Detection

paper https://arxiv.org/pdf/1911.09516.pdf
code https://github.com/ruinmessi/ASFF

@AlexeyAB it's seems worth to take a look

AlexeyAB commented 4 years ago

@Kyuuki93

Also since we don't use Deformable-conv, then you can try to use RFB-block with flexible receptive field from 1x1 to 11x11: #4507 (comment)

By change activation = linear to activation = leaky?

You can use activation = linear for conv-layers.

Generally by adding [maxpool] maxpool_depth=1:

[route]
layers = -1,-5,-9,-12

[maxpool]
maxpool_depth=1
out_channels=64
size=1
stride=1

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=linear

[shortcut]
from=-16
activation=leaky

sctrueew commented 4 years ago

@Kyuuki93 @AlexeyAB Hi,

When can we use the ASFF? please share the final cfg. have you tried to compare ASFF and csresnext50-panet-spp?

Kyuuki93 commented 4 years ago

@zpmmehrdad yolov3-spp + ASFF yolov3-spp-asff-it.cfg.txt yolov3-spp + ASFF +Dropblock + RFB(bn=0)yolov3-spp-asff-db-it-rfb.cfg.txt

you can set bn=1 in RFB blocks to get better results.

have you tried to compare ASFF and csresnext50-panet-spp?

Not yet, you can compare it on your data, and this ASFF based on darknet-53 but csresnext50-panet-spp based on resnext, maybe we should implement ASFF in resnext50 for a fair compassion

sctrueew commented 4 years ago

@Kyuuki93 Thanks for your reply,

I saw the table that you shared and in the table "spp,giou,it=0.213,asff(softmax),rfb(bn=0)" has a good result at AP@.75. I'm going to use for ~200 classes and some classes are almost the same and AP@.75 important for me. Do you think "spp,giou,it=0.213,asff(softmax),rfb(bn=0)" is a good option for me or not?

Thanks

Kyuuki93 commented 4 years ago

I'm going to use for ~200 classes and some classes are almost the same and AP@.75 important for me. Do you think "spp,giou,it=0.213,asff(softmax),rfb(bn=0)" is a good option for me or not?

Try to compare spp,giou,it=0.213,asff(softmax),rfb(bn=0) and spp,giou,it=0.213,asff(softmax),rfb(bn=1), this ASFF module wasn't get enough test, I'm not sure it can imporove AP@.75 at every dataset, I used it on a one-class datatset, so if you got results please share with us

AlexeyAB commented 4 years ago

@Kyuuki93

Not yet, you can compare it on your data, and this ASFF based on darknet-53 but csresnext50-panet-spp based on resnext, maybe we should implement ASFF in resnext50 for a fair compassion

Look at this comparison: https://github.com/AlexeyAB/darknet/issues/4406#issuecomment-567919052

For GPUs without Tensor Cores ResNext50 was better than Darknet53, but for Volta/Turing (RTX) GPUs and newer, it seems that Darknet53 is better. So may be we should use CSPDarkNet-53 backbone rather than CSPResNeXt-50 https://github.com/WongKinYiu/CrossStagePartialNetworks#big-models
Also may be 1 block of BiFPN (based on NORM_CHAN_SOFTMAX) can be better than ASFF

Kyuuki93 commented 4 years ago

So may be we should use CSPDarkNet-53 backbone rather than CSPResNeXt-50 https://github.com/WongKinYiu/CrossStagePartialNetworks#big-models

It seems worth to try

Also may be 1 block of BiFPN (based on NORM_CHAN_SOFTMAX) can be better than ASFF

I have a question about BiFPN which is with 3 yolo layers, BiFPN should just keep P3-5 and ignore P6-7?

Btw, I have made a Spinenet-49 with 3 yolo layers spinenet.cfg.txt, you can check it for sure or take a test.

Training from scratch is a little bit slow..

Kyuuki93 commented 4 years ago

@AlexeyAB Also, take a look of this https://github.com/AlexeyAB/darknet/issues/3874#issuecomment-568696075, it's seem gaussian_yolo hurt recall heavily, and low iou_thresh can significant improve it.

And next I want to find out the relation between precision ,reall and ignore_thresh, truth_thresh

AlexeyAB commented 4 years ago

@Kyuuki93

[Gaussian_yolo] introduces bbox_confidence_score = (0 - 1), so confidence_score = class_conf * bbox_conf will be lower than confidence_score = class_conf in [yolo] - it decreases the number of bboxes with thresh > conf_thresh - it increases Precision and decreases Recall for the same conf_threshold
iou_thresh=0.213 allows Yolo to use many not the most suitable anchors for one object - it increases the number of bboxes (but additional bboxes are not accurate) - it increases Recall and decreases Precision for the same conf_threshold

Model	AP@.5	precision (th=0.85)	recall(th=0.85)	precision (th=0.7)	recall(th=0.7)
spp,mse	89.50%	0.98	0.20	0.97	0.36
spp,giou	90.09%	0.98	0.25	0.97	0.40
spp,ciou	89.88%	0.99	0.22	0.97	0.38
spp,giou,gs	91.39%	0.99	0.05	0.97	0.47
spp,giou,gs,it	91.87%	0.99	0.16	0.97	0.52

AlexeyAB commented 4 years ago

I have a question about BiFPN which is with 3 yolo layers, BiFPN should just keep P3-5 and ignore P6-7?

Yes. You can just get features from these 3 points (P3, P4, P5). And use NORM_CHAN_SOFTMAX
Or you can get features from earlier points (figure below)
- or from 4 points P2 - P5 instead of P3 - P7 (as it is done in PANnet https://github.com/AlexeyAB/darknet/issues/3175 )
- or from 5 points P1 - P5 instead of P3 - P7.

And you can duplicate BiFPN-block many times (from 2 to 8 BiFPN blocks) - page 5, table 1: https://arxiv.org/pdf/1911.09070v1.pdf

69334457-cab33e80-0c6b-11ea-8214-8f38d5aeda88

AlexeyAB commented 4 years ago

Btw, I have made a Spinenet-49 with 3 yolo layers spinenet.cfg.txt, you can check it for sure or take a test.

I did not go into details how the Senet should look in details. Several questions:

Can you show a link to the code (Pytorch/TF/...) from which you copied the Senet?
masks= of [yolo] layers should be fixed for P3, P2, P4 sizes
Why did you remove top 2 blocks (P5, P6)?

2 shortcut layers are pointed to the same layer-5

Is it normal that some of your layers have 19 BFLOPS?

Kyuuki93 commented 4 years ago

Can you show a link to the code (Pytorch/TF/...) from which you copied the Senet?

There no public spinenet in any framework yet, this cfg was based on

masks= of [yolo] layers should be fixed for P3, P2, P4 sizes

I did not notice that, will change then

Why did you remove top 2 blocks (P5, P6)?

Feature map in P5, P6 were too small I think

2 shortcut layers are pointed to the same layer-5

layer-9 was shortcut in 2nd bottleneck bloc, with activation=leaky layer-10 was input of 3rd bottleneck, with activation=linear

Is it normal that some of your layers have 19 BFLOPS?

I checked the model, and compare the ratio here

I think this 19 BFLOPS was right, and it's result of use residual block instead of bottleneck block, which is diamond block in previous figure

AlexeyAB commented 4 years ago

2 shortcut layers are pointed to the same layer-5

layer-9 was shortcut in 2nd bottleneck bloc, with activation=leaky layer-10 was input of 3rd bottleneck, with activation=linear

What do you mean? Do you mean this is correct?

#               # 9 b2
[shortcut]
from=-4
activation=leaky
#               # 10 b3 3rd gray rectangle block
#               # from b1,b2
[shortcut]
from=-5
activation=leaky

Kyuuki93 commented 4 years ago

What do you mean? Do you mean this is correct?

#               # 9 b2
[shortcut]
from=-4
activation=leaky
#               # 10 b3 3rd gray rectangle block
#               # from b1,b2
[shortcut]
from=-5
activation=leaky -> should be linear

My mistake, but two shortcut from layer-5 is correct

Kyuuki93 commented 4 years ago

@AlexeyAB this spinenet which has one shortcut layer set wrong activation function and use 3 yolo on P2, P2, P3, got 88.80% AP@.5 and 52.26% AP@.75 in previous one-class dataset, training from scratch with setting

width,height = 384,384
random=0
iou_loss=giou
iou_thresh=0.213

but 1.5x slower than yolov3-spp, I will take more test then

AlexeyAB commented 4 years ago

@Kyuuki93 Try to train both yolov3-spp and fixed-spinenet without pre-trained weights and with the same other settings.

AlexeyAB commented 4 years ago

Seems counters_per_class not work out data unbalance issue, compare with class weights in classification problem, classify loss were produced by class only, but in detection problem, loss were produced by class and location even objectness, and loss from loc and obj were not relevant to class label, so losses multiplier could not work this out, actually in my dataset, class of box got high accuracy, its seems model just can't find the class-0's object which lack data on training dataset

I added 2 fixes, so now counters_per_class affects the objectness and bbox too. https://github.com/AlexeyAB/darknet/commit/35a3870979e0d819208a3de7a24c39cc0539651d https://github.com/AlexeyAB/darknet/commit/b8fe630119fea81200f6ca4641ce2514d893df04

Kyuuki93 commented 4 years ago

For comparison of spinenet(fixed, 5 yolo-layers) and yolov3-spp(3 yolo-layers), training from scratch with same settings

width = 384
height = 384
batch = 96
subdivisions = 16
learning_rate = 0.00025
burn_in = 1000
max_batches = 30200
policy = steps
steps = 15000, 20000, 25000
scales = .1,.1,.1
...
random = 0
iou_loss = giou
iou_normalizer = 0.5
iou_thresh = 0.213

Network	AP@.5	AP.75	precision(.7)	recall(.7)	Inference time
spinenet49-5l	90.46%	53.80%	0.93	0.71	32.17ms
yolov3-spp	89.98%	54.47%	0.96	0.53	11.77ms

Kyuuki93 commented 4 years ago

@AlexeyAB

There are any op like nn.Parameter() in this repo for implementing this wi in BiFPN?

AlexeyAB commented 4 years ago

@Kyuuki93

There are any op like nn.Parameter() in this repo for implementing this wi in BiFPN?

What do you mean?

If you want Unbounded fusion, then just use activation=linear instead of activation=NORM_CHAN_SOFTMAX

Kyuuki93 commented 4 years ago

@AlexeyAB

For example, wi is a scalar, P4_mid = Conv( ( w1*P4_in + w2* Resize(P5_in)) / ( w1+ w2) ), this wi should trainable but not relevant with any feature map

In ASFF, w was calculated by feature map through a conv_layer

AlexeyAB commented 4 years ago

@Kyuuki93

In ASFF, w was calculated by feature map through a conv_layer

Do you mean that is not so in BiFPN? https://github.com/xuannianz/EfficientDet/blob/ccc795781fa173b32a6785765c8a7105ba702d0b/model.py

If you want w constant during inference, then you can do something like this:

[route]
layers = P4

[convolutional]
batch_normalize=1
filters=256
groups=256
size=1
stride=1
pad=1
activation=linear

[route]
layers = P5

[convolutional]
batch_normalize=1
filters=256
groups=256
size=1
stride=1
pad=1
activation=linear

[shortcut]
from = -3

AlexeyAB commented 4 years ago

For comparison of spinenet(fixed, 5 yolo-layers) and yolov3-spp(3 yolo-layers), training from scratch with same settings

Also try to compare with spinenet(fixed, 3 yolo-layers) + spp, where is added SPP-block to the P5 or P6 block: https://github.com/AlexeyAB/darknet/issues/4382#issuecomment-568950286

https://github.com/AlexeyAB/darknet/blob/35a3870979e0d819208a3de7a24c39cc0539651d/cfg/yolov3-spp.cfg#L575-L597

Kyuuki93 commented 4 years ago

@AlexeyAB

Do you mean that is not so in BiFPN? https://github.com/xuannianz/EfficientDet/blob/ccc795781fa173b32a6785765c8a7105ba702d0b/model.py

def build_BiFPN() here is not so, it without w https://github.com/xuannianz/EfficientDet/blob/ccc795781fa173b32a6785765c8a7105ba702d0b/model.py#L40-L93

def build_wBiFPN() here is BiFPN with w https://github.com/xuannianz/EfficientDet/blob/ccc795781fa173b32a6785765c8a7105ba702d0b/model.py#L96-L149 w was defined here, actually, we need a layer like this one https://github.com/xuannianz/EfficientDet/blob/ccc795781fa173b32a6785765c8a7105ba702d0b/layers.py#L33-L60

Maybe add a weights to [shortcut] layer is a option, also [shortcut] can take more than 2 inputs, something like

[shortcut]
from=P4, P5_up
weights_type = feature (or channel or pixel)
weights_normalizion = relu (or softmax or linear)
activation = linear

Kyuuki93 commented 4 years ago

Feature map on P6 only 4x4, could be too small to get useful feature?

Normally, SPP was on the middle and connected Backbone and FPN? Look like Backbone -> SPP -> FPN

But in Spinenet49, it seems all network is a FPN

Kyuuki93 commented 4 years ago

@AlexeyAB I moved spinenet related comment to its issue

AlexeyAB commented 4 years ago

@Kyuuki93

Feature map on P6 only 4x4, could be too small to get useful feature?

Yes, then spp should be placed in P5 (especially if you use small initiall network resolution)

[shortcut]
from=P4, P5_up
weights_type = feature (or channel or pixel)
weights_normalizion = relu (or softmax or linear)
activation = linear

Yes, or maybe just enough feature without channel or pixel

Interestingly, a fusion from BiPPN is more effective than such a fusion?

this is the same as w - a vector (per-channel) in BiFPN with ReLU
batch_normalize=1 - will do normalization to solve training instability issue
leaky in this block and in conv-layers L1, L2, L3, ensures that weights will be mostly positive too

[route] 
layers= L1, L2, L3    # output: W x H x 3*C

[convolutional]
batch_normalize=1
filters=3*C
groups=3*C
size=1
stride=1
pad=1
activation=leaky

[local_avgpool]
avgpool_depth = 1  # isn't implemented yet
                   #avg across C instead of WxH - Same meaning as maxpool_depth=1 in [maxpool]
out_channels = C

AlexeyAB commented 4 years ago

@Kyuuki93 It seems that higher ignore_thresh=0.85 is better than ignore_thresh=0.7 for your dataset. https://github.com/AlexeyAB/darknet/issues/3874#issuecomment-568696075 Also turth_tresh=1.0 is good. So for your dataset is better to use iou_tresh=1.0 (or not use it at all).

Kyuuki93 commented 4 years ago

@AlexeyAB

It seems that higher ignore_thresh=0.85 is better than ignore_thresh=0.7 for your dataset.

ignore_thresh = 0.85 got higher AP@.5 but much lower recall than ignore_thresh = 0.7

Also turth_tresh=1.0 is good.

Actually,

truth_tresh only worked on 1.0
when both truth_thresh and ignore_thresh set to 0.7 network become untrainable
keep ignore_thresh = 0.7, truth_thresh = 0.85`, decrease perfomance

So for your dataset is better to use iou_tresh=1.0 (or not use it at all).

What do you mean? For now, all training with iou_thresh = 0.213, do you mean set iou_thresh=1.0 when change truth_thresh or ignore_thresh?

Other one-stage methods worked on dual threshold such as ignore_thresh = 0.3 and truth_thresh = 0.5, but yolo worked on single threshold with ignore_thresh = 0.7, this also mentioned in yolov3's paper but no explain, I just wonder why

AlexeyAB commented 4 years ago

@Kyuuki93

Happy New Year! :fireworks: :sparkler:

What do you mean? For now, all training with iou_thresh = 0.213, do you mean set iou_thresh=1.0 when change truth_thresh or ignore_thresh?

I mean may be better to use in your dataset:

ignore_thresh = 0.7
truth_thresh = 1.0
iou_thresh=1.0

While for MS COCO may be better to use

ignore_thresh = 0.7
truth_thresh = 1.0
iou_thresh=0.213

Other one-stage methods worked on dual threshold such as ignore_thresh = 0.3 and truth_thresh = 0.5, but yolo worked on single threshold with ignore_thresh = 0.7, this also mentioned in yolov3's paper but no explain, I just wonder why

What methods do you mean?

In the original Darknet there are several issues which may degrade accuracy when using low values of ignore_thresh or truth_thresh

Initially in the original Darknet there were several wrong places which I fixed:

There was used if (best_iou > l.ignore_thresh) { instead of if (best_match_iou > l.ignore_thresh) { https://github.com/AlexeyAB/darknet/blame/dcfeea30f195e0ca1210d580cac8b91b6beaf3f7/src/yolo_layer.c#L355 Thus, it didn't decrease objectness even if there was an incorrect class_id. Now it decrease objectness if detection_class_id != truth_class_id - it improves accuracy if ignore_thresh < 1.0.
When truth_thresh < 1.0 then the probability that many objects will correspond to one anchor increases. But in the original Darknet, only the last (from label-txt-file) truth-bbox affected the anchor. I fixed it - now it averages deltas of all truths which correspond to this one anchor - so truth_thresh < 1.0 and iou_thresh < 1.0 may have a better effect:
- accumulate deltas: https://github.com/AlexeyAB/darknet/blame/dcfeea30f195e0ca1210d580cac8b91b6beaf3f7/src/yolo_layer.c#L153-L156
- divided by the number of objects of different classes (maybe we should also change it to divide by the number of objects not only different classes): https://github.com/AlexeyAB/darknet/blame/dcfeea30f195e0ca1210d580cac8b91b6beaf3f7/src/yolo_layer.c#L492
Also isn't tested and isn't fixed possible bug with MSE: https://github.com/AlexeyAB/darknet/issues/4594#issuecomment-569927386

Kyuuki93 commented 4 years ago

@AlexeyAB Happy New Year!

There are old cpc and new cpc results, seems use loss multiplier on all loss parts could balance classes AP slightly but not improve it

Model	mAP@.5(C0/C1)	mAP@.75(C0/C1)
giou	79.53%(69.24%/89.83%)	59.65%(42.96%/76.34%)
giou,cpc	79.51% (69.07%/89.96%)	59.52%(42.17%/76.87%)
giou,cpc(new)	79.44%(70.03%/88.84%)	59.61%(44.95%/74.27%)

Kyuuki93 commented 4 years ago

I mean may be better to use in your dataset: iou_thresh=1.0 While for MS COCO may be better to use: iou_thresh=0.213

Actually, in my dataset iou_thresh = 0.213 always get better results, I think use a lower iou_thresh allows several anchors can predict same object, and in original darknet use only nearest anchor to predict object which limited yolo's ability, so set a lower iou_thresh will always get better results, just need to search a suit value for a certain dataset.

What methods do you mean?

Some method use like ignore_thresh = 0.5 & truth_thresh =0.7, which means iou < 0.5, negative sample 0.5 <iou<0.7, ignore iou > 0.7, positive sample

I'm not sure this is exactly yolo's ignore_thresh and truth_thresh

AlexeyAB commented 4 years ago

@Kyuuki93

seems use loss multiplier on all loss parts could balance classes AP slightly but not improve it

Yes.

Some method use like ignore_thresh = 0.5 & truth_thresh =0.7, which means iou < 0.5, negative sample 0.5 <iou<0.7, ignore iou > 0.7, positive sample

Yes.

truth_thresh is very similar (but not the same) as iou_thresh, so this is strange that you get better result with higher truth_thresh and with lower iou_thresh.

For MS COCO iou_thresh=0.213 greatly increases accuracy.

AlexeyAB commented 4 years ago

@WongKinYiu @Kyuuki93 I am adding new version of [shortcut], now I am re-making [shortcut] layer for fast BiFPN: https://github.com/AlexeyAB/darknet/issues/4382#issuecomment-569197177

so be careful by using commits from Jan 7, 2020 it may have bugs in [shortcut] layer.

Before using, try to train small model with [shortcut] layer

WongKinYiu commented 4 years ago

@AlexeyAB

Okay, thanks.

Kyuuki93 commented 4 years ago

@AlexeyAB ok, thanks

AlexeyAB commented 4 years ago

@Kyuuki93 @WongKinYiu I added new version of [shortcut] layer for BiFPN from EfficientDet: https://github.com/AlexeyAB/darknet/issues/4662

So you can try to make Detector with 1 or several BiFPN blocks. And with 1 ASFF + several BiFPN blocks (yolov3-spp-asff-bifpn-db-it.cfg)

AlexeyAB commented 4 years ago

@nyj-ocean

[convolutional]
stride=1
size=1
filters=4
activation=normalize_channels_softmax

[route]
layers=-1
group_id=0
groups=4

...

[route]
layers=-1
group_id=3
groups=4

AlexeyAB commented 4 years ago

@nyj-ocean It is due that 4-th branch has 4=(2x2) more outputs. So you should use /2 less filters in conv-layers.

nyj-ocean commented 4 years ago

@AlexeyAB I reduce the value of filters in some [convolutional] layers. But the FPS of yolov3-4l+ASFF.cfg is still slow than yolov3-4l.cfg I am waiting to see whether the final mAP of yolov3-4l+ASFF.cfg increase or not compared with yolov3-4l.cfg

But the way , i want to try ASFF + several BiFPN ,where could i download the yolov3-spp-asff-bifpn-db-it.cfg in https://github.com/AlexeyAB/darknet/issues/4382#issuecomment-572760285?

AlexeyAB / darknet

ASFF - Learning Spatial Fusion for Single-Shot Object Detection - 63% mAP@0.5 with 45.5FPS #4382