ZiweiWangTHU / BiDet

This is the official PyTorch implementation for the paper BiDet: An Efficient Binarized Object Detector, which was accepted by CVPR 2020.
MIT License

SSD300_VGG16 architecture #16

Closed puhan123 closed 3 years ago

puhan123 commented 3 years ago

Hi! Have you restructured your network of SSD300_Vgg16?

Wuziyi616 commented 3 years ago

Hi! Thank you for your interest in our work! Yes, we do modify the structure of the VGG16 backbone of SSD300_VGG16. The modifications are twofold.

First, we add shortcut connections as suggested by Bi-Real Net here. Note that this is only for BiDet (SC); we don't use such shortcuts in plain BiDet.
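To make the Bi-Real-Net-style shortcut concrete, here is a minimal sketch of the idea (the class name is illustrative and `nn.Conv2d` stands in for the actual binarized conv, so this is not the repo's real code):

```python
import torch
import torch.nn as nn

class BinaryConvWithShortcut(nn.Module):
    """Bi-Real-Net-style block: the real-valued input is added back
    after the (binary) conv, so information bypasses the 1-bit
    bottleneck. A plain Conv2d stands in for the binarized conv."""

    def __init__(self, channels):
        super().__init__()
        # stand-in for the binary conv used in BiDet (SC)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        # identity shortcut around the (binary) conv
        return self.bn(self.conv(x)) + x
```

The shortcut requires matching input/output channels and resolution, which is why it wraps same-shape conv blocks.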

Second, we add BatchNorm (BN) after every Conv layer, as here. That's because many previous works point out that BN is very important for binary neural networks (BNNs).
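A sketch of what "BN after every Conv" looks like in a VGG-style layer builder (an illustrative helper, not the repo's actual function):

```python
import torch.nn as nn

def vgg_layers_with_bn(cfg, in_channels=3):
    """Build VGG-style conv layers with a BatchNorm after every Conv,
    mirroring the modification described above. 'M' denotes a pooling
    stage; integers are output channel counts."""
    layers, c = [], in_channels
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(2, 2))
        else:
            layers += [
                nn.Conv2d(c, v, 3, padding=1),
                nn.BatchNorm2d(v),   # BN inserted right after the Conv
                nn.ReLU(inplace=True),
            ]
            c = v
    return nn.Sequential(*layers)
```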

We have also tried training a real-valued SSD300_VGG16 network with these modifications, and we found that adding BN didn't improve the accuracy, while adding the shortcuts even degraded performance slightly. So we just report the original mAP of SSD in our paper. I also want to mention that these two operations (shortcut and BN) add only minor overhead in FLOPs and parameter size. In contrast, they significantly improve the mAP of BiDet. For example, directly applying BiDet to vanilla SSD_VGG16 only gets an mAP of ~45% (if I remember correctly).

puhan123 commented 3 years ago

Thank you very much! Your answer helped me a lot!

puhan123 commented 3 years ago

Excuse me, I have another question!

When you constructed the detector heads of SSD, did you also add BatchNorm (BN) after every Conv layer?

Wuziyi616 commented 3 years ago

Hi! This is a good question, and I did some ablation studies in my experiments that were not shown in the paper. We didn't add BN after the Conv layers in the detector heads (of SSD300_VGG16), as you can see from the code here. I tested adding BN, and surprisingly, the mAP degraded by ~0.5%. There also seemed to be no benefit in training speed or stability. So we didn't use BN in the detector heads in the final version of BiDet.

I didn't delve deep into the reason. But I conjecture that it's because the Conv layers in the detector not only extract features but also need to localize objects, so using normalization methods like BN may harm their localization ability. As you can imagine, BN pushes the feature maps toward a normal distribution, which may make them less discriminative for distinguishing different objects and localizing them. However, I have to say this is all my conjecture, and I'm not sure whether it's true.

puhan123 commented 3 years ago

Thank you very much! Your answer helped me a lot!

I noticed that you take out the activation layer after some layers in bidet_ssd.py. Why did you do that?

And I also found that you cut the MaxPool layers in VGG16 and replace them with downsampling Conv layers with stride == 2. Can I understand it this way?
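The replacement being asked about can be sketched like this (an illustrative helper, not the repo's actual code; the BN placement follows the backbone modification discussed earlier):

```python
import torch.nn as nn

def downsample(channels):
    """Stride-2 conv used in place of MaxPool2d(2, 2): both halve the
    spatial resolution, but the conv is learnable and avoids pairing
    MaxPool directly with a binary conv."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, stride=2,
                  padding=1, bias=False),
        nn.BatchNorm2d(channels),
    )
```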

Wuziyi616 commented 3 years ago

For the first question, do you mean here? If so, it's because we need the intermediate feature maps to calculate I(X; F) as one of the loss terms.
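The general pattern of exposing an intermediate feature map for such a loss can be sketched as follows (a toy module with made-up layer sizes, not BiDet's actual network):

```python
import torch
import torch.nn as nn

class BackboneWithFeatures(nn.Module):
    """Toy backbone whose forward also returns an intermediate
    feature map (before the activation) so that a loss such as
    I(X; F) can be computed on it."""

    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1),
                                    nn.BatchNorm2d(16))
        self.act = nn.ReLU(inplace=True)
        self.block2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1),
                                    nn.BatchNorm2d(32))

    def forward(self, x):
        feat = self.block1(x)            # pre-activation feature map F
        out = self.block2(self.act(feat))
        return out, feat                 # feat feeds the I(X; F) loss term
```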

For the second question, yes, you are right. I forgot this in yesterday's response. Indeed, I replace the MaxPool in VGG with a stride-2 Conv, because I found that MaxPool + BinaryConv performed very poorly on the detection task. Really sorry for my mistake; this work was done a year ago and I haven't been working on it for a while.

BTW, there is another small modification to the localization output branch of SSD's detector heads. The original SSD predicts 4 values per localization output: shifts over x and y and scales over x and y. Here I use 8 values. Because BiDet adopts the Information Bottleneck (IB) principle, the model output should be distributions rather than deterministic values. Therefore we model the shift and scale with Normal distributions and use 8 values to represent them: 4 for the means and 4 for the stds. (To be honest, the need to use distributions rather than deterministic values derives from the IB theory. In my experiments, I didn't find much difference between them; I can get similar mAP with either output form.)
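The 8-value localization output can be sketched like this (class name and layer sizes are illustrative; the actual head in the repo may differ, e.g. in how the std is parameterized):

```python
import torch
import torch.nn as nn

class DistributionalLocHead(nn.Module):
    """Localization head predicting a distribution per anchor:
    8 values = 4 means + 4 log-stds for (dx, dy, dw, dh), instead of
    SSD's usual 4 deterministic offsets."""

    def __init__(self, in_channels, num_anchors):
        super().__init__()
        self.num_anchors = num_anchors
        # 8 outputs per anchor instead of SSD's 4
        self.conv = nn.Conv2d(in_channels, num_anchors * 8, 3, padding=1)

    def forward(self, x):
        out = self.conv(x)                       # (N, A*8, H, W)
        n, _, h, w = out.shape
        out = out.view(n, self.num_anchors, 8, h, w)
        mean, log_std = out[:, :, :4], out[:, :, 4:]
        return mean, log_std.exp()               # exp keeps std > 0
```

Predicting log-std and exponentiating is one common way to guarantee a positive standard deviation.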

puhan123 commented 3 years ago

Thank you very much for your reply!