Hyungjun-K1m / BinaryDuo

Torch-7 implementation of BinaryDuo (ICLR 2020).

Reproducing results with Pytorch on ImageNet #2

Closed: mengjian0502 closed this issue 3 years ago

mengjian0502 commented 3 years ago

Hi there,

First of all, this is really interesting work on binary neural network quantization! I'm trying to reproduce your ImageNet results in PyTorch with ResNet-18 (without the additional shortcut), and I have some questions regarding the coupled binary model:

  1. For the weight quantization, I noticed your quantization function is `self.weightQ:copy(self.weightOrg):sign():add(0.1):sign():mul(absE)`. If I understand correctly, in PyTorch it should be rewritten as `self.weight_q = self.weight.org.sign().add(0.1).sign().mul(absE)` (a small sketch of this is included after these questions). May I ask why you add the 0.1 offset in the weight quantization?

  2. I compared your ResNet-18 model with the original Torch model from Facebook, and it seems the forward pass of your residual block is HardTanh (ternary) -> BinaryConv -> BatchNorm. In my previous implementation I used the conventional block structure, and the coupled ternary model only gave me ~46% Top-1 accuracy. May I ask why you chose this type of structure?
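For reference, here is a minimal PyTorch sketch of the rewrite from point 1. The names are illustrative, and `absE` is assumed here to be the mean absolute value of the latent weights, which is not actually stated in this thread:

```python
import torch

def binarize_weight(weight_org: torch.Tensor) -> torch.Tensor:
    """Rough sketch of the weight quantizer discussed in point 1 above."""
    # Scaling factor; assumed here to be the mean absolute latent weight.
    absE = weight_org.abs().mean()
    # sign() maps negatives to -1, positives to +1 and exact zeros to 0;
    # the extra .add(0.1).sign() pushes those zeros to +1, so the result
    # contains only {-absE, +absE}.
    return weight_org.sign().add(0.1).sign().mul(absE)
```

In training, the full-precision `weight_org` would keep receiving the gradient updates while only this binarized copy is used in the forward pass.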

Hyungjun-K1m commented 3 years ago

Thanks for your interest! It's great to hear that you're trying to reproduce our work in PyTorch!

  1. The +0.1 term in the weight binarization is used to ensure that there are no 0s after binarization. The `:sign()` function outputs -1 for negative inputs, +1 for positive inputs, and 0 for a 0 input, so the extra `:add(0.1):sign()` maps those 0s to +1.

  2. When binarizing ResNet models, the block configuration is really important. As you mentioned, the layers in the original ResNet block are ordered Conv-BN-Act. However, when quantizing the model, you have to ensure that quantization happens right before the convolution layer. If you use the original ResNet structure and quantize the activation after the BN layer, the shortcut addition that follows the Act (quantization) layer brings the activation back to high precision, so the convolution input does not have the precision we wanted. You can find other BNN versions of ResNet written in PyTorch (e.g. here) and they use the same architecture as ours. A rough sketch of this ordering is shown below.
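To make the ordering concrete, here is a rough PyTorch sketch of such a block. The `binarize` helper and the plain `Conv2d` layers are stand-ins rather than the actual modules from this repo, and a single block-level shortcut is shown for simplicity:

```python
import torch
import torch.nn as nn

def binarize(x: torch.Tensor) -> torch.Tensor:
    # Placeholder activation quantizer: clamp to [-1, 1], then take the sign.
    # (A hard-tanh based binary/ternary quantizer would go here.)
    return torch.clamp(x, -1.0, 1.0).sign()

class BinarizedBasicBlock(nn.Module):
    """Sketch only: quantize -> (binary) conv -> BN, shortcut added after BN,
    so every convolution sees a freshly quantized input."""

    def __init__(self, planes: int):
        super().__init__()
        self.conv1 = nn.Conv2d(planes, planes, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)

    def forward(self, x):
        identity = x                              # full-precision shortcut path
        out = self.bn1(self.conv1(binarize(x)))   # quantization right before conv
        out = self.bn2(self.conv2(binarize(out)))
        return out + identity                     # shortcut added after BN, not after Act
```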

Questions are always welcome!

mengjian0502 commented 3 years ago

Thank you so much for your reply!

After I posted this, I tried the exact same architecture as the one in your resnet_couple.lua. However, the accuracy still cannot match the reported results :( According to the reported training graph, I'm targeting >40% validation accuracy at around epoch 15 (training-curve screenshot attached).

But here's my training log up to epoch 14:

```
 ep   lr      tr_loss  tr_acc   tr_time    te_loss  te_acc   best_acc
---- ------- -------- -------- ---------- -------- -------- ----------
  1  0.0050   5.2586   8.0109  1278.2992   5.0309  10.6600   10.6600
  2  0.0050   4.2404  18.2481  1272.1327   4.2708  17.6620   17.6620
  3  0.0050   3.8952  22.8683  1271.0228   3.9787  21.4340   21.4340
  4  0.0050   3.7332  25.2462  1269.3032   3.7661  24.4900   24.4900
  5  0.0050   3.6367  26.6375  1269.3312   4.2486  18.9480   24.4900
  6  0.0050   3.5696  27.6482  1268.0745   3.9182  23.2000   24.4900
  7  0.0050   3.5225  28.3587  1268.3683   4.1807  20.7660   24.4900
  8  0.0050   3.4857  28.9452  1268.0337   3.9151  23.4480   24.4900
  9  0.0050   3.4564  29.3564  1268.2426   4.2566  19.5100   24.4900
 10  0.0050   3.4334  29.7219  1267.7278   4.5397  17.5760   24.4900
 11  0.0050   3.4148  29.9761  1266.4872   6.0252  10.4980   24.4900
 12  0.0050   3.3984  30.2526  1267.8896   4.1347  21.2900   24.4900
 13  0.0050   3.3830  30.5357  1267.3226   3.9362  22.7660   24.4900
 14  0.0050   3.3717  30.7573  1267.1464   5.2241  13.1580   24.4900
```

Here's my decoupled model structure; could you take a quick look?

```
ResNet_imagenet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu1): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (tanh1): TerneryHardTanh()
      (conv1): BinaryDecoupleConv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False, decouple=False, decouple=False, quantize_a=False)
      (rep1): Replicate(decouple=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (tanh2): TerneryHardTanh()
      (conv2): BinaryDecoupleConv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False, decouple=False, decouple=False, quantize_a=False)
      (rep2): Replicate(decouple=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Identity()
    )
    (1): BasicBlock(
      (tanh1): TerneryHardTanh()
      (conv1): BinaryDecoupleConv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False, decouple=False, decouple=False, quantize_a=False)
      (rep1): Replicate(decouple=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (tanh2): TerneryHardTanh()
      (conv2): BinaryDecoupleConv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False, decouple=False, decouple=False, quantize_a=False)
      (rep2): Replicate(decouple=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Identity()
    )
  )
  (layer2): Sequential(
    (0): BasicBlock(
      (tanh1): TerneryHardTanh()
      (conv1): BinaryDecoupleConv2d(64, 90, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False, decouple=False, decouple=False, quantize_a=False)
      (rep1): Replicate(decouple=False)
      (bn1): BatchNorm2d(90, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (tanh2): TerneryHardTanh()
      (conv2): BinaryDecoupleConv2d(90, 90, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False, decouple=False, decouple=False, quantize_a=False)
      (rep2): Replicate(decouple=False)
      (bn2): BatchNorm2d(90, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): AvgPool2d(kernel_size=2, stride=2, padding=0)
        (1): Conv2d(64, 90, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (2): BatchNorm2d(90, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (3): Replicate(decouple=False)
      )
    )
    (1): BasicBlock(
      (tanh1): TerneryHardTanh()
      (conv1): BinaryDecoupleConv2d(90, 90, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False, decouple=False, decouple=False, quantize_a=False)
      (rep1): Replicate(decouple=False)
      (bn1): BatchNorm2d(90, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (tanh2): TerneryHardTanh()
      (conv2): BinaryDecoupleConv2d(90, 90, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False, decouple=False, decouple=False, quantize_a=False)
      (rep2): Replicate(decouple=False)
      (bn2): BatchNorm2d(90, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Identity()
    )
  )
  (layer3): Sequential(
    (0): BasicBlock(
      (tanh1): TerneryHardTanh()
      (conv1): BinaryDecoupleConv2d(90, 180, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False, decouple=False, decouple=False, quantize_a=False)
      (rep1): Replicate(decouple=False)
      (bn1): BatchNorm2d(180, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (tanh2): TerneryHardTanh()
      (conv2): BinaryDecoupleConv2d(180, 180, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False, decouple=False, decouple=False, quantize_a=False)
      (rep2): Replicate(decouple=False)
      (bn2): BatchNorm2d(180, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): AvgPool2d(kernel_size=2, stride=2, padding=0)
        (1): Conv2d(90, 180, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (2): BatchNorm2d(180, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (3): Replicate(decouple=False)
      )
    )
    (1): BasicBlock(
      (tanh1): TerneryHardTanh()
      (conv1): BinaryDecoupleConv2d(180, 180, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False, decouple=False, decouple=False, quantize_a=False)
      (rep1): Replicate(decouple=False)
      (bn1): BatchNorm2d(180, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (tanh2): TerneryHardTanh()
      (conv2): BinaryDecoupleConv2d(180, 180, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False, decouple=False, decouple=False, quantize_a=False)
      (rep2): Replicate(decouple=False)
      (bn2): BatchNorm2d(180, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Identity()
    )
  )
  (layer4): Sequential(
    (0): BasicBlock(
      (tanh1): TerneryHardTanh()
      (conv1): BinaryDecoupleConv2d(180, 360, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False, decouple=False, decouple=False, quantize_a=False)
      (rep1): Replicate(decouple=False)
      (bn1): BatchNorm2d(360, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (tanh2): TerneryHardTanh()
      (conv2): BinaryDecoupleConv2d(360, 360, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False, decouple=False, decouple=False, quantize_a=False)
      (rep2): Replicate(decouple=False)
      (bn2): BatchNorm2d(360, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): AvgPool2d(kernel_size=2, stride=2, padding=0)
        (1): Conv2d(180, 360, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (2): BatchNorm2d(360, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (3): Replicate(decouple=False)
      )
    )
    (1): BasicBlock(
      (tanh1): TerneryHardTanh()
      (conv1): BinaryDecoupleConv2d(360, 360, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False, decouple=False, decouple=False, quantize_a=False)
      (rep1): Replicate(decouple=False)
      (bn1): BatchNorm2d(360, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (tanh2): TerneryHardTanh()
      (conv2): BinaryDecoupleConv2d(360, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False, decouple=False, decouple=False, quantize_a=False)
      (rep2): Replicate(decouple=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Identity()
        (1): OutChannelPad()
      )
    )
  )
  (avgpool): AvgPool2d(kernel_size=7, stride=7, padding=0)
  (relu2): ReLU(inplace=True)
  (fc): Linear(in_features=512, out_features=1000, bias=True)
)
```

mengjian0502 commented 3 years ago

Also, I forgot to mention: OutChannelPad is a special case of the shortcut, like the one you used (it zero-pads the channels to match the output size):

```lua
elseif type == 'last' then
   -- Strided, zero-padded identity shortcut
   return nn.Sequential()
      -- :add(nn.SpatialAveragePooling(1, 1, stride, stride))
      :add(nn.Concat(2)
         :add(nn.Identity())
         :add(nn.Sequential()
            :add(nn.MulConstant(0))
            :add(nn.Narrow(2,1,152))))
```
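In case a PyTorch version helps for comparison, the behaviour of that shortcut could be sketched like this; the class name matches the OutChannelPad in the printout above, but the body is just my guess at its behaviour (padding the 360-channel output up to 512 with 152 zero channels):

```python
import torch
import torch.nn as nn

class OutChannelPad(nn.Module):
    """Sketch of a zero-padded identity shortcut along the channel dimension,
    mirroring the nn.Concat / nn.MulConstant / nn.Narrow construction above."""

    def __init__(self, extra_channels: int = 152):
        super().__init__()
        self.extra_channels = extra_channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, _, h, w = x.shape
        zeros = x.new_zeros(n, self.extra_channels, h, w)
        return torch.cat([x, zeros], dim=1)   # e.g. 360 channels -> 360 + 152 = 512
```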

Hope my messages don't bother you too much! Thanks again for your help!

Hyungjun-K1m commented 3 years ago
  1. There is no BN layer between maxpool and layer1.
  2. The avgpool and relu2 layers are switched.

Other than these points, the model looks fine, provided the BinaryDecoupleConv2d and Replicate layers work correctly.
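In PyTorch terms, the two fixes would look roughly like the sketch below. This is only a sketch (module names and the placement of the extra BN are illustrative, not taken from the Torch code):

```python
import torch.nn as nn

# Point 1: a BatchNorm between maxpool and layer1 (missing from the printout above).
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),  # conv1
    nn.BatchNorm2d(64),                                                 # bn1
    nn.ReLU(inplace=True),                                              # relu1
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),                   # maxpool
    nn.BatchNorm2d(64),        # the extra BN right before layer1 (name up to you)
)

# Point 2: relu2 applied before the average pooling instead of after it.
head = nn.Sequential(
    nn.ReLU(inplace=True),                  # relu2
    nn.AvgPool2d(kernel_size=7, stride=7),  # avgpool
    nn.Flatten(),
    nn.Linear(512, 1000),                   # fc
)
```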

I recommend running the Torch code as-is and comparing its results with the PyTorch results for easier debugging.

Hyungjun.

mengjian0502 commented 3 years ago

Thank you so much for your help! I'll update the code with your suggestions and see how it goes:)