facebookresearch / VMZ

VMZ: Model Zoo for Video Modeling
Apache License 2.0

Conceptual Question on Video Bottleneck Models #82

Closed bjuncek closed 4 years ago

bjuncek commented 4 years ago

Hi @dutran et al,

I was wondering: is not including the block expansion in the spatio-temporal convolution by design?

Specifically, say VideoModelBuilder builds a model where layer_2 has 2 bottleneck blocks (for illustration purposes). Then the graph from VMZ looks like:

Bottleneck1:
  3D conv (in: 64, 1,1,1, out: 64)
  2+1D conv (
    in: 64, 1,3,3, out: 144
    in: 144, 3,1,1, out: 64
  )
  3D conv (in: 64, 1,1,1, out: 64*4)
Bottleneck2:
  3D conv (in: 64*4, 1,1,1, out: 64)
  2+1D conv (
    in: 64, 1,3,3, out: 144
    in: 144, 3,1,1, out: 64
  )
  3D conv (in: 64, 1,1,1, out: 64*4)

whereas according to the block formula, for the second bottleneck one would expect the middle layer to be expanded as well, i.e.

Bottleneck1:
  3D conv (in: 64, 1,1,1, out: 64)
  2+1D conv (
    in: 64, 1,3,3, out: 144
    in: 144, 3,1,1, out: 64
  )
  3D conv (in: 64, 1,1,1, out: 64*4)
Bottleneck2:
  3D conv (in: 64*4, 1,1,1, out: 64)
  2+1D conv (
    in: 64, 1,3,3, out: 177
    in: 177, 3,1,1, out: 64
  )
  3D conv (in: 64, 1,1,1, out: 64*4)

Note that the difference is in the separated conv layer, specifically in the midplanes computation.

The original paper specifies the midplanes formula, but it's ambiguous whether N_i refers to blocks or to individual layers.
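
For concreteness, here is a minimal sketch of the two readings (plain Python; the helper name and the t, d defaults are mine, the formula itself is the one from the paper and the VMZ code):

import math

def midplanes(n_prev, n_cur, t=3, d=3):
    # M_i = floor(t * d^2 * N_{i-1} * N_i / (d^2 * N_{i-1} + t * N_i))
    return math.floor(t * d * d * n_prev * n_cur / (d * d * n_prev + t * n_cur))

# reading 1: N_{i-1} taken as the compressed width inside the block (what VMZ produces)
print(midplanes(64, 64))   # 144
# reading 2: N_{i-1} taken as the expanded input of the second bottleneck (64 * 4)
print(midplanes(256, 64))  # 177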

Just as a side note, I found that models with the block expansion accounted for get approx. 0.9% better results on Kinetics with R(2+1)D-50, but the setup was different from the one specified in the paper, so it's not really an apples-to-apples comparison.

Cheers, Bruno

dutran commented 4 years ago

Which model exactly are you referring to? The first example you provide is not the one we implemented, and neither is the second.

In our bottleneck blocks we always have something like the following; let's start with the 3D case:

1x1x1, input_dim, 4*n
3x3x3, 4*n, n
1x1x1, n, 4*n

Note that this is the design from ResNet [He et al., CVPR'16], where the 1x1 filters are about 4x more than the middle 3x3 layer, which is why it is named a bottleneck. The above is a naive extension to the 3D case. When we move to (2+1)D, we design a (2+1)D block to match the number of parameters of the 3x3x3 3D conv, and it becomes:

1x1x1, input_dim, 4*n
1x3x3, 4*n, M_i
3x1x1, M_i, n
1x1x1, n, 4*n

So essentially, the (2+1)D block is designed to replace the 3x3x3 conv at the same parameter/FLOPs cost; obviously it has increased memory overhead. M_i is specified in the paper as well as in the code. We can stack many of these bottleneck blocks, depending on the network depth. There are cases where the filter dimensions are doubled when you move from conv_2x to conv_3x and so on.
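
As a quick sanity check on the parameter matching (a back-of-the-envelope sketch only; the helper names are illustrative, BN and biases are ignored):

def conv3d_params(n_in, n_out, t=3, d=3):
    # weights of a full t x d x d 3D conv
    return t * d * d * n_in * n_out

def sep_conv_params(n_in, n_out, m, t=3, d=3):
    # weights of the 1 x d x d spatial conv plus the t x 1 x 1 temporal conv
    return d * d * n_in * m + t * m * n_out

n_in, n_out = 64, 64
m = 3 * 9 * n_in * n_out // (9 * n_in + 3 * n_out)  # M_i = 144
print(conv3d_params(n_in, n_out))       # 110592
print(sep_conv_params(n_in, n_out, m))  # 110592, same cost as the 3x3x3 conv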

Hope this helps.

bjuncek commented 4 years ago

Note that this is the design from ResNet [He et al., CVPR'16], where the 1x1 filters are about 4x more than the middle 3x3 layer, which is why it is named a bottleneck.

Looking at the official implementation [1], I don't think that's correct; specifically, the naive bottleneck looks like the following:

1x1x1, in_dim, n
3x3x3, n, n
1x1x1, n, 4*n

If you look at the output of that code (omitting BN and ReLUs), the middle bottleneck of layer 1 in the naive ResNet-152 looks something like the following:

(conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)

But this is beside the point. Even if the implementation is as you've written it, there is a misunderstanding about how M_i is actually calculated. Specifically, the paper specifies the formula

M_i = floor( t * d^2 * N_{i-1} * N_i / (d^2 * N_{i-1} + t * N_i) )

Since we know that t = d = 3, the question is what N_i and N_{i-1} actually are. If we follow your example, we have N_{i-1} = 256 and N_i = 64, and the equation becomes

M_i = floor( 3 * 9 * 256 * 64 / (9 * 256 + 3 * 64) ) = floor( 442368 / 2496 ) = 177

But the actual M_i that I see in the weight dimensions is 144, which corresponds to N_i = N_{i-1} = 64; this means that either way the bottleneck multiplier is not considered in the computation of M_i. Is that something we should follow in principle for our reimplementation?
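
For reference, the two options would look roughly like this in a reimplementation (a PyTorch-style sketch with made-up names, not the VMZ code; strides, BN and ReLU omitted):

import math
import torch.nn as nn

def midplanes(n_prev, n_cur, t=3, d=3):
    return math.floor(t * d * d * n_prev * n_cur / (d * d * n_prev + t * n_cur))

def separated_conv(planes, block_in_planes=None):
    # The separated conv always maps planes -> M -> planes (64 -> M -> 64 here);
    # the only question is which N_{i-1} feeds the midplanes formula:
    #   block_in_planes=None -> M = midplanes(64, 64)  = 144 (what VMZ does)
    #   block_in_planes=256  -> M = midplanes(256, 64) = 177 (block expansion accounted for)
    m = midplanes(planes if block_in_planes is None else block_in_planes, planes)
    return nn.Sequential(
        nn.Conv3d(planes, m, kernel_size=(1, 3, 3), padding=(0, 1, 1), bias=False),
        nn.Conv3d(m, planes, kernel_size=(3, 1, 1), padding=(1, 0, 0), bias=False),
    )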

dutran commented 4 years ago

@bjuncek

  1. Ah, you're right. The original ResNet paper uses the expanded bottom (the expansion is at the output of the block). If I recall correctly, all of the experiments in our CVPR'18 paper use the simple block, so we should not have any mismatch.

  2. For the (2+1)D conv, it is implemented here: https://github.com/facebookresearch/VMZ/blob/master/lib/models/builder/video_model.py#L65-L94. Pasting the code here:

    i = 3 * in_filters * out_filters * kernels[1] * kernels[2]
    i /= in_filters * kernels[1] * kernels[2] + 3 * out_filters
    middle_filters = int(i)

    It is exactly the formula, except that here I hard-coded t = 3 (e.g. with in_filters = out_filters = 64 and kernels = [3, 3, 3], this gives middle_filters = 144).

To conclude, it is pretty much your design choice how many filters you want to use for those layers. My suggestions are:

1) If you want to do ablation experiments, e.g. compare 3D vs. (2+1)D, then whether you use expanded filters or not should not be an issue, as long as ResNet3D and R(2+1)D have the same number of filters (assuming we adjust the middle filters to match the params and FLOPs of the (2+1)D conv with the 3D conv).

2) If you want to convert the models from VMZ, then please follow the number of filters provided in the pickle models.

3) If you want to train from scratch, expanded or not, it's again your design choice. It is no surprise to me that expanded filters give a 0.9% improvement; note that it comes with more params, FLOPs, and memory (a rough back-of-the-envelope comparison is sketched below).
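
For the conv_2 example above, the extra cost of the expanded middle filters is roughly the following (counting only the separated-conv weights; an estimate, not measured from the released models):

def separated_conv_weights(m, planes=64, t=3, d=3):
    # 1x3x3 (planes -> m) plus 3x1x1 (m -> planes) weights
    return d * d * planes * m + t * m * planes

print(separated_conv_weights(144))  # 110592 (midplanes without block expansion)
print(separated_conv_weights(177))  # 135936 (with block expansion, roughly 23% more)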

bjuncek commented 4 years ago

Sounds good, thanks for the clarification.