facebookresearch / VMZ

VMZ: Model Zoo for Video Modeling

Wrong number of midplanes in r(2+1)d model's downsampling basic blocks (layer 2,3,4) #89

Closed · daniel-j-h closed this 5 years ago

daniel-j-h commented 5 years ago

Hey folks, I'm studying your r(2+1)d models and here is an interesting observation:

You can see this difference in the provided pickle files if you manually pull out the appropriate blobs and check their tensor shapes. To make life easier, in the following I'll explain my observation in terms of the original resnet and the (2+1)d resnet architecture in PyTorch (ain't nobody got time for pickle+caffe2):
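For anyone who does want to poke at the raw blobs, here is a minimal sketch; it assumes the common Caffe2 checkpoint layout of a pickled dict with a "blobs" name-to-array mapping, and the filename is just a placeholder:

import pickle

# Placeholder filename; substitute one of the pickle files shipped with this repo.
with open("r2plus1d_model.pkl", "rb") as f:
    checkpoint = pickle.load(f, encoding="latin1")

# Assumed layout: {"blobs": {blob_name: numpy array, ...}}.
for name, blob in sorted(checkpoint["blobs"].items()):
    if hasattr(blob, "shape"):
        print(name, blob.shape)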

See how, in the first downsampling block of each layer in the video resnet, we pass on self.inplanes before updating it, and only then calculate the number of midplanes from it here.

In the Caffe2 models you provide (and in the corresponding pickle files), the number of midplanes is calculated based on the wrong number of planes.
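If I read the torchvision source right, its block-wise computation boils down to the one line below (sketched here as a standalone helper; the name block_midplanes is mine). midplanes is computed once per block, from the block's input and output planes, and reused for both (2+1)D convolutions in the block:

def block_midplanes(inplanes, planes):
    # Computed once per block in torchvision's video BasicBlock and reused
    # for both (2+1)D convolutions; inplanes is still the previous layer's
    # width in the first block of each layer.
    return (inplanes * planes * 3 * 3 * 3) // (inplanes * 3 * 3 + 3 * planes)

print(block_midplanes(64, 128))  # 230, not the 288 the Caffe2 blobs show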

Here are the first downsampling blocks of layer2, layer3, and layer4 from your model (note the number of planes in the first conv, in the batch norm, and the input planes of the last conv):

Conv2Plus1D(
  (0): Conv3d(128, 288, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
  (1): BatchNorm3d(288, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU(inplace=True)
  (3): Conv3d(288, 128, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
)

Conv2Plus1D(
  (0): Conv3d(256, 576, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
  (1): BatchNorm3d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU(inplace=True)
  (3): Conv3d(576, 256, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
)

Conv2Plus1D(
  (0): Conv3d(512, 1152, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
  (1): BatchNorm3d(1152, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU(inplace=True)
  (3): Conv3d(1152, 512, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
)

If you calculate midplanes in the downsampling blocks as per PyTorch (and the original resnet), you get the following (compare with the above):

Conv2Plus1D(
  (0): Conv3d(128, 230, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
  (1): BatchNorm3d(230, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU(inplace=True)
  (3): Conv3d(230, 128, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
)

Conv2Plus1D(
  (0): Conv3d(256, 460, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
  (1): BatchNorm3d(460, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU(inplace=True)
  (3): Conv3d(460, 256, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
)

Conv2Plus1D(
  (0): Conv3d(512, 921, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
  (1): BatchNorm3d(921, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU(inplace=True)
  (3): Conv3d(921, 512, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
)

This results in your models having many more planes in the layer2, layer3, and layer4 downsampling blocks compared to other implementations such as the PyTorch version.
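To make the gap concrete, here's a minimal sketch that reproduces both sets of numbers above (the helper name midplanes is mine): "layer-wise" recomputes the middle width from each conv's own input/output planes, which is what the Caffe2 blobs show, while "block-wise" reuses the value derived from the block's input planes, which is what torchvision does:

def midplanes(in_planes, out_planes, t=3, d=3):
    # Parameter-matching middle width for factorizing a t x d x d 3D conv
    # into a (1 x d x d) spatial + (t x 1 x 1) temporal pair.
    return (in_planes * out_planes * t * d * d) // (d * d * in_planes + t * out_planes)

# First downsampling block of layer2/3/4: (previous layer width, layer width).
for prev, width in [(64, 128), (128, 256), (256, 512)]:
    layer_wise = midplanes(width, width)  # 288, 576, 1152 -> Caffe2 blobs above
    block_wise = midplanes(prev, width)   # 230, 460,  921 -> torchvision
    print(width, layer_wise, block_wise)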

cc @fmassa @bjuncek for visibility, since you have worked on the torchvision implementation and trained these models. Side note: the r(2+1)d model here uses eps=1e-3 and momentum=0.9 for batch norm vs. 1e-5 and 0.1 in PyTorch; that might be another interesting difference when trying to reproduce results.
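One caveat on that side note: Caffe2's SpatialBN momentum is the weight on the running statistics, whereas PyTorch's momentum is the weight on the new batch statistics, so Caffe2's momentum=0.9 corresponds to PyTorch's momentum=0.1 and the eps values are the real discrepancy. A sketch of mirroring the Caffe2 settings in PyTorch (num_features=64 is just an example):

import torch.nn as nn

# Caffe2 SpatialBN:   running = 0.9 * running + 0.1 * batch   (momentum = 0.9)
# PyTorch BatchNorm:  running = (1 - m) * running + m * batch (momentum = m)
# so matching the Caffe2 models means eps=1e-3 and momentum=0.1.
bn = nn.BatchNorm3d(num_features=64, eps=1e-3, momentum=0.1)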

bjuncek commented 5 years ago

Hi @daniel-j-h

I actually think this might be my reimplementation error in torchvision, as I've implemented the PyTorch model to the best of my understanding of the paper, with some guidance from Du (note @dutran is the original author of the paper and the maintainer of this repo).

Effectively, the issue is that the original equation in the paper can be understood in two different ways: layer-wise (as it's implemented here) and block-wise (as it's implemented in torchvision). However, the performance difference is negligible.
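For reference, the equation in question (from the R(2+1)D paper, with temporal kernel size t = 3 and spatial kernel size d = 3) gives the number of middle planes as

M_i = \left\lfloor \frac{t \, d^2 \, N_{i-1} N_i}{d^2 N_{i-1} + t N_i} \right\rfloor

Read layer-wise, N_{i-1} and N_i are each convolution's own input and output widths; read block-wise, they are the block's input and output widths, evaluated once and reused for every convolution in the block.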

If you want a 100% correct re-implementation, you can simply replace the existing Conv2Plus1D module in PyTorch with:

import torch.nn as nn

class Conv2Plus1D(nn.Sequential):

    def __init__(self, in_planes, out_planes, midplanes, stride=1, padding=1):
        # Ignore the block-wise midplanes torchvision passes in and recompute
        # it layer-wise from this convolution's own input/output planes,
        # which is what the Caffe2 models in this repo do.
        midplanes = (in_planes * out_planes * 3 * 3 * 3) // (
                in_planes * 3 * 3 + 3 * out_planes)
        super(Conv2Plus1D, self).__init__(
            # (1 x 3 x 3) spatial convolution
            nn.Conv3d(in_planes, midplanes, kernel_size=(1, 3, 3),
                      stride=(1, stride, stride), padding=(0, padding, padding),
                      bias=False),
            nn.BatchNorm3d(midplanes),
            nn.ReLU(inplace=True),
            # (3 x 1 x 1) temporal convolution
            nn.Conv3d(midplanes, out_planes, kernel_size=(3, 1, 1),
                      stride=(stride, 1, 1), padding=(padding, 0, 0),
                      bias=False))

This solves all the issues.
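As a quick sanity check (the midplanes argument is ignored by the patched module, so any placeholder value works):

block = Conv2Plus1D(in_planes=128, out_planes=128, midplanes=None)
print(block)  # Conv3d(128, 288, ...) ... Conv3d(288, 128, ...), matching
              # the Caffe2 blob shapes quoted at the top of this issue.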


Note that this has, to an extent, been documented in #82 and #1265.

I'll send a PR to torchvision with the fix once I train all the models again :)