facebookresearch / video-nonlocal-net

Non-local Neural Networks for Video Classification

Copying pre-trained weights to pytorch model #65

Closed. patykov closed this issue 5 years ago.

patykov commented 5 years ago

Hi! I'm working on a PyTorch implementation of your work, but my results are far below the expected ones. I've created a PyTorch version of the i3d_nonlocal model and copied the pre-trained weights from the file "i3d_nonlocal_32x2_IN_pretrain_400k.pkl" into it. I've double-checked all the layers and they seem to be equal. I'm using the Kinetics dataset you've provided and I'm applying normalization (mean, std). However, the results I'm getting for the validation set in the fully convolutional eval are 48%/75% (top1/top5).

I was wondering if anyone has an idea of what I'm doing wrong. Maybe copying the weights to PyTorch isn't as simple as mapping:

conv1_w --> conv1.weight
res_conv1_bn_b --> bn1.bias

and so on? I've checked the RGB/BGR input difference between the frameworks, but it didn't help.
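
For reference, this is roughly how I'm doing the copy. It's only a sketch: I'm assuming the Detectron-style pickle layout (a 'blobs' dict of numpy arrays), the name map shows just a few of my entries (the BN running statistics need entries too), and `model` stands for my PyTorch network:

    import pickle

    import torch

    # Load the Caffe2 blobs from the released checkpoint.
    with open('i3d_nonlocal_32x2_IN_pretrain_400k.pkl', 'rb') as f:
        blobs = pickle.load(f, encoding='latin1')['blobs']

    # Hand-written Caffe2 -> PyTorch name map (only a few entries shown;
    # '_s' is the BN scale and '_b' the BN bias in Caffe2 naming).
    name_map = {
        'conv1_w': 'conv1.weight',
        'res_conv1_bn_s': 'bn1.weight',
        'res_conv1_bn_b': 'bn1.bias',
    }

    # `model` is my PyTorch i3d_nonlocal network.
    state_dict = {pt: torch.from_numpy(blobs[c2]) for c2, pt in name_map.items()}
    model.load_state_dict(state_dict, strict=False)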

HaiyiMei commented 5 years ago

Hey there! Have you figured out why it's happening? I'm facing the same issue.

patykov commented 5 years ago

Hi! Actually I did; it was a combination of factors, really. First, I forgot to put the model in evaluation mode (model.eval()), so BatchNorm wasn't using its running mean/var during evaluation (my bad!). Second, since we are copying the pre-trained weights, the non-local block operations must be exactly the same as the authors', and the common PyTorch implementations of the non-local block found online differ slightly from theirs. You should add a 'scale' operation, as done here: https://github.com/facebookresearch/video-nonlocal-net/blob/c253ed5eb004fa2cae0490ed46f5018a3c3b060f/lib/models/nonlocal_helper.py#L86 , following section 3.2.1 of https://arxiv.org/pdf/1706.03762.pdf . So in the forward function of my non-local block I add:

...
        f = torch.matmul(theta_x, phi_x)       # pairwise similarities theta(x)^T . phi(x)
        f_sc = f * (self.inter_channels**-.5)  # scale by 1/sqrt(C'); https://arxiv.org/pdf/1706.03762.pdf section 3.2.1
        f_div_C = F.softmax(f_sc, dim=-1)      # normalized attention weights
...

And finally, you should move the max-pool layer so that it is applied before the phi and g operations:

self.g = nn.Sequential(max_pool_layer, self.g)
self.phi = nn.Sequential(max_pool_layer, self.phi)
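
Putting it all together, my block ends up looking roughly like this. This is only a sketch of an embedded-Gaussian non-local block with the two fixes applied; the channel sizes are illustrative, and I'm omitting the BatchNorm that the authors place after the final 1x1x1 conv:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NonLocalBlock3D(nn.Module):
        """Embedded-Gaussian non-local block with the two fixes above."""

        def __init__(self, in_channels, inter_channels=None):
            super().__init__()
            self.inter_channels = inter_channels or in_channels // 2
            self.theta = nn.Conv3d(in_channels, self.inter_channels, kernel_size=1)
            phi = nn.Conv3d(in_channels, self.inter_channels, kernel_size=1)
            g = nn.Conv3d(in_channels, self.inter_channels, kernel_size=1)
            # Fix 2: subsample with max-pooling *before* phi and g.
            max_pool_layer = nn.MaxPool3d(kernel_size=(1, 2, 2))
            self.phi = nn.Sequential(max_pool_layer, phi)
            self.g = nn.Sequential(max_pool_layer, g)
            self.W = nn.Conv3d(self.inter_channels, in_channels, kernel_size=1)

        def forward(self, x):
            # x: (N, C, T, H, W)
            n = x.size(0)
            theta_x = self.theta(x).view(n, self.inter_channels, -1)  # N x C' x THW
            theta_x = theta_x.permute(0, 2, 1)                        # N x THW x C'
            phi_x = self.phi(x).view(n, self.inter_channels, -1)      # N x C' x T(HW/4)
            g_x = self.g(x).view(n, self.inter_channels, -1)
            g_x = g_x.permute(0, 2, 1)                                # N x T(HW/4) x C'

            f = torch.matmul(theta_x, phi_x)
            # Fix 1: scale before the softmax (section 3.2.1 of the Transformer paper).
            f_div_C = F.softmax(f * (self.inter_channels ** -0.5), dim=-1)

            y = torch.matmul(f_div_C, g_x)                            # N x THW x C'
            y = y.permute(0, 2, 1).contiguous()
            y = y.view(n, self.inter_channels, *x.shape[2:])
            return x + self.W(y)  # residual connection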

These modifications solved the problem for me. Hope it helps!

HaiyiMei commented 5 years ago

Awesome! This definitely helps me a lot, much appreciated! Also, I saw you starred this repo. This is the baseline code you use, right? I ran the Kinetics validation set through this code and its weights, and I got an accuracy about 10% lower than reported. (I've tried several other PyTorch I3D repos; none of them work. It's driving me crazy 😭) Have you run the validation set with this code? Or is there something I need to pay attention to in the video-to-jpg step, the data preprocessing (torchvision.transforms), or something else? Thanks a lot!

HaiyiMei commented 5 years ago

Hey! I figured it out! @patykov It is because of the function glob.glob(). After calling glob, the frames should be sorted, e.g. frames.sort(). Otherwise the frames come back shuffled (in filesystem order, not sorted in time).
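
In other words, in the frame-loading code (the path here is just an example):

    import glob

    # glob.glob() makes no ordering guarantee, so frames can come back shuffled.
    frame_paths = glob.glob('/path/to/kinetics_val/video_x/*.jpg')
    frame_paths.sort()  # lexicographic sort restores temporal order when frame
                        # filenames use zero-padded indices (e.g. frame_000001.jpg)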

AlexHu123 commented 5 years ago

@HaiyiMei @patykov I have seen your discussion. Would you mind releasing your PyTorch implementations of the 2D-ResNet and 3D-ResNet baselines?

HaiyiMei commented 5 years ago

@AlexHu123 Hey! I think this (3D) and this (2D) are what you need.

AlexHu123 commented 5 years ago

@HaiyiMei Thanks for your reply, I will check them~

AlexHu123 commented 5 years ago

@HaiyiMei Hi, how did you transfer the 2D ResNet-50 weights into PyTorch? Have you tested the 2D ResNet-50 model using PyTorch?

HaiyiMei commented 5 years ago

@AlexHu123 Sorry, I didn't test the 2D ResNet-50 on Kinetics. I have only used the pretrained weights provided by torchvision for image classification before.