Question on mismatch with the original paper

Tushar-N / pytorch-resnet3d

I3D Nonlocal ResNets in Pytorch

Question on mismatch with the original paper #6

Closed AlexHu123 closed 5 years ago

AlexHu123 commented 5 years ago

Hi, I tested your ResNet-50 baseline on 19374 videos, but the accuracy I get is [clip: 64.7, video: 72.1], while the original repo reports 73.6. Do you know what causes the gap? Thanks in advance!

Tushar-N commented 5 years ago

It's probably a combination of the 700 missing videos + the tiny mismatches in the activation values when converting from caffe2 to pytorch. If you run python -m utils.layer_by_layer --model r50 you can get a layer-by-layer comparison of the activations as a sanity check for what the differences might be. I don't know the exact inconsistencies in the way the operations are implemented in each framework.
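As a rough illustration of what such a layer-by-layer check does, here is a minimal sketch that reports the largest absolute difference per layer between two sets of activations. It is not the repo's utils.layer_by_layer script: the function names, the layer names, and the flat-list representation of activations are all assumptions made for the example.

```python
def max_abs_diff(ref_values, other_values):
    """Largest element-wise absolute difference between two flat lists."""
    if len(ref_values) != len(other_values):
        raise ValueError("activation sizes differ")
    return max(abs(r - o) for r, o in zip(ref_values, other_values))


def compare_activations(ref_acts, other_acts):
    """Map each layer name to its max absolute activation difference.

    Both arguments are dicts of {layer_name: flat list of floats}.
    The layer names used below are hypothetical.
    """
    return {
        name: max_abs_diff(values, other_acts[name])
        for name, values in ref_acts.items()
    }


# Toy activations standing in for the two frameworks' outputs.
caffe2_acts = {"conv1": [1.0, 2.0, 3.0], "res2": [0.5, 0.25]}
pytorch_acts = {"conv1": [1.0, 2.5, 3.0], "res2": [0.5, 0.25]}

report = compare_activations(caffe2_acts, pytorch_acts)
```

A layer whose difference is much larger than its neighbours' is a reasonable place to start looking for an operation that the two frameworks implement differently.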

AlexHu123 commented 5 years ago

Hey, I found something that differs from your implementation: if you take 64 consecutive frames and then keep every other frame (32 frames in total), the accuracy improves!

AlexHu123 commented 5 years ago

Thanks for your reply and really good work!

Tushar-N commented 5 years ago

Sounds like your frames may have been sampled at 60fps (instead of the 30fps that the network was trained with). Glad you could resolve your issue! Before I close this, could you comment on this issue with the numbers you finally got, as a reference for others?

AlexHu123 commented 5 years ago

I extract frames at 30 fps. Then, for instance, I choose 64 consecutive frames from them (frame 1, frame 2, ..., frame 64). But when I feed them into the network, I use frame 1, frame 3, frame 5, ..., frame 63. (That gives the 32 frames.)
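The sampling described above can be sketched as follows. This is only an illustration of the indexing, assuming frames are identified by 1-based numbers; `sample_clip` and its parameters are hypothetical names, not functions from this repo.

```python
def sample_clip(frame_numbers, start=0, span=64, stride=2):
    """Take `span` consecutive frames, then keep every `stride`-th one.

    `frame_numbers` is the list of frame numbers extracted at 30 fps.
    With span=64 and stride=2 this yields 32 frames.
    """
    window = frame_numbers[start:start + span]
    if len(window) < span:
        raise ValueError("not enough frames for a full window")
    return window[::stride]


# Frames 1..300 of a hypothetical video extracted at 30 fps.
frames = list(range(1, 301))
clip = sample_clip(frames)  # frames 1, 3, 5, ..., 63
```

The 32 frames fed to the network therefore cover 64 frames of the video, roughly 2 seconds at 30 fps.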

AlexHu123 commented 5 years ago

By the way, have you tested 2d-tsn-resnet50 from the non-local repo? Also, I don't know how to convert its weights.

Tushar-N commented 5 years ago

> I extract frames at 30 fps. Then, for instance, I choose 64 consecutive frames from them (frame 1, frame 2, ..., frame 64). But when I feed them into the network, I use frame 1, frame 3, frame 5, ..., frame 63. (That gives the 32 frames.)

Got it! I meant the final accuracy numbers you get using the 19k+ validation videos (since I only tested with 18k). Does it match the 73.6%?

> By the way, have you tested 2d-tsn-resnet50 from the non-local repo? Also, I don't know how to convert its weights.

No I haven't tested it. It's likely that the 2D resnet architecture is almost identical to the torchvision resnet models, so the weight conversion should be similar to what is done for the i3d models -- just a simple renaming of the keys.
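A minimal sketch of what "a simple renaming of the keys" could look like, using plain dicts in place of real weight tensors. The source-side names in `key_map` are hypothetical placeholders, not the actual blob names from the Caffe2 checkpoint; the target names follow the torchvision ResNet naming (`conv1.weight`, `fc.weight`, `fc.bias`). The real mapping would have to be read off the checkpoint.

```python
def rename_keys(source_weights, key_map):
    """Return a new dict whose keys are renamed according to `key_map`.

    Raises KeyError for any source key with no mapping, so nothing is
    silently dropped during conversion.
    """
    converted = {}
    for old_key, value in source_weights.items():
        if old_key not in key_map:
            raise KeyError(f"no mapping for source key: {old_key}")
        converted[key_map[old_key]] = value
    return converted


# Hypothetical source names mapped to torchvision-style names.
key_map = {
    "conv1_w": "conv1.weight",
    "pred_w": "fc.weight",
    "pred_b": "fc.bias",
}

# Placeholder values; a real checkpoint would hold arrays or tensors.
source_weights = {"conv1_w": [0.1], "pred_w": [0.2], "pred_b": [0.3]}

converted = rename_keys(source_weights, key_map)
```

Failing loudly on an unmapped key is a deliberate choice here: a missing layer would otherwise only show up later as a drop in accuracy.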

AlexHu123 commented 5 years ago

Hi, sorry for the delay! I tested the 3d-resnet-50 baseline and found that the clip-level accuracy increases from 64.8% to 67.3%.

AlexHu123 commented 5 years ago

I would also be grateful if you plan to release a converted 2d-resnet50 model.