Closed by AlexHu123 5 years ago
It's probably a combination of the 700 missing videos and the tiny mismatches in the activation values that appear when converting from Caffe2 to PyTorch. If you run python -m utils.layer_by_layer --model r50
you can get a layer-by-layer comparison of the activations as a sanity check for where the differences might come from. I don't know exactly how the implementations of the operations differ between the two frameworks.
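For anyone who wants to see what such a comparison boils down to, here is a minimal sketch of a per-layer maximum absolute difference. It is not what utils.layer_by_layer does internally; the function names, layer names, and activation values below are all made up for illustration.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length lists."""
    if len(a) != len(b):
        raise ValueError("activation shapes differ")
    return max(abs(x - y) for x, y in zip(a, b))


def compare_layers(reference, other):
    """Map each layer name present in both dicts to its max absolute difference."""
    return {
        name: max_abs_diff(reference[name], other[name])
        for name in reference
        if name in other
    }


# Hypothetical flattened activations from the two frameworks (illustrative values).
caffe2_acts = {"conv1": [0.5, 1.25], "res2": [2.0, -1.0]}
pytorch_acts = {"conv1": [0.5, 1.25], "res2": [2.0, -0.75]}

report = compare_layers(caffe2_acts, pytorch_acts)
for layer, diff in report.items():
    print(layer, diff)
```

The first layer where the difference jumps is usually the place to start looking.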
Hey, I've found something that differs from your implementation: if you extract 64 consecutive frames and then keep every other frame to get 32 frames, the accuracy improves!
Thanks for your reply and really good work!
Sounds like your frames may have been sampled at 60fps (instead of the 30fps that the network was trained with). Glad you could resolve your issue! Before I close this, could you comment on this issue with the numbers you finally got, as a reference for others?
I extract frames at 30 fps. After that, I choose, for instance, 64 consecutive frames from them (frame 1, frame 2, ..., frame 64). But when I feed them into the network, I use frame 1, frame 3, frame 5, ..., frame 63. (That gives the 32 frames.)
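The sampling described above can be sketched roughly as follows. This is only an illustration of the 64-frame window with stride 2, not the repo's data loader; sample_clip and its parameters are hypothetical names.

```python
def sample_clip(frame_ids, start=0, span=64, stride=2):
    """Take `span` consecutive frames beginning at index `start`,
    then keep every `stride`-th frame of that window."""
    window = frame_ids[start:start + span]
    if len(window) < span:
        raise ValueError("not enough frames for the requested span")
    return window[::stride]


# 1-based frame numbers of a hypothetical 10-second video extracted at 30 fps.
frames = list(range(1, 301))

clip = sample_clip(frames)
print(len(clip), clip[:3], clip[-1])  # 32 [1, 3, 5] 63
```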
By the way, have you tested 2d-tsn-resnet50 in the non-local repo? And I don't know how to convert weights.
Got it! I meant the final accuracy numbers you get using the 19k+ validation videos (since I only tested with 18k). Does it match the 73.6%?
No, I haven't tested it. It's likely that the 2D resnet architecture is almost identical to the torchvision resnet models, so the weight conversion should be similar to what is done for the i3d models: just a simple renaming of the keys.
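A key-renaming conversion of that kind could look roughly like the sketch below. The source key names, target key names, and renaming rules are placeholders invented for illustration; the real mapping has to be read off the actual checkpoint and the target model's state dict.

```python
def rename_keys(state, rules):
    """Return a copy of `state` whose keys have each (old, new) substring
    rule applied in order. Values are carried over unchanged."""
    renamed = {}
    for key, value in state.items():
        new_key = key
        for old, new in rules:
            new_key = new_key.replace(old, new)
        if new_key in renamed:
            raise KeyError(f"two source keys map to {new_key!r}")
        renamed[new_key] = value
    return renamed


# Hypothetical source weights; real ones would be arrays, not short lists.
source = {
    "res2_0_branch2a_w": [0.1, 0.2],
    "res2_0_branch2a_bn_s": [1.0, 1.0],
}

# Hypothetical rules, applied in order (the more specific suffix comes first).
rules = [
    ("res2_0_", "layer1.0."),
    ("branch2a_bn_s", "bn1.weight"),
    ("branch2a_w", "conv1.weight"),
]

converted = rename_keys(source, rules)
print(sorted(converted))
```

After renaming, comparing the converted key set against the target model's state dict keys is a quick way to spot anything that was missed.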
Hi, sorry for the delay! I tested the 3d-resnet-50 baseline and found that the clip-level accuracy increases from 64.8% to 67.3%.
And I would be thankful if you have plans to release a converted 2d-resnet50 model.
Hi, I have tested your ResNet-50 baseline with 19374 videos, but I still get accuracy [clip: 64.7, video: 72.1], while the original repo reaches 73.6. Can you figure out what causes this? Thanks in advance!