TengdaHan / MemDPC

[ECCV'20 Spotlight] Memory-augmented Dense Predictive Coding for Video Representation Learning. Tengda Han, Weidi Xie, Andrew Zisserman.
Apache License 2.0

About trained model weight #6

Closed RERV closed 3 years ago

RERV commented 4 years ago

Hi, I notice that you released two pretrained weights. However, it still takes a long time to finetune. Could you release the final trained model weights so we can easily test on them? Thanks!

TengdaHan commented 4 years ago

Hi! Sorry for the late reply. I still recommend you finetune on your own version of dataset/settings because a tiny difference could reduce the final percentage and it's hard to debug. But if you really need one, here it is: http://www.robots.ox.ac.uk/~htd/memdpc/ft_ucf101_224_resnet34_memdpc.pth.tar

chenbiaolong commented 3 years ago

@TengdaHan did you filter out videos that are too short in test mode? What's the accuracy of the checkpoint (ft_ucf101_224_resnet34_memdpc.pth.tar) you provided? I just refactored your code and want to check whether my implementation is right.

TengdaHan commented 3 years ago

This exact checkpoint gets:

CenterCrop: Acc@1: 0.7673 Acc@5: 0.9280
FiveCrop:   Acc@1: 0.7801 Acc@5: 0.9323
TenCrop:    Acc@1: 0.7811 Acc@5: 0.9366

It's also the 78.1% reported in Table 2 of the paper. Small variations are possible after you refactor the code.
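The three rows above correspond to evaluating one, five, or ten crops per clip and averaging. A minimal sketch of that multi-crop averaging, using a hypothetical `multicrop_predict` helper and a toy model (not the repo's actual evaluation code):

```python
import torch
import torch.nn as nn

def multicrop_predict(model, crops):
    """Hypothetical helper: crops of shape (num_crops, C, T, H, W)
    -> logits averaged over crops, shape (num_classes,)."""
    with torch.no_grad():
        logits = model(crops)      # (num_crops, num_classes)
    return logits.mean(dim=0)

# Toy stand-in classifier; the real model is a 2D3D-ResNet + linear head.
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 4 * 8 * 8, 101))
ten_crops = torch.randn(10, 3, 4, 8, 8)   # TenCrop -> 10 views of one clip
avg_logits = multicrop_predict(toy_model, ten_crops)
print(avg_logits.shape)   # torch.Size([101])
```

Averaging logits over crops is what typically produces the small Acc@1 gains from CenterCrop to FiveCrop to TenCrop.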

I didn't filter out videos that are too short; I pad short videos by repeating their last frame up to the required length. But this actually doesn't affect the results much.
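The last-frame padding described above can be sketched as follows; `pad_to_length` is a hypothetical helper for illustration, not the repo's implementation:

```python
import torch

def pad_to_length(frames, target_len):
    """Pad a short clip by repeating its last frame.

    frames: tensor of shape (T, C, H, W). Clips longer than
    target_len are truncated; shorter ones are padded.
    """
    t = frames.shape[0]
    if t >= target_len:
        return frames[:target_len]
    # Repeat the final frame (target_len - t) times and append it.
    pad = frames[-1:].expand(target_len - t, *frames.shape[1:])
    return torch.cat([frames, pad], dim=0)

clip = torch.randn(5, 3, 224, 224)   # only 5 frames available
padded = pad_to_length(clip, 8)
print(padded.shape)                   # torch.Size([8, 3, 224, 224])
```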

chenbiaolong commented 3 years ago

@TengdaHan thank you for your reply. I found a tiny difference between model_lc.py and model.py; which one is the right version? The train and eval implementations are not equivalent.

TengdaHan commented 3 years ago

I modified the 2D3D-ResNet backbone so that the output feature is taken before the final ReLU: https://github.com/TengdaHan/MemDPC/blob/11f03299496c55d3ecae670752e958d8ce0c80fb/backbone/resnet_2d3d.py#L251 This is because during pretraining we contrast predicted features (with a range of (-inf, inf)) against ground-truth features, and removing the final ReLU keeps the ground-truth features in a (-inf, inf) range as well. Although others (MoCo, SimCLR, etc.) use a prediction head to avoid this scaling issue, we didn't use a prediction head for the ground-truth features in this paper.
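A minimal sketch of this range-matching idea, using a toy backbone in place of the 2D3D-ResNet (the flag name and module are illustrations, not the repo's API):

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Toy stand-in for the backbone trunk; illustration only."""
    def __init__(self, in_dim=8, feat_dim=16):
        super().__init__()
        self.trunk = nn.Linear(in_dim, feat_dim)  # placeholder for the conv stack
        self.relu = nn.ReLU()

    def forward(self, x, apply_final_relu):
        feat = self.trunk(x)
        # Pretraining: skip the final ReLU so ground-truth features keep a
        # (-inf, inf) range, matching the range of the predicted features.
        return self.relu(feat) if apply_final_relu else feat

net = TinyBackbone()
x = torch.randn(4, 8)
pretrain_feat = net(x, apply_final_relu=False)  # may contain negative values
eval_feat = net(x, apply_final_relu=True)       # clamped to [0, inf)
assert (eval_feat >= 0).all()
```

Without this change, the ground-truth side would be clamped to [0, inf) while the predicted side is unbounded, making the contrastive targets systematically mismatched in range.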

In the evaluation stage for the action classification task, we add the ReLU back, followed by a linear layer. Nothing special.
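That evaluation head can be sketched as below; the class name and dimensions are hypothetical, chosen only to illustrate "ReLU back, then a linear layer":

```python
import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    """Hypothetical sketch of the evaluation head: restore the final ReLU,
    then apply a single linear layer over the backbone feature."""
    def __init__(self, feat_dim=256, num_classes=101):
        super().__init__()
        self.relu = nn.ReLU()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, feat):
        # feat: pre-ReLU backbone output, shape (B, feat_dim)
        return self.fc(self.relu(feat))

head = ActionClassifier()
logits = head(torch.randn(2, 256))
print(logits.shape)   # torch.Size([2, 101])
```

Here 101 matches the number of UCF101 classes; the feature dimension depends on the backbone.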

Let me know if anything is unclear.