TengdaHan / CoCLR

[NeurIPS'20] Self-supervised Co-Training for Video Representation Learning. Tengda Han, Weidi Xie, Andrew Zisserman.
Apache License 2.0

Question regarding the supervised result in table 1 #2

Closed liuhualin333 closed 3 years ago

liuhualin333 commented 3 years ago

Hi Tengda,

Nice work and thanks for sharing the code! I have a question regarding the results in Table 1 of the CoCLR paper. I notice that supervised training with RGB input on the S3D-G architecture on UCF101 yields 77.0% top-1 accuracy. I have run similar supervised training experiments on the 2d3d network (the MemDPC one) without initializing any weights (such as ImageNet 2D weights), but I encounter a serious overfitting issue and can only get 40+% top-1 accuracy on UCF101. So this result seems unusually high to me. I wonder whether you initialize from other weights or train from scratch. If you train from scratch, have you encountered any overfitting issues with the S3D-G architecture?

By the way, the overfitting issue with 3D ResNets on small video datasets is documented in this paper: https://openaccess.thecvf.com/content_cvpr_2018/html/Hara_Can_Spatiotemporal_3D_CVPR_2018_paper.html

Looking forward to your reply.

Best Regards, Hualin

TengdaHan commented 3 years ago

Hi! 3D networks overfit very easily when trained on small datasets like UCF101 and HMDB51. That's why I use aggressive dropout: https://github.com/TengdaHan/MemDPC/blob/master/eval/test.py#L36. I keep the same setting everywhere when finetuning on small datasets.
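For readers without the linked file at hand, the idea can be sketched roughly as follows: a classification head that applies a high dropout rate to the pooled backbone features before the final linear layer. The class name, feature dimension, and the 0.9 dropout rate here are illustrative assumptions, not values taken verbatim from the repository:

```python
import torch
import torch.nn as nn

class FinetuneHead(nn.Module):
    """Hypothetical finetuning head: aggressive dropout on the pooled
    features to curb overfitting on small datasets like UCF101.
    Names and the 0.9 rate are illustrative, not from the repo."""
    def __init__(self, feature_dim=1024, num_classes=101, dropout=0.9):
        super().__init__()
        self.bn = nn.BatchNorm1d(feature_dim)   # normalize pooled features
        self.dropout = nn.Dropout(dropout)      # zero out 90% of units in training
        self.fc = nn.Linear(feature_dim, num_classes)

    def forward(self, feat):
        # feat: (batch, feature_dim), e.g. globally pooled S3D-G features
        return self.fc(self.dropout(self.bn(feat)))

head = FinetuneHead()
head.eval()  # dropout is disabled at test time
logits = head(torch.randn(4, 1024))
print(logits.shape)  # (4, 101)
```

Dropout is only active in training mode; at evaluation time `nn.Dropout` is an identity, so the reported test accuracy is unaffected by the rate itself.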

liuhualin333 commented 3 years ago

Thanks for the reply!