Closed: June01 closed this issue 3 years ago.
90.6 means [self-supervised pretraining on K400 (no labels)] -> [finetune on UCF101]. 96.8 means [supervised pretraining on K400 (with labels)] -> [finetune on UCF101]. The meaningful random-init baseline is the 77.0 in our Table 1, meaning [no pretraining] -> [finetune on UCF101].
Hi, thanks for the answer. I checked [1] again and found that, in fully supervised learning, I3D trained from scratch (no pretraining) achieves 88.8% (Table 4) with RGB only. I am wondering what causes such a big gap between 77.0 and 88.8?
[1] Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
I see! Appreciate your help! I am also wondering: are there any ImageNet-pretrained weights available for S3D? Many papers claim to use them.
S3D is a 3D CNN, usually used for video input, so I don't think people train S3D on ImageNet directly. If a paper claims to use S3D with ImageNet weights, I guess that means "ImageNet-inflated" weights, as in I3D: they take the 2D network's ImageNet weights and expand each convolution kernel by copying it multiple times along the time dimension.
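The "inflation" recipe described above can be sketched in a few lines. This is a hedged illustration, not code from this repo: the function name is made up, but the operation follows the I3D idea of copying a 2D kernel along a new time axis and dividing by the temporal size, so that a temporally constant input yields the same activations as the original 2D convolution.

```python
import numpy as np

def inflate_conv2d_to_3d(w2d, time_dim):
    """Inflate a 2D conv kernel of shape (out, in, kH, kW) into a 3D
    kernel of shape (out, in, T, kH, kW), I3D-style.

    The 2D kernel is copied T times along a new time axis, then divided
    by T so the summed temporal response matches the 2D response on a
    video whose frames are all identical.
    """
    # insert a time axis and repeat the kernel along it
    w3d = np.repeat(w2d[:, :, None, :, :], time_dim, axis=2)
    # rescale so the temporal sum equals the original 2D kernel
    return w3d / time_dim

# toy example: a 64-filter 7x7 RGB kernel inflated to a 7-frame 3D kernel
w2d = np.random.randn(64, 3, 7, 7)
w3d = inflate_conv2d_to_3d(w2d, time_dim=7)
print(w3d.shape)  # (64, 3, 7, 7, 7)
```

Because the copies sum back to the original kernel, the inflated network starts from a point that reproduces the 2D ImageNet model frame-wise before finetuning on video.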
Agree! Thanks very much!:)
Hi Tengda,
Thanks for the detailed instructions for this code. I am a newbie in this field and have a very simple question regarding Table 2; I am in desperate need of your help. Thanks very much in advance!
Question: From what I understand, self-supervised learning can be used to learn essential video representations. So I guess that with weights learnt by self-supervised methods, training the S3D network on UCF-101 will yield better results than training from random initialization. From Table 2, I suppose 90.6 is the former and 96.8 is the latter. Would you explain a bit why there is such a gap?