BestJuly / IIC

Official implementation of ACMMM'20 paper 'Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework'

UCF101 action classification result only at 0.68 #2

Closed · wuchlei closed this issue 4 years ago

wuchlei commented 4 years ago

Hi, thanks for the good work.

I'm currently trying to reproduce the action classification results on UCF101. Using the training parameters you provided, I trained my own backbone network and the linear classifier. However, I'm only getting 0.68 accuracy with the RGB+Res+Repeat setup. I also trained a linear classifier on top of the backbone network you provided, and I'm still only getting an accuracy of around 0.685.

Could you please help me with this problem? Is there anything you think could have gone wrong? Could you share the weights of your linear classifier network?

Thanks very much.

BestJuly commented 4 years ago

Hi, @wuchlei. Thank you for your interest.

Actually, when I prepared this repo, I ran the fine-tuning part twice, and the results were 67.4% and 69.8%. With different random seeds, performance varies from run to run, so I think 0.685 and 0.68 are acceptable.

By the way, in our old code version we directly used x - shift_x instead of ((x - shift_x) + 1) / 2 and achieved 71.8%. The motivation for using ((x - shift_x) + 1) / 2 is to keep the residual input in a similar range to the RGB view during self-supervised training: with inputs roughly in [0, 1], the raw residual x - shift_x lies in [-1, 1], and the rescaling maps it back to [0, 1]. However, we found that directly using x - shift_x during the fine-tuning period achieves better performance.

To further improve the performance, you can try several strategies, such as

  1. using different data augmentation methods, such as random flipping and color jittering, or more recent methods such as CutMix;
  2. using multiple crops of the same clip (top-left, top-right, center, bottom-left, and bottom-right) and averaging their predictions (see the sketch below).

These are effective and easy-to-use tricks; we did not include them here because they are not the focus of the paper.
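As a concrete illustration of the second point, here is a minimal, hypothetical sketch of five-crop test-time averaging in PyTorch; model, the crop size, and the tensor layout are assumptions for illustration, not the repo's actual API.

import torch
import torch.nn.functional as F

def five_crop_predict(model, clip, crop=112):
    # clip: (C, T, H, W) video tensor; average softmax scores over five spatial crops
    _, _, H, W = clip.shape
    corners = [(0, 0), (0, W - crop), ((H - crop) // 2, (W - crop) // 2),
               (H - crop, 0), (H - crop, W - crop)]
    scores = []
    for top, left in corners:
        patch = clip[:, :, top:top + crop, left:left + crop].unsqueeze(0)  # add batch dim
        scores.append(F.softmax(model(patch), dim=1))
    return torch.stack(scores).mean(0)  # averaged class probabilities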

wuchlei commented 4 years ago

Thanks for the reply. I'll give it a try.

wuchlei commented 4 years ago

@BestJuly Hi Li, I've tried what you suggested, directly using x - shift_x for ft_classify. The modification to the code is:

import torch

def diff(x):
    # x: clip tensor of shape (N, C, T, H, W); shift by one frame along the temporal axis (dim 2)
    shift_x = torch.roll(x, 1, 2)
    # old version, rescaled to [0, 1]: return ((x - shift_x) + 1) / 2
    return x - shift_x

However, I'm still only getting an accuracy of around 0.684. I've even tried using the strong augmentations from SimCLR to train the backbone, and I'm only getting a 2% improvement (0.702 accuracy, still short of the 0.72 reported in the paper). Could you please share your code or training scripts for the fine-tuning phase? It would be very helpful.
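(For context, SimCLR-style frame augmentation looks roughly like the following; this is a sketch with illustrative parameters applied frame-wise via torchvision, not my exact pipeline.)

import torchvision.transforms as T

simclr_aug = T.Compose([
    T.RandomResizedCrop(112, scale=(0.2, 1.0)),                  # aggressive spatial cropping
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),  # color distortion
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
])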

BestJuly commented 4 years ago

Hi, @wuchlei .

May I ask which SSL-pretrained model you used? I reran the code twice using the provided model and the current code (ft_classify.py, x - shift_x version) and got 71.2% top-1 and 72.7% top-1, respectively. I have uploaded the model with 72.7% accuracy to the cloud drive.

Please note that for the test set, I use CenterCrop instead of RandomCrop in the corresponding data transformations. This was a bug in the previous ft_classify.py file, and I have fixed it. Our previous experimental environment used two separate files for fine-tuning and testing, so the bug was introduced during code refactoring. The fix is sketched below.
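A minimal sketch of the fix, assuming frame-wise torchvision transforms (the repo's actual pipeline may differ in sizes and details):

import torchvision.transforms as T

train_transforms = T.Compose([
    T.Resize(128),
    T.RandomCrop(112),   # random crops are fine during fine-tuning
    T.ToTensor(),
])

test_transforms = T.Compose([
    T.Resize(128),
    T.CenterCrop(112),   # the fix: deterministic center crop at test time
    T.ToTensor(),
])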

The result reported on UCF101 split 1 in the paper is 71.8% top-1 (Table 5; settings: frame repeating, res, R3D). I think this is reasonable, and here I do not use any strong data augmentation.

Again, I want to note that achieving exactly the same results as in the paper is practically impossible, because the training procedure includes SSL pretraining and fine-tuning, and the final recognition results can be affected by each step.

wuchlei commented 4 years ago

Thanks for the reply. With these fixes, I've reproduced the results (around 0.72 for Res+Repeat). With SimCLR strong augmentation, I'm even getting an additional 0.8% improvement. Again, thanks for the help, and best of luck with your research!

BestJuly commented 4 years ago

Oh, that is good news. SimCLR strong augmentations bring improvements, and other augmentations can also help if you want better performance. I remember there is an ECCV'20 paper about augmentation methods for videos.

Also, good luck with your experiments~