Sense-X / UniFormer

[ICLR2022] official implementation of UniFormer
Apache License 2.0

Question regarding Imagenet pretraining #122

Closed MLDeS closed 11 months ago

MLDeS commented 11 months ago

Thanks for the nice work! I have a question regarding the model training reported in the paper. It says:

With only ImageNet-1K pretraining, our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics600,

My question: the models are video models that take n frames as input, whereas ImageNet is an image dataset with single-frame inputs. So which parts of the model receive ImageNet-pretrained weights?

Andy1621 commented 11 months ago

All parts have ImageNet pretraining. For convolution, if the temporal kernel dimension is larger than 1, we copy the 2D weights along the temporal dimension and average them. For self-attention, we copy the same weights directly. Please check the code: https://github.com/Sense-X/UniFormer/blob/f92e423f7360b0026b83362311a4d85e448264d7/video_classification/slowfast/models/uniformer.py#L387-L421
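
Roughly, the inflation looks like this (a minimal PyTorch sketch, not the exact loading code in `uniformer.py`; the helper name and `temporal_kernel_size` argument are illustrative):

```python
import torch

def inflate_conv2d_to_3d(weight_2d: torch.Tensor, temporal_kernel_size: int) -> torch.Tensor:
    """Inflate a 2D conv weight (out_c, in_c, kh, kw) to 3D (out_c, in_c, kt, kh, kw).

    The 2D kernel is copied along the new temporal dimension and divided by the
    temporal kernel size (I3D-style averaging), so the inflated filter matches the
    2D filter's response when all input frames are identical.
    """
    kt = temporal_kernel_size
    return weight_2d.unsqueeze(2).repeat(1, 1, kt, 1, 1) / kt

# Hypothetical usage: inflate an ImageNet-pretrained 3x3 conv to a 3x3x3 video conv.
w2d = torch.randn(64, 3, 3, 3)                      # (out_c, in_c, kh, kw)
w3d = inflate_conv2d_to_3d(w2d, temporal_kernel_size=3)
print(w3d.shape)                                    # torch.Size([64, 3, 3, 3, 3])
```

Self-attention (qkv/projection) weights can be copied unchanged because they operate per token and do not depend on the number of frames.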

MLDeS commented 11 months ago

Thanks a lot for the quick response, the pointer to the code helps a lot! Just two follow-up questions.

  1. I understand the ImageNet pretraining is done on the image-based UniFormer architecture and then transferred to the video UniFormer architecture by inflating the weights as above, right?
  2. a) Is there a table comparing ImageNet pretraining vs. no pretraining? b) I see that Table 17 in the paper presents results showing that inflating the weights to 3D performs better than 2D. What is the basis of this comparison? If it is a video model, the 3D inflation was always done, right? Whether centered around the middle slice or averaged equally across the time dimension. So what does the 2D comparison mean here?

Thanks a lot again for your time to answer the questions!

Andy1621 commented 11 months ago

For convolution inflation, I suggest you read the I3D paper.

As for your other questions:

  1. Yes.
  2. a) Without ImageNet pretraining, convergence is much slower; using ImageNet pretraining is a common strategy in video training. b) 2D means we do not inflate the convolutions: we merge the temporal dimension into the batch dimension and apply the 2D convolutions per frame. For attention, we still use spatiotemporal attention.
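
Roughly, the two variants differ like this (a minimal PyTorch sketch, not the exact ablation code; the BCTHW layout, channel count, and kernel size are assumptions):

```python
import torch
import torch.nn as nn

# Assumed input layout: (batch, channels, time, height, width).
x = torch.randn(2, 64, 8, 56, 56)
B, C, T, H, W = x.shape

# "2D" variant: keep the ImageNet conv as-is and fold the temporal dimension
# into the batch dimension, so each frame is convolved independently.
conv2d = nn.Conv2d(C, C, kernel_size=3, padding=1)
x2d = x.permute(0, 2, 1, 3, 4).reshape(B * T, C, H, W)        # (B*T, C, H, W)
y2d = conv2d(x2d).reshape(B, T, C, H, W).permute(0, 2, 1, 3, 4)

# "3D" variant: inflate the same kernel along time (copy + average) so the
# convolution also mixes information across neighbouring frames.
conv3d = nn.Conv3d(C, C, kernel_size=3, padding=1)
with torch.no_grad():
    conv3d.weight.copy_(conv2d.weight.unsqueeze(2).repeat(1, 1, 3, 1, 1) / 3)
    conv3d.bias.copy_(conv2d.bias)
y3d = conv3d(x)

print(y2d.shape, y3d.shape)   # both (2, 64, 8, 56, 56)
```
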
MLDeS commented 11 months ago

Thanks a lot for the answers!