All the parts have ImageNet pretraining. For convolution, if the temporal dimension is larger than 1, we copy the 2D weights along the temporal axis and average them. For self-attention, we copy the same weights unchanged. Please check the code: https://github.com/Sense-X/UniFormer/blob/f92e423f7360b0026b83362311a4d85e448264d7/video_classification/slowfast/models/uniformer.py#L387-L421
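For illustration, here is a minimal PyTorch sketch of that inflation scheme; the helper `inflate_conv_weight` and the module shapes are illustrative assumptions, not the repo's actual code, so please refer to the linked lines for the real implementation:

```python
import torch

def inflate_conv_weight(weight_2d: torch.Tensor, t: int) -> torch.Tensor:
    """Inflate an ImageNet 2D kernel (out_ch, in_ch, kH, kW) into a 3D
    kernel (out_ch, in_ch, t, kH, kW): replicate it t times along the
    temporal axis and divide by t, so a temporally constant input
    produces the same activations as the original 2D conv (I3D-style)."""
    weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, t, 1, 1)
    return weight_3d / t

# Conv weights: copy-and-average along the new temporal axis.
conv2d = torch.nn.Conv2d(3, 64, kernel_size=3)
conv3d = torch.nn.Conv3d(3, 64, kernel_size=(3, 3, 3))
conv3d.weight.data.copy_(inflate_conv_weight(conv2d.weight.data, t=3))
conv3d.bias.data.copy_(conv2d.bias.data)

# Self-attention parameters (qkv / proj linear layers) have no kernel
# with a temporal axis, so their ImageNet weights are copied unchanged.
```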
Thanks a lot for the quick response; the pointer to the code helps a lot! Just two follow-up questions.
Thanks a lot again for taking the time to answer the questions!
For convolution inflation, I suggest you read the I3D paper.
As for your other questions:
`2D` means we do not inflate the convolution; instead, we merge the temporal dimension into the batch dimension so the convolution stays 2D. For attention, however, we use spatiotemporal attention.
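Concretely, a minimal PyTorch sketch of what this means; the shapes and modules below are illustrative assumptions, not the actual UniFormer code:

```python
import torch

B, C, T, H, W = 2, 64, 8, 14, 14
x = torch.randn(B, C, T, H, W)

# "2D" conv: fold the temporal axis into the batch axis so an ordinary
# 2D convolution is applied to each frame independently.
frames = x.permute(0, 2, 1, 3, 4).reshape(B * T, C, H, W)
conv2d = torch.nn.Conv2d(C, C, kernel_size=3, padding=1)
y = conv2d(frames).reshape(B, T, C, H, W).permute(0, 2, 1, 3, 4)

# Spatiotemporal attention: flatten all T*H*W positions into a single
# token sequence, so every token attends over space and time jointly.
tokens = x.flatten(2).transpose(1, 2)                  # (B, T*H*W, C)
attn = torch.nn.MultiheadAttention(C, num_heads=8, batch_first=True)
out, _ = attn(tokens, tokens, tokens)                  # (B, T*H*W, C)
```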
Thanks a lot for the answers!
Thanks for the nice work! I have a question regarding the model training reported in the paper. It says
The models are video models that take n frames as input, whereas ImageNet is image data with single-image inputs. So my question is: which parts have ImageNet-pretrained weights?