How does the video transformer pre-train on ImageNet-1K? Isn't the input different? For example, the video model uses a 3D patch embedding, while the image model uses a 2D patch embedding.

Thanks for your question. Yes, the input is different; we simply inflate the pretrained 2D kernels to 3D for video input. The code is here: https://github.com/Sense-X/UniFormer/blob/e8024703bffb89cb7c7d09e0d774a0d2a9f96c25/video_classification/slowfast/models/uniformer.py#L387-L421
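For readers who want the gist without opening the link, below is a minimal sketch of this kind of 2D-to-3D kernel inflation (the I3D-style mean inflation). The function name `inflate_2d_to_3d` and the layer shapes are illustrative, not the repo's actual identifiers; see the linked `uniformer.py` for the exact implementation.

```python
import torch

def inflate_2d_to_3d(weight_2d: torch.Tensor, time_dim: int) -> torch.Tensor:
    # weight_2d: (out_ch, in_ch, kH, kW) from an ImageNet-pretrained 2D conv
    # returns:   (out_ch, in_ch, time_dim, kH, kW) for the matching 3D conv
    weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
    # Divide by time_dim so that a clip of identical frames produces the
    # same response as the single image did under the original 2D kernel.
    return weight_3d / time_dim

# Hypothetical usage: inflate a 2D patch-embedding conv into a 3D one
# (kernel sizes and strides here are illustrative)
conv2d = torch.nn.Conv2d(3, 96, kernel_size=4, stride=4)                   # image patch embed
conv3d = torch.nn.Conv3d(3, 96, kernel_size=(3, 4, 4), stride=(2, 4, 4))   # video patch embed
with torch.no_grad():
    conv3d.weight.copy_(inflate_2d_to_3d(conv2d.weight, time_dim=3))
    conv3d.bias.copy_(conv2d.bias)
```

The division by `time_dim` is what makes the image pretraining transfer: on a "static video" (repeated frames), the inflated 3D network initially behaves like the 2D network on a single image.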
As there is no more activity, I am closing the issue, don't hesitate to reopen it if necessary.