This repo is the official implementation of "UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer". By Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang and Yu Qiao.
11/14/2023
Thanks to Innat (@innat) for the help. Our models now also support Keras! 😄
07/14/2023
UniFormerV2 has been accepted to ICCV 2023! 🎉
02/13/2023
UniFormerV2 has been integrated into MMAction2. Training code will be provided soon! 😄
11/20/2022
We provide a video demo on Hugging Face. Give it a try! 😄
11/19/2022
We have published a blog post in Chinese on Zhihu.
11/18/2022
All the code, models, and configs are now available. Don't hesitate to open an issue if you run into any problems! 🙋🏻
In UniFormerV2, we propose a generic paradigm for building a powerful family of video networks by arming pre-trained ViTs with efficient UniFormer designs. It inherits the concise style of the UniFormer block but contains brand-new local and global relation aggregators, which allow for a preferable accuracy-computation balance by seamlessly integrating the advantages of both ViTs and UniFormer. It achieves state-of-the-art recognition performance on 8 popular video benchmarks, including the scene-related Kinetics-400/600/700 and Moments in Time, the temporal-related Something-Something V1/V2, and the untrimmed ActivityNet and HACS. In particular, it is the first model to achieve 90% top-1 accuracy on Kinetics-400.
All the models can be found in MODEL_ZOO.
See INSTRUCTIONS for more details about:
If you find this repository useful, please use the following BibTeX entry for citation.
@misc{li2022uniformerv2,
title={UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer},
author={Kunchang Li and Yali Wang and Yinan He and Yizhuo Li and Yi Wang and Limin Wang and Yu Qiao},
year={2022},
eprint={2211.09552},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
This project is released under the MIT license. Please see the LICENSE file for more information.
This repository is built on the UniFormer and SlowFast repositories.