Please add the following four papers which use transformer backbones:
Egocentric video-language pre-training, addressing video-text retrieval, video classification, text-guided video grounding, text-guided video summarization, video question answering, etc.
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone (ICCV 2023) [Paper] [Code] [Project] [Poster]
Video temporal grounding: unifies diverse temporal annotations to power moment retrieval (interval), highlight detection (curve), and video summarization (point).
UniVTG: Towards Unified Video-Language Temporal Grounding (ICCV 2023) [Paper] [Code]
Egocentric Video-Language Pretraining (NeurIPS 2022) [Paper] [Code] [Project] [Poster]