Yicheng Xiao1*, Zhuoyan Luo1*, Yong Liu1, Yue Ma1, Hengwei Bian2, Yatai Ji1, Yujiu Yang1 and Xiu Li1
1 Tsinghua University, 2 Carnegie Mellon University
Video Moment Retrieval (MR) and Highlight Detection (HD) have attracted significant attention due to the growing demand for video analysis. Recent approaches treat MR and HD as similar video grounding problems and address them jointly with a transformer-based architecture. However, we observe that the two tasks have different emphases: one requires perceiving local relationships, while the other prioritizes understanding global context. Consequently, the lack of task-specific design inevitably limits how well a model captures the intrinsic specialty of each task. To tackle this issue, we propose a Unified Video COMprehension framework (UVCOM) that bridges the gap and solves MR and HD jointly and effectively. By progressively integrating intra- and inter-modality information across multiple granularities, UVCOM achieves a comprehensive understanding of a video. Moreover, we present multi-aspect contrastive learning, which consolidates local relation modeling and global knowledge accumulation through a well-aligned multi-modal space. Extensive experiments on the QVHighlights, Charades-STA, TACoS, YouTube Highlights and TVSum datasets demonstrate the effectiveness and rationality of UVCOM, which outperforms state-of-the-art methods by a remarkable margin.
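For intuition, below is a minimal, hypothetical sketch of the video-text contrastive alignment that multi-aspect contrastive learning builds on. It is an illustrative InfoNCE-style objective, not UVCOM's exact loss; the function name and tensor shapes are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Illustrative InfoNCE-style alignment, NOT UVCOM's exact multi-aspect loss.

    video_emb, text_emb: (batch, dim) pooled video / query embeddings.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature  # (batch, batch) pairwise similarities
    labels = torch.arange(v.size(0), device=v.device)  # diagonal = matched pairs
    # Symmetric cross-entropy: pull each video toward its paired query and vice versa.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```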
QVHighlights: Organize the data as follows, and replace the feat_root path in the bash scripts with your own. You can download the official QVHighlights features from moment_detr_features.tar.gz.
QVHighlight
└──── features
├── slowfast_features
├── clip_text_features
├── clip_features
├── pann_features
└── clip_sub_features
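As a quick sanity check after downloading, the snippet below loads one feature file per directory. It assumes the Moment-DETR convention of .npz files containing a "features" array, and the feat_root path is a placeholder; inspect np.load(...).files if your copy differs.

```python
import numpy as np
from pathlib import Path

feat_root = Path("/path/to/QVHighlight/features")  # replace with your own feat_root

# Assumption: each directory holds .npz files with a "features" array,
# following the Moment-DETR feature release; verify against your download.
for sub in ["slowfast_features", "clip_features", "clip_text_features"]:
    sample = next((feat_root / sub).glob("*.npz"))
    feats = np.load(sample)["features"]
    print(f"{sub}: {sample.name} -> shape {feats.shape}")
```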
conda create -n uvcom python=3.7
conda activate uvcom
# Install pytorch
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
# Install other packages
pip install -r requirements.txt
Tip: to reproduce the reported numbers exactly, install the package versions above and run on an RTX 3090.
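A quick, optional check that the environment matches the pinned versions (a sketch; the expected strings correspond to the CUDA 11.1 wheels installed above):

```python
import torch
import torchvision

print(torch.__version__)        # expect 1.9.0+cu111
print(torchvision.__version__)  # expect 0.10.0+cu111
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # ideally an RTX 3090 for exact reproduction
```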
Results on QVHighlights:

Extra Training Data | Use Audio | Split | MR R1@0.5 | MR R1@0.7 | MR mAP | HD mAP | HD HIT@1 | Log/ckpt
---|---|---|---|---|---|---|---|---
✗ | ✗ | Val | 65.10 | 51.81 | 45.79 | 40.03 | 63.29 | log/ckpt
✗ | ✗ | Test | 63.55 | 47.47 | 43.18 | 39.74 | 64.20 | log/ckpt
✗ | ✔ | Test | 63.18 | 48.70 | 43.27 | 39.79 | 64.79 | --/--
ASR | ✗ | Test | 64.53 | 48.31 | 43.80 | 39.98 | 65.58 | --/--
Results on Charades-STA:

Extra Training Data | Use Audio | Split | MR R1@0.5 | MR R1@0.7 | Log/ckpt
---|---|---|---|---|---
✗ | ✗ | Test | 59.25 | 36.64 | log/ckpt
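For reference, MR R1@µ in the tables above is the percentage of queries whose top-1 predicted moment reaches temporal IoU ≥ µ with the ground-truth moment. The sketch below illustrates that computation with hypothetical helper names; it is not the repo's official evaluation code.

```python
def temporal_iou(pred, gt):
    """pred, gt: (start, end) moments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall1_at(top1_preds, gts, threshold=0.5):
    """Percentage of queries whose top-1 moment has IoU >= threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)

# One query: prediction [10s, 25s] vs. ground truth [12s, 30s] -> IoU = 0.65,
# which counts toward R1@0.5 but not R1@0.7.
print(temporal_iou((10.0, 25.0), (12.0, 30.0)))
```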
Training on QVHighlights:
bash scripts/train_QV_scratch.sh
Modify the relevant paths in the script to your own.
Evaluation on QVHighlights:
bash scripts/eval_QV_scratch.sh
Modify the resume ckpt path to your own.
The code in this repository is built upon several public repositories. Thanks to Moment-DETR and QD-DETR for their wonderful work!
If you find this work useful for your research, please cite:
@article{DBLP:journals/corr/abs-2311-16464,
author = {Yicheng Xiao and
Zhuoyan Luo and
Yong Liu and
Yue Ma and
Hengwei Bian and
Yatai Ji and
Yujiu Yang and
Xiu Li},
title = {Bridging the Gap: {A} Unified Video Comprehension Framework for Moment
Retrieval and Highlight Detection},
journal = {CoRR},
volume = {abs/2311.16464},
year = {2023}
}
Our code is released under the MIT license.