Momentor (ICML 2024)

The official repository of paper Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning.

Momentor Overview

Momentor is a Video-LLM designed for fine-grained comprehension and localization in videos. It is composed of a frame encoder, a linear projection layer, a Temporal Perception Module (TPM), and a Large Language Model (LLM). We carefully design the Temporal Perception Module (TPM) to improve fine-grained temporal modeling and representation. Architecture and training of Momentor are shown in the following figure.

Installation

Git clone our repository and creating conda environment:

cd Momentor/momentor
conda create --name=momentor python=3.10
conda activate momentor
pip install -r requirements.txt

Training

For training instructions, check out train_momentor.md.

Moment-10M

We present Moment-10M, a large-scale video instruction dataset with segment-level annotation. We use videos from YTTemporal-1B to construct Moment-10M. We propose an automatic data generation engine to extract instance and event information from these videos and generate segment-level instruction following data. We meticulously design 5 single-segment tasks and 3 cross-segment tasks, which enables Video-LLMs perform comprehensive segment-level reasoning.

We are releasing our Moment-10M dataset, you can download it from the following links: part1, part2.

You can also download the data for Grounded Event-Sequence Modeling here: GESM.

After downloading and extracting the dataset to obtain the data files, you can use convert_data.py to transform the data into a text dialogue format and download_videos.py to download the corresponding video files. The usage for these scripts is as follows:

python convert_data.py --source_path <path_to_data_file> --target_path <path_to_converted_file>

Parameters:

--source_path: The path to the input data file that needs to be converted.
--target_path: The path where the converted file will be saved.

python download_videos.py --source_path <path_to_data_file> --video_path <path_to_store_videos>

Parameters:

--source_path: The path to the input data file containing identifiers for the videos.
--video_path: The path where the downloaded video files will be stored.

For GESM data extraction, use convert_data_gesm.py as follows:

python convert_data_gesm.py --source_path <path_to_data_file> --target_path <path_to_converted_file>

Parameters:

--source_path: The path to the input data file that needs to be converted.
--target_path: The path where the converted file will be saved.

Citation

If you found our work useful in your research, please consider giving this repository a star and citing our paper as followed:

@misc{qian2024momentor,
      title={Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning}, 
      author={Long Qian and Juncheng Li and Yu Wu and Yaobo Ye and Hao Fei and Tat-Seng Chua and Yueting Zhuang and Siliang Tang},
      year={2024},
      eprint={2402.11435},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgment

Thanks to the open source of the following projects:

DCDmllm / Momentor

readme

Momentor (ICML 2024)

Momentor Overview

Installation

Training

Moment-10M

Citation

Acknowledgment