
[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
https://arxiv.org/abs/2312.02051
BSD 3-Clause "New" or "Revised" License

TimeChat

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

Shuhuai Ren*, Linli Yao*, Shicheng Li, Xu Sun, Lu Hou

News


Introduction

Example Outputs

Fine-tuned Checkpoints

The following checkpoints store only the learnable parameters (positional embedding layers, Time-aware Frame Encoder, Sliding Video Q-Former, linear projection layers, and LoRA weights).

| Checkpoint | LLM backbone | Link | Note |
| --- | --- | --- | --- |
| TimeChat-2-7B-Finetuned | LLaMA-2 7B | link | Fine-tuned on the instruction-tuning data from TimeIT-104K (ASR version) and Valley-73K (previous version of the current Valley-65K) |

Usage

Environment Preparation

First, install ffmpeg.

apt update
apt install ffmpeg
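You can confirm the install with a quick stdlib check (our own convenience snippet, not part of the repo's scripts):

```python
# Check that ffmpeg is reachable on PATH after installation
# (a sanity check of ours, not part of the TimeChat repo).
import shutil

def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable is found on PATH."""
    return shutil.which("ffmpeg") is not None

if __name__ == "__main__":
    print("ffmpeg found:", ffmpeg_available())
```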

Then, create a conda environment:

conda env create -f environment.yml
conda activate timechat
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
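Since the pip command above installs a CUDA 11.3 build, it is worth verifying that PyTorch actually sees your GPU before training. A minimal sketch (the guard against a missing torch install is ours, not part of the repo's setup):

```python
# Sanity-check that the CUDA build of PyTorch is active.
# If torch is not importable we report False instead of crashing
# (our own guard, not part of the TimeChat setup scripts).
import importlib.util

def cuda_build_ok() -> bool:
    """Return True iff torch is importable and sees at least one GPU."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return torch.cuda.is_available()

if __name__ == "__main__":
    print("CUDA available:", cuda_build_ok())
```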

Prerequisites

Before fine-tuning your own model (or reproducing our TimeChat model), make sure you have obtained the following checkpoints:

Pre-trained Image Encoder (EVA ViT-g)

wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/eva_vit_g.pth

Pre-trained Image Q-Former (InstructBLIP Q-Former)

wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/InstructBLIP/instruct_blip_vicuna7b_trimmed.pth
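No official checksums are published for these downloads, so a crude way to catch a truncated file is to check its size. This is a hypothetical helper of ours; the size threshold is an assumption (these checkpoints weigh in at hundreds of MB to a few GB):

```python
# Minimal helper to catch an obviously truncated download.
# The min_bytes threshold is our own assumption; no official
# checksums are published for these checkpoints.
from pathlib import Path

def looks_complete(path: str, min_bytes: int = 100_000_000) -> bool:
    """Return True if the file exists and is at least min_bytes long."""
    p = Path(path)
    return p.is_file() and p.stat().st_size >= min_bytes

if __name__ == "__main__":
    for ckpt in ("eva_vit_g.pth", "instruct_blip_vicuna7b_trimmed.pth"):
        status = "ok" if looks_complete(ckpt) else "missing or truncated"
        print(ckpt, status)
```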

Pre-trained Language Decoder (LLaMA-2-7B) and Video Encoder (Video Q-Former of Video-LLaMA)

Use git-lfs to download weights of Video-LLaMA (7B):

git lfs install
git clone https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-2-7B-Finetuned

Instruct-tuned TimeChat-7B

git lfs install
git clone https://huggingface.co/ShuhuaiRen/TimeChat-7b

The file structure looks like:

ckpt/
|-- Video-LLaMA-2-7B-Finetuned/
    |-- llama-2-7b-chat-hf/
    |-- VL_LLaMA_2_7B_Finetuned.pth
|-- instruct-blip/
    |-- instruct_blip_vicuna7b_trimmed.pth
|-- eva-vit-g/
    |-- eva_vit_g.pth
|-- timechat/
    |-- timechat_7b.pth
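Once everything is downloaded, a small script can confirm the expected layout. This is a sketch based on the tree above; the `ckpt_root` parameter is ours to adjust if you keep the files elsewhere:

```python
# Verify the checkpoint layout described above. The relative paths are
# taken from the tree in this README; ckpt_root is an assumption you
# may need to change for your own setup.
from pathlib import Path

EXPECTED = [
    "Video-LLaMA-2-7B-Finetuned/llama-2-7b-chat-hf",
    "Video-LLaMA-2-7B-Finetuned/VL_LLaMA_2_7B_Finetuned.pth",
    "instruct-blip/instruct_blip_vicuna7b_trimmed.pth",
    "eva-vit-g/eva_vit_g.pth",
    "timechat/timechat_7b.pth",
]

def missing_checkpoints(ckpt_root: str = "ckpt") -> list:
    """Return the relative paths from EXPECTED that do not exist yet."""
    root = Path(ckpt_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]

if __name__ == "__main__":
    for rel in missing_checkpoints():
        print("missing:", rel)
```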

How to Run Demo Locally

Please refer to our Jupyter Demo here.

Instruction-Tuning

Data

For now, the fine-tuning dataset consists of:

Script

Tuning

Configure the checkpoint and dataset paths in stage2_finetune_time104k_valley72k.yaml.

conda activate timechat
torchrun --nproc_per_node=8 train.py --cfg-path train_configs/stage2_finetune_time104k_valley72k.yaml

Evaluation

Configure the checkpoint and dataset paths in timechat.yaml.

Configure the downstream task in eval.sh.

bash eval.sh

Recommended GPUs

Acknowledgement

We are grateful to the following awesome projects that TimeChat builds upon:

Term of Use

Our TimeChat is a research preview intended for non-commercial use only. You must not use TimeChat for any illegal, harmful, violent, racist, or sexual purposes. Engaging in any activity that may violate these guidelines is strictly prohibited.

Citation

If you find our project useful, please star our repo and cite our paper as follows:

@article{Ren2023TimeChat,
  title={TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding},
  author={Shuhuai Ren and Linli Yao and Shicheng Li and Xu Sun and Lu Hou},
  journal={ArXiv},
  year={2023},
  volume={abs/2312.02051},
}