VectorSpaceLab / Video-XL

πŸ”₯πŸ”₯First-ever hour scale video understanding models
Apache License 2.0
180 stars 11 forks source link

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

🌐 Blog | πŸ“ƒ Paper | πŸ€— Hugging Face | πŸŽ₯ Demo

(Left) The performance and max frames of different models.
(Right) Results on Needle-in-a-haystack evaluation on a single 80G GPU.

✨ Highlights:

(i) Comprehensive long video understanding. Video-XL 7B achieves the leading performance among 7B models on MLVU, VideoMME, VNBench and LongVideoBench.

(ii) Efficient Long visual context processing. Video-XL can process 2048 frames on an 80G GPU and achieves nearly 95% accuracy on Needle-in-a-haystack evaluation.

(iii) Video-XL shows strong ability in some real-world scenarios, like movie summarization, surveillance anomaly detection and Ad placement identification.

News

Model weights

Please download our pre-trained and finetuned model weights from the link

Installation

conda create -n videoxl python=3.10 -y && conda activate videoxl
pip install torch==2.1.2 torchvision --index-url https://download.pytorch.org/whl/cu118
pip install -e "videoxl/.[train]"
pip install packaging &&  pip install ninja && pip install flash-attn --no-build-isolation --no-cache-dir
pip install -r requirements.txt

Quick Start With HuggingFace

Example Code ```python from videoxl.model.builder import load_pretrained_model from videoxl.mm_utils import tokenizer_image_token, process_images,transform_input_id from videoxl.constants import IMAGE_TOKEN_INDEX,TOKEN_PERFRAME from PIL import Image from decord import VideoReader, cpu import torch import numpy as np # fix seed torch.manual_seed(0) model_path = "assets/videoxl_checkpoint-15000" video_path="assets/ad2_watch_15min.mp4" max_frames_num =900 gen_kwargs = {"do_sample": True, "temperature": 1, "top_p": None, "num_beams": 1, "use_cache": True, "max_new_tokens": 1024} tokenizer, model, image_processor, _ = load_pretrained_model(model_path, None, "llava_qwen", device_map="cuda:0") model.config.beacon_ratio=[8] # you can delete this line to realize random compression of {2,4,8} ratio #video input prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n\nDoes this video contain any inserted advertisement? If yes, which is the content of the ad?<|im_end|>\n<|im_start|>assistant\n" input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device) vr = VideoReader(video_path, ctx=cpu(0)) total_frame_num = len(vr) uniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int) frame_idx = uniform_sampled_frames.tolist() frames = vr.get_batch(frame_idx).asnumpy() video_tensor = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].to(model.device, dtype=torch.float16) beacon_skip_first = (input_ids == IMAGE_TOKEN_INDEX).nonzero(as_tuple=True)[1].item() num_tokens=TOKEN_PERFRAME *max_frames_num beacon_skip_last = beacon_skip_first + num_tokens with torch.inference_mode(): output_ids = model.generate(input_ids, images=[video_tensor], modalities=["video"],beacon_skip_first=beacon_skip_first,beacon_skip_last=beacon_skip_last, **gen_kwargs) if IMAGE_TOKEN_INDEX in input_ids: transform_input_ids=transform_input_id(input_ids,num_tokens,model.config.vocab_size-1) output_ids=output_ids[:,transform_input_ids.shape[1]:] outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip() print(outputs) ```

Pre-training

bash scripts/pretrain.sh

Fine-tuning

You can only utilize single image training data to efficiently train

bash scripts/finetune_i.sh

or use single image/multi-image/video data to get better performance

bash scripts/finetune_v.sh

Long Video Benchmark Evaluation

For MLVU, Video-MME, LongVideoBench evaluation, please use lmms-eval After installing lmms-eval and videoxl, you can use the following script to evaluate.

First, put the video_xl.py in lmms-eval/lmms_eval/models. Then add "video_xl" in lmms-eval/lmms_eval/models/init.py. Lastly, run the following code.

accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
    --model videoxl \
    --model_args pretrained=videoxl_checkpoint_15000,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=128,video_decode_backend=decord\
    --tasks videomme \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix videoxl \
    --output_path ./logs/
Expand to see the performance on Video-MME and MLVU

For VNBench evaluation, download VNBench and use the following script

bash eval/eval_vnbench.sh
Expand to see the performance on VNbench and LongVideoBench

Needle-in-a-haystack evaluation

To be coming soon

Training Data

Please refer to train_samples so you can finetune with your own image or video data. We will realse our trainiing data in the near future!

Citation

If you find this repository useful, please consider giving a star :star: and citation

@article{shu2024video,
  title={Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding},
  author={Shu, Yan and Zhang, Peitian and Liu, Zheng and Qin, Minghao and Zhou, Junjie and Huang, Tiejun and Zhao, Bo},
  journal={arXiv preprint arXiv:2409.14485},
  year={2024}
}

Acknowledgement

License

This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses. The content of this project itself is licensed under the Apache license 2.0.