Blog | Paper | Hugging Face | Demo
(Left) Performance and maximum number of input frames for different models.
(Right) Results of the Needle-in-a-Haystack evaluation on a single 80G GPU.
✨ Highlights:
(i) Comprehensive long video understanding. Video-XL 7B achieves leading performance among 7B models on MLVU, VideoMME, VNBench, and LongVideoBench.
(ii) Efficient long visual context processing. Video-XL can process 2048 frames on a single 80G GPU and achieves nearly 95% accuracy on the Needle-in-a-Haystack evaluation.
(iii) Strong performance in real-world scenarios, such as movie summarization, surveillance anomaly detection, and ad placement identification.
Please download our pre-trained and fine-tuned model weights from the provided link.
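If the weights are hosted on Hugging Face, the sketch below shows one way to fetch them with `huggingface_hub`. This is only a hedged example: the repository id is a placeholder and should be replaced with the repository referenced by the link above.

```python
# Hypothetical download sketch: the repo_id below is a placeholder, not an
# official identifier -- replace it with the repository referenced by the link above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<org>/<video-xl-checkpoint>",  # placeholder repo id
    local_dir="./videoxl_checkpoint",       # where the weights will be stored
)
```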
conda create -n videoxl python=3.10 -y && conda activate videoxl
pip install torch==2.1.2 torchvision --index-url https://download.pytorch.org/whl/cu118
pip install -e "videoxl/.[train]"
pip install packaging && pip install ninja && pip install flash-attn --no-build-isolation --no-cache-dir
pip install -r requirements.txt
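Optionally, a quick sanity check (a minimal sketch, assuming a CUDA-capable GPU and the packages installed above) can confirm that the CUDA build of PyTorch and flash-attn are usable:

```python
# Quick environment sanity check: verifies the CUDA build of PyTorch and
# that flash-attn imports cleanly. Package names follow the install commands above.
import torch
import flash_attn

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)
```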
bash scripts/pretrain.sh
You can use only single-image training data for efficient training:
bash scripts/finetune_i.sh
or use single-image, multi-image, and video data for better performance:
bash scripts/finetune_v.sh
For MLVU, Video-MME, and LongVideoBench evaluation, please use lmms-eval. After installing lmms-eval and videoxl, you can evaluate as follows. First, put video_xl.py in lmms-eval/lmms_eval/models. Then register "video_xl" in lmms-eval/lmms_eval/models/__init__.py; a hedged sketch of this step is shown below. Lastly, run the evaluation command after the sketch.
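A minimal sketch of the registration step, assuming lmms-eval keeps its usual AVAILABLE_MODELS mapping in lmms_eval/models/__init__.py; the "VideoXL" class name is illustrative and must match the class actually defined in video_xl.py:

```python
# lmms-eval/lmms_eval/models/__init__.py (sketch)
# Assumption: lmms-eval registers models through the AVAILABLE_MODELS dictionary;
# "VideoXL" is an illustrative class name -- use the class defined in video_xl.py.
AVAILABLE_MODELS = {
    # ... existing entries ...
    "video_xl": "VideoXL",
}
```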
accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
--model videoxl \
--model_args pretrained=videoxl_checkpoint_15000,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=128,video_decode_backend=decord \
--tasks videomme \
--batch_size 1 \
--log_samples \
--log_samples_suffix videoxl \
--output_path ./logs/
For VNBench evaluation, please download VNBench and run the following script:
bash eval/eval_vnbench.sh
Coming soon.
Please refer to train_samples to fine-tune with your own image or video data. We will release our training data in the near future!
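As a rough, hedged sketch only (the authoritative schema is whatever train_samples contains), a LLaVA-style annotation entry for a custom video sample might look roughly like the following; every field value below is illustrative:

```python
# Illustrative only: a LLaVA-style annotation entry for custom video data.
# The exact schema is defined by train_samples; field names and paths here are assumptions.
import json

samples = [
    {
        "id": "demo_0000",
        "video": "videos/clip_0000.mp4",  # hypothetical path to a local video file
        "conversations": [
            {"from": "human", "value": "<image>\nDescribe what happens in this video."},
            {"from": "gpt", "value": "A person enters the room, sits down, and starts reading."},
        ],
    }
]

with open("my_train_data.json", "w") as f:
    json.dump(samples, f, indent=2)
```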
If you find this repository useful, please consider giving a star :star: and a citation.
@article{shu2024video,
title={Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding},
author={Shu, Yan and Zhang, Peitian and Liu, Zheng and Qin, Minghao and Zhou, Junjie and Huang, Tiejun and Zhao, Bo},
journal={arXiv preprint arXiv:2409.14485},
year={2024}
}
This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses. The content of this project itself is licensed under the Apache License 2.0.