
This is the official implementation of "Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams".

Project page: https://invinciblewyq.github.io/vstream-page/

Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams

Haoji Zhang*, Yiqin Wang*, Yansong Tang †, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin†‡

* Equally contributing first authors, †Correspondence, ‡Project Lead

Work done during an internship at ByteDance.


We present Flash-VStream, a novel LMM able to process extremely long video streams in real time and respond to user queries simultaneously.

We also propose VStream-QA, a novel question answering benchmark specifically designed for online video streaming understanding.


Contents

- Install
- Model
- Preparation
- Train
- Evaluation
- Real-time CLI Inference
- VStream-QA Benchmark
- Citation
- Acknowledgement
- License

Install

Please follow the instructions below to install the required packages.

  1. Clone this repository

    git clone https://github.com/IVGSZ/Flash-VStream.git

  2. Install Package

    conda create -n vstream python=3.10 -y
    conda activate vstream
    cd Flash-VStream
    pip install --upgrade pip
    pip install -e .
  3. Install additional packages for training

    pip install ninja
    pip install flash-attn --no-build-isolation

Model

We provide our Flash-VStream model weights after Stage 1 and Stage 2 training:

| Model | Weight | Initialized from LLM | Initialized from ViT |
| --- | --- | --- | --- |
| Flash-VStream-7b | Flash-VStream-7b | lmsys/vicuna-7b-v1.5 | openai/clip-vit-large-patch14 |

Preparation

Dataset

Image VQA Dataset. Please organize the Image VQA training data following this and the evaluation data following this. Put the pretraining data, finetuning data, and evaluation data in the pretrain, finetune, and eval_video folders following Structure.

Video VQA Dataset. Please download the 2.5M subset of WebVid and the ActivityNet dataset from the official website or from video-chatgpt.

If you want to perform evaluation, please also download corresponding files of ActivityNet-QA and NExT-QA-OE. You can download MSVD-QA and MSRVTT-QA from LLaMA-VID.

Meta Info. For meta info of training data, please download the following files and organize them as in Structure.

| Training Stage | Data file name | Size |
| --- | --- | --- |
| Pretrain | llava_558k_with_webvid.json | 254 MB |
| Finetune | llava_v1_5_mix665k_with_video_chatgpt.json | 860 MB |

For meta info of the evaluation data, please reformat each QA list into a JSON file named test_qa.json, placed as shown in Structure, with the following format:

[
    {
        "video_id": "v_1QIUV7WYKXg",
        "question": "is the athlete wearing trousers",
        "id": "v_1QIUV7WYKXg_3",
        "answer": "no",
        "answer_type": 3,
        "duration": 9.88
    },
    {
        "video_id": "v_9eniCub7u60",
        "question": "does the girl in black clothes have long hair",
        "id": "v_9eniCub7u60_2",
        "answer": "yes",
        "answer_type": 3,
        "duration": 19.43
    }
]
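
A minimal sketch of writing such a file is shown below. The raw_qa list and its field names are hypothetical placeholders for whatever schema your source benchmark ships with; only the output keys and the output path follow the format and Structure above.

import json

# Hypothetical input records; map your benchmark's own fields onto the keys below.
raw_qa = [
    {"video": "v_1QIUV7WYKXg", "question": "is the athlete wearing trousers",
     "qid": "v_1QIUV7WYKXg_3", "answer": "no", "answer_type": 3, "duration": 9.88},
]

test_qa = [
    {
        "video_id": r["video"],
        "question": r["question"],
        "id": r["qid"],
        "answer": r["answer"],
        "answer_type": r["answer_type"],
        "duration": r["duration"],
    }
    for r in raw_qa
]

with open("data/eval_video/ActivityNet-QA/test_qa.json", "w") as f:
    json.dump(test_qa, f, indent=4)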

Pretrained Weights

We recommend downloading the pretrained weights from the following links, Vicuna-7b-v1.5 and clip-vit-large-patch14, and putting them in the ckpt folder following Structure.
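
For example, both checkpoints can be fetched with huggingface_hub (a sketch; downloading via git lfs works just as well, as long as the local directory names match Structure):

from huggingface_hub import snapshot_download

# Download the LLM and ViT weights into the ckpt folder expected by Structure.
snapshot_download(repo_id="lmsys/vicuna-7b-v1.5", local_dir="ckpt/vicuna-7b-v1.5")
snapshot_download(repo_id="openai/clip-vit-large-patch14", local_dir="ckpt/clip-vit-large-patch14")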

Feature Extraction

We recommend extracting ViT features of the training and evaluation data in advance, which significantly accelerates training and evaluation. If you do so, replace .mp4 with .safetensors in each video filename and put the resulting files in the image_features and video_features folders. If not, ignore the image_features and video_features folders.

We load video features at fps=1 and arrange them in temporal order.

Each .safetensors file should contain a dict like this:

{
    'feature': torch.Tensor() with shape=[256, 1024] for image and shape=[Length, 256, 1024] for video.
}
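
For reference, the sketch below shows one way to produce such a file for a single video with the clip-vit-large-patch14 encoder listed above: frames are sampled at 1 fps and the 256 patch tokens per frame (CLS token dropped) are stacked into a [Length, 256, 1024] tensor. Whether the released code uses exactly this layer and token selection is an assumption, so check the feature-extraction utilities in flash_vstream before relying on it; the paths are illustrative.

import cv2
import torch
from PIL import Image
from safetensors.torch import save_file
from transformers import CLIPImageProcessor, CLIPVisionModel

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = CLIPImageProcessor.from_pretrained("ckpt/clip-vit-large-patch14")
encoder = CLIPVisionModel.from_pretrained("ckpt/clip-vit-large-patch14").to(device).eval()

def extract_video_features(video_path, out_path):
    # Sample one frame per second (fps=1), kept in time order.
    cap = cv2.VideoCapture(video_path)
    native_fps = max(1, int(round(cap.get(cv2.CAP_PROP_FPS) or 30)))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % native_fps == 0:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        idx += 1
    cap.release()

    feats = []
    with torch.no_grad():
        for image in frames:
            pixels = processor(images=image, return_tensors="pt").pixel_values.to(device)
            hidden = encoder(pixels).last_hidden_state   # [1, 257, 1024]
            feats.append(hidden[:, 1:].cpu())            # drop CLS token -> [1, 256, 1024]
    feature = torch.cat(feats, dim=0)                    # [Length, 256, 1024]
    save_file({"feature": feature}, out_path)            # key matches the dict above

extract_video_features(
    "data/finetune/activitynet/v_1QIUV7WYKXg.mp4",
    "data/finetune/video_features/activitynet/v_1QIUV7WYKXg.safetensors",
)

Image features follow the same recipe for a single frame, saved with shape [256, 1024].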

Structure

The folder structure should be organized as follows before training.

Flash-VStream
β”œβ”€β”€ checkpoints-finetune
β”œβ”€β”€ checkpoints-pretrain
β”œβ”€β”€ ckpt
β”‚   β”œβ”€β”€ clip-vit-large-patch14
β”‚   β”œβ”€β”€ vicuna-7b-v1.5
β”œβ”€β”€ data
β”‚   β”œβ”€β”€ pretrain
β”‚   β”‚   β”œβ”€β”€ llava_558k_with_webvid.json
β”‚   β”‚   β”œβ”€β”€ image_features
β”‚   β”‚   β”œβ”€β”€ images
β”‚   β”‚   β”œβ”€β”€ video_features
β”‚   β”‚   β”œβ”€β”€ videos
β”‚   β”œβ”€β”€ finetune
β”‚   β”‚   β”œβ”€β”€ llava_v1_5_mix665k_with_video_chatgpt.json
β”‚   β”‚   β”œβ”€β”€ activitynet
β”‚   β”‚   β”œβ”€β”€ coco
β”‚   β”‚   β”œβ”€β”€ gqa
β”‚   β”‚   β”œβ”€β”€ image_features
β”‚   β”‚   β”‚   β”œβ”€β”€ coco
β”‚   β”‚   β”‚   β”œβ”€β”€ gqa
β”‚   β”‚   β”‚   β”œβ”€β”€ ocr_vqa
β”‚   β”‚   β”‚   β”œβ”€β”€ textvqa
β”‚   β”‚   β”‚   β”œβ”€β”€ vg
β”‚   β”‚   β”œβ”€β”€ ocr_vqa
β”‚   β”‚   β”œβ”€β”€ textvqa
β”‚   β”‚   β”œβ”€β”€ vg
β”‚   β”‚   β”œβ”€β”€ video_features
β”‚   β”‚   β”‚   β”œβ”€β”€ activitynet
β”‚   β”œβ”€β”€ eval_video
β”‚   β”‚   β”œβ”€β”€ ActivityNet-QA
β”‚   β”‚   β”‚   β”œβ”€β”€ video_features
β”‚   β”‚   β”‚   β”œβ”€β”€ test_qa.json
β”‚   β”‚   β”œβ”€β”€ MSRVTT-QA
β”‚   β”‚   β”‚   β”œβ”€β”€ video_features
β”‚   β”‚   β”‚   β”œβ”€β”€ test_qa.json
β”‚   β”‚   β”œβ”€β”€ MSVD-QA
β”‚   β”‚   β”‚   β”œβ”€β”€ video_features
β”‚   β”‚   β”‚   β”œβ”€β”€ test_qa.json
β”‚   β”‚   β”œβ”€β”€ nextoe
β”‚   β”‚   β”‚   β”œβ”€β”€ video_features
β”‚   β”‚   β”‚   β”œβ”€β”€ test_qa.json
β”‚   β”‚   β”œβ”€β”€ vstream
β”‚   β”‚   β”‚   β”œβ”€β”€ video_features
β”‚   β”‚   β”‚   β”œβ”€β”€ test_qa.json
β”‚   β”‚   β”œβ”€β”€ vstream-realtime
β”‚   β”‚   β”‚   β”œβ”€β”€ video_features
β”‚   β”‚   β”‚   β”œβ”€β”€ test_qa.json
β”œβ”€β”€ flash_vstream
β”œβ”€β”€ scripts

Train

Flash-VStream is trained on 8 A100 GPUs with 80GB memory each. To train on fewer GPUs, you can reduce per_device_train_batch_size and increase gradient_accumulation_steps accordingly, always keeping the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus. If your GPUs have less than 80GB of memory, you may try the ZeRO-2 or ZeRO-3 stages.
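
For example, if the provided scripts were to use per_device_train_batch_size=16 with gradient_accumulation_steps=1 on 8 GPUs (a global batch size of 128; check scripts/train_and_eval.sh for the actual values), training on 4 GPUs would need per_device_train_batch_size=16 and gradient_accumulation_steps=2 to keep the global batch size at 128.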

Please make sure you download and organize the data following Preparation before training.

Like LLaVA, Flash-VStream has two training stages: pretrain and finetune. Their checkpoints are saved in the checkpoints-pretrain and checkpoints-finetune folders. The two stages take about 15 hours in total on 8 A100 GPUs.

If you want to train Flash-VStream from pretrained LLM and evaluate it, please run the following command:

bash scripts/train_and_eval.sh

Evaluation

Please make sure you download and organize the data following Preparation before evaluation.

If you want to evaluate a Flash-VStream model, please run the following command:

bash scripts/eval.sh

Real-time CLI Inference

We provide a real-time CLI inference script, which simulates streaming video input by reading frames from a video file at a fixed frame rate. You can ask any question and get an answer at any timestamp of the video stream. Run the following command to try it:

bash scripts/realtime_cli.sh

VStream-QA Benchmark

Please download the VStream-QA benchmark following this repo.

Citation

If you find this project useful in your research, please consider citing:

@article{flashvstream,
      title={Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams}, 
      author={Haoji Zhang and Yiqin Wang and Yansong Tang and Yong Liu and Jiashi Feng and Jifeng Dai and Xiaojie Jin},
      year={2024},
      eprint={2406.08085},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

We would like to thank the following repos for their great work:

License


This project is licensed under the Apache-2.0 License.