
Video-LLaMA

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

This is the repo for the Video-LLaMA project (EMNLP 2023 Demo), which aims to empower large language models with video and audio understanding capabilities.

News

Introduction

Example Outputs

Pre-trained & Fine-tuned Checkpoints

The following checkpoints store learnable parameters (positional embedding layers, Video/Audio Q-former, and linear projection layers) only.

The following checkpoints are the full weights (visual encoder + audio encoder + Q-Formers + language decoder) to launch Video-LLaMA:

| Checkpoint | Link | Note |
| --- | --- | --- |
| Video-LLaMA-2-7B-Pretrained | link | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
| Video-LLaMA-2-7B-Finetuned | link | Fine-tuned on the instruction-tuning data from MiniGPT-4, LLaVA, and VideoChat |
| Video-LLaMA-2-13B-Pretrained | link | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
| Video-LLaMA-2-13B-Finetuned | link | Fine-tuned on the instruction-tuning data from MiniGPT-4, LLaVA, and VideoChat |
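
Because the learnable-parameter checkpoints hold only the newly trained modules, they are typically merged into a model whose frozen encoders and language decoder are initialized separately. The sketch below illustrates that general pattern in PyTorch; it is a hypothetical helper, not the repo's actual loading code, and the exact structure of the saved state dict may differ.

import torch

def load_learnable_params(model, ckpt_path):
    # Load the partial checkpoint on CPU; checkpoints of this kind often nest the
    # weights under a "model" key, so fall back to the raw dict if they don't.
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state_dict = ckpt.get("model", ckpt)
    # strict=False: only the learnable modules (positional embeddings, Video/Audio
    # Q-Formers, linear projections) are present, so missing keys for the frozen
    # visual/audio encoders and the language decoder are expected.
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    return missing, unexpected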

Usage

Environment Preparation

First, install ffmpeg.

apt update
apt install ffmpeg

Then, create a conda environment:

conda env create -f environment.yml
conda activate videollama
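
Optionally, you can confirm that the new environment is usable before moving on. The check below is not part of the repo; it only verifies that a GPU-enabled PyTorch build is visible, which the demo and training scripts rely on.

import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())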

Prerequisites

Earlier versions of this repo required you to download several checkpoints manually before use. This is no longer necessary: you don't have to do anything for this step now.

How to Run Demo Locally

First, set llama_model (the path to the language decoder), imagebind_ckpt_path (the path to the audio encoder), ckpt (the path to the VL branch checkpoint), and ckpt_2 (the path to the AL branch checkpoint) in eval_configs/video_llama_eval_withaudio.yaml accordingly. Then run the script:

python demo_audiovideo.py \
    --cfg-path eval_configs/video_llama_eval_withaudio.yaml \
    --model_type llama_v2 \
    --gpu-id 0
# use "--model_type vicuna" instead if the language decoder is a Vicuna checkpoint
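
Before launching the demo, it can help to confirm that the four paths actually exist. The snippet below is an optional helper, not part of the repo; it assumes the config is plain YAML (readable with PyYAML) and searches the whole config tree so it does not depend on the exact nesting of the keys.

import os
import yaml

KEYS = {"llama_model", "imagebind_ckpt_path", "ckpt", "ckpt_2"}

def find_paths(node, found=None):
    # Recursively collect the values of the keys we care about, wherever they are nested.
    if found is None:
        found = {}
    if isinstance(node, dict):
        for key, value in node.items():
            if key in KEYS and isinstance(value, str):
                found[key] = value
            find_paths(value, found)
    elif isinstance(node, list):
        for item in node:
            find_paths(item, found)
    return found

with open("eval_configs/video_llama_eval_withaudio.yaml") as f:
    cfg = yaml.safe_load(f)

for key, path in find_paths(cfg).items():
    # A MISSING entry usually means a typo in the path (a hub-style model ID would
    # also show as MISSING even though it is valid).
    status = "OK" if os.path.exists(path) else "MISSING"
    print(f"{key}: {path} [{status}]")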

Training

The training of each cross-modal branch (i.e., the VL branch or the AL branch) in Video-LLaMA consists of two stages:

  1. Pre-training on the Webvid-2.5M video caption dataset and LLaVA-CC3M image caption dataset.

  2. Fine-tuning using the image-based instruction-tuning data from MiniGPT-4/LLaVA and the video-based instruction-tuning data from VideoChat.

1. Pre-training

Data Preparation

Download the metadata and videos following the instructions in the official GitHub repo of WebVid. The expected folder structure of the dataset is shown below:

|webvid_train_data
|──filter_annotation
|────0.tsv
|──videos
|────000001_000050
|──────1066674784.mp4
|cc3m
|──filter_cap.json
|──image
|────GCC_train_000000000.jpg
|────...
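
After downloading, a quick check that the data matches the layout above can save a failed training run. The snippet below is optional and not part of the repo; the root directory names are placeholders taken from the tree above, so adjust them if you keep the datasets elsewhere.

from pathlib import Path

webvid_root = Path("webvid_train_data")  # placeholder root, as in the tree above
cc3m_root = Path("cc3m")                 # placeholder root, as in the tree above

checks = {
    "WebVid annotations (*.tsv)": any((webvid_root / "filter_annotation").glob("*.tsv")),
    "WebVid videos (*.mp4)": any((webvid_root / "videos").rglob("*.mp4")),
    "CC3M captions (filter_cap.json)": (cc3m_root / "filter_cap.json").is_file(),
    "CC3M images (*.jpg)": any((cc3m_root / "image").glob("*.jpg")),
}

for name, ok in checks.items():
    print(f"{name}: {'found' if ok else 'NOT FOUND'}")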

Script

Configure the checkpoint and dataset paths in visionbranch_stage1_pretrain.yaml and audiobranch_stage1_pretrain.yaml, respectively. Then run the script:

conda activate videollama
# for pre-training VL branch
torchrun --nproc_per_node=8 train.py --cfg-path ./train_configs/visionbranch_stage1_pretrain.yaml

# for pre-training AL branch
torchrun --nproc_per_node=8 train.py --cfg-path ./train_configs/audiobranch_stage1_pretrain.yaml

2. Instruction Fine-tuning

Data

For now, the fine-tuning dataset consists of the image-based instruction-tuning data from MiniGPT-4 and LLaVA and the video-based instruction-tuning data from VideoChat.

Script

Configure the checkpoint and dataset paths in visionbranch_stage2_finetune.yaml and audiobranch_stage2_finetune.yaml, respectively. Then run the following script:

conda activate videollama
# for fine-tuning VL branch
torchrun --nproc_per_node=8 train.py --cfg-path ./train_configs/visionbranch_stage2_finetune.yaml

# for fine-tuning AL branch
torchrun --nproc_per_node=8 train.py --cfg-path ./train_configs/audiobranch_stage2_finetune.yaml

Recommended GPUs

Acknowledgement

We are grateful to the following awesome projects that Video-LLaMA builds on:

The logo of Video-LLaMA is generated by Midjourney.

Term of Use

Our Video-LLaMA is a research preview intended for non-commercial use only. You must NOT use Video-LLaMA for any illegal, harmful, violent, racist, or sexual purposes. You are strictly prohibited from engaging in any activity that may violate these guidelines.

Citation

If you find our project useful, please star our repo and cite our paper as follows:

@article{damonlpsg2023videollama,
  author = {Zhang, Hang and Li, Xin and Bing, Lidong},
  title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  year = 2023,
  journal = {arXiv preprint arXiv:2306.02858},
  url = {https://arxiv.org/abs/2306.02858}
}