LLaVA-Hound:
Video Large Multimodal Models from Large-scale Training

Official implementation for the papers:

Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward

Improve Vision Language Model Chain-of-thought Reasoning

Release

- [10/30] Following requests, released 50k raw training videos from ActivityNet.
- [10/22] Related work on VLM CoT reasoning with distillation, SFT, and RL: LLaVA-Reasoner-DPO.
- [4/14] Video SFT data and script.
- [4/3] DPO 17k data + training script; pre-training data (video 900k + image 650k).
- [4/2] Project page set up, paper preprint, test data pipeline.

Dataset and Model

In the Hugging Face repo, we release:

Datasets:

- Test data: ShareGPTVideo/test_video_and_instruction
  - Original videos are released at ShareGPTVideo/test_raw_video_data in case they are needed.
- Train data: ShareGPTVideo/train_video_and_instruction
  - 900k detailed captions.
  - 900k frames data: 300k for fine-tuning plus the remaining 600k, 900k in total for pre-training.
  - Video QA data: 900k QA pairs, with a 240k subset used in our experiments.
  - Video instruction data for SFT: we provide image instruction data and a mix of video caption and QA data for SFT; see SFT training for usage.
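
The dataset repos listed above can also be fetched by hand. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and that these repos are dataset-type repos on the Hub. The `DATASET_REPOS` mapping and the `download_dataset` helper are illustrative names, not part of this codebase, and the project's own setup scripts may already cover this step.

```python
# Dataset repos named in this README (keys are illustrative labels).
DATASET_REPOS = {
    "test": "ShareGPTVideo/test_video_and_instruction",
    "test_raw": "ShareGPTVideo/test_raw_video_data",
    "train": "ShareGPTVideo/train_video_and_instruction",
}


def download_dataset(split, local_dir):
    """Download one dataset repo from the Hugging Face Hub into local_dir."""
    # Imported lazily so the mapping above is usable without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=DATASET_REPOS[split],
        repo_type="dataset",  # assumption: these are dataset-type repos
        local_dir=local_dir,
    )
```

Calling `download_dataset("test", "data/test_video_and_instruction")` (a hypothetical target directory) would fetch the test set. The training repo is much larger, so it is probably worth starting with the test data.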

Models:

- Pre-trained ckpt on large-scale video (and image) captions: ShareGPTVideo/LLaVA-Hound-Pretrain
- Fine-tuned ckpt on video (and image) instruction: ShareGPTVideo/LLaVA-Hound-SFT
- DPO ckpt with 17k video preference data: ShareGPTVideo/LLaVA-Hound-DPO
- Additionally, ShareGPTVideo/LLaVA-Hound-SFT-Image_only

Setup:
```
# setup requirements
source setup/setup_env.sh
```

You need to fill in the required paths and API tokens in set_path.sh.
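
Before running the examples, it can help to confirm that the expected variables are actually set. The following is a minimal sketch: `CACHE_DIR` is the only variable name confirmed by the inference examples in this README (they read `os.environ['CACHE_DIR']`), and `REQUIRED_VARS` and `missing_env_vars` are hypothetical names. Extend the list with whatever set_path.sh defines in your checkout.

```python
import os

# Assumption: only CACHE_DIR is confirmed by this README; add the other
# variable names that set_path.sh exports.
REQUIRED_VARS = ["CACHE_DIR"]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Unset variables:", ", ".join(missing))
```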


# Inference Example for DPO/SFT Model
```bash
cd llava_hound_dpo
sudo apt-get install ffmpeg
```

```python
import os

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from inference.inference_utils import ModelInference, decode2frame

video_path = "examples/sample_msrvtt.mp4"

# options: ["ShareGPTVideo/LLaVA-Hound-DPO", "ShareGPTVideo/LLaVA-Hound-SFT", "ShareGPTVideo/LLaVA-Hound-SFT-Image_only"]
model_path = "ShareGPTVideo/LLaVA-Hound-DPO"
model_name = get_model_name_from_path(model_path)
# CACHE_DIR must be set in the environment, otherwise this raises KeyError
tokenizer, model, processor, context_len = load_pretrained_model(
    model_path, model_base=None, model_name=model_name, cache_dir=os.environ["CACHE_DIR"]
)
inference_model = ModelInference(model=model, tokenizer=tokenizer, processor=processor, context_len=context_len)

# our pipeline: decode the video into a frame directory, then query the frames
frame_dir, _ = os.path.splitext(video_path)
decode2frame(video_path, frame_dir, verbose=True)
question = "What is the evident theme in the video?"
response = inference_model.generate(
    question=question,
    modal_path=frame_dir,
    temperature=0,
)
print(response)

# using decord: pass the video file directly
response = inference_model.generate(
    question=question,
    modal_path=video_path,
    temperature=0,
    video_decode_backend="decord",
)
print(response)
```
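
In the frame-based path of the example, `frame_dir` is simply the video path with its extension removed, so frames for `examples/sample_msrvtt.mp4` are presumably placed under `examples/sample_msrvtt`. The following is a small sketch of that derivation, using a hypothetical helper name (`frame_dir_for`) that is not part of the repo:

```python
import os


def frame_dir_for(video_path):
    """Return the video path without its extension, as the example does."""
    frame_dir, _ext = os.path.splitext(video_path)
    return frame_dir


print(frame_dir_for("examples/sample_msrvtt.mp4"))  # examples/sample_msrvtt
```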

Inference Example for Detailed Caption Model

To generate detailed video captions with our pre-trained ckpt, use:

```python
import os

import numpy as np
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from inference.inference_utils import ModelInference, decode2frame, detail_templates

video_path = "examples/sample_msrvtt.mp4"

model_path = "ShareGPTVideo/LLaVA-Hound-Pretrain"
model_name = get_model_name_from_path(model_path)
# CACHE_DIR must be set in the environment, otherwise this raises KeyError
tokenizer, model, processor, context_len = load_pretrained_model(
    model_path, model_base=None, model_name=model_name, cache_dir=os.environ["CACHE_DIR"]
)
inference_model = ModelInference(model=model, tokenizer=tokenizer, processor=processor, context_len=context_len)

question = np.random.choice(detail_templates)  # use pretrained template questions

# our pipeline: decode the video into a frame directory, then query the frames
frame_dir, _ = os.path.splitext(video_path)
decode2frame(video_path, frame_dir, verbose=True)
response = inference_model.generate(
    question=question,
    modal_path=frame_dir,
    temperature=0,
)
print(response)

# using decord: pass the video file directly
response = inference_model.generate(
    question=question,
    modal_path=video_path,
    temperature=0,
    video_decode_backend="decord",
)
print(response)
```

Testing with one-line command

```bash
# setup data
source setup/setup_test_data.sh

# Eval for official (a subset of 5k qa)
bash test/pipeline/outdomain_official_test_pipeline.sh \
$model_output_name \
$model_name

# Eval for our in-domain
bash test/pipeline/indomain_test_pipeline.sh \
$model_output_name \
$model_name

# Eval for our out-of-domain
bash test/pipeline/outdomain_test_pipeline.sh \
$model_output_name \
$model_name
```

Example of official testing with the DPO model:

```bash
bash test/pipeline/outdomain_official_test_pipeline.sh \
videollava_dpo \
ShareGPTVideo/LLaVA-Hound-DPO
```
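
Each pipeline script takes the same two positional arguments: an output name and a model name. When evaluating several checkpoints, the commands could be assembled programmatically. The following is a minimal sketch, assuming the script paths shown above; `PIPELINES` and `eval_command` are hypothetical names, not part of the repo.

```python
import shlex

# Script paths as given in this README; the dictionary keys are illustrative.
PIPELINES = {
    "official": "test/pipeline/outdomain_official_test_pipeline.sh",
    "indomain": "test/pipeline/indomain_test_pipeline.sh",
    "outdomain": "test/pipeline/outdomain_test_pipeline.sh",
}


def eval_command(pipeline, model_output_name, model_name):
    """Build the argv list for one evaluation pipeline script."""
    return ["bash", PIPELINES[pipeline], model_output_name, model_name]


cmd = eval_command("official", "videollava_dpo", "ShareGPTVideo/LLaVA-Hound-DPO")
print(shlex.join(cmd))
# To actually launch it: subprocess.run(cmd, check=True)
```

Building an argument list rather than a single shell string avoids quoting problems if an output name ever contains spaces.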

For more details, including discussion, testing other SOTA models, and testing customized models, refer to the test readme.

Training

For DPO training, refer to DPO data setup and training.

For pre-training and SFT, refer to Pretrain + SFT.

Reference

@article{zhang2024direct,
  title={Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward},
  author={Zhang, Ruohong and Gui, Liangke and Sun, Zhiqing and Feng, Yihao and Xu, Keyang and Zhang, Yuanhan and Fu, Di and Li, Chunyuan and Hauptmann, Alexander and Bisk, Yonatan and others},
  journal={arXiv preprint arXiv:2404.01258},
  year={2024}
}

Acknowledgement

The code is built upon the following projects:

- Video-LLaVA as the LMM architecture
- trl for DPO implementation

Thanks for their great work!
