
LLaVA-Hound: Video Large Multimodal Models from Large-scale Training

Official implementation for the paper:

Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward

Related:

Improve Vision Language Model Chain-of-thought Reasoning

Release

Dataset and Model

In our Hugging Face repo, we release the following (a download sketch follows the lists below):

Datasets:

  1. Test data: ShareGPTVideo/test_video_and_instruction
  2. Train data: ShareGPTVideo/train_video_and_instruction

Models:

  1. Pre-trained ckpt on large scale video (and image) caption: ShareGPTVideo/LLaVA-Hound-Pretrain
  2. Fine-tuned ckpt on video (and image) instruction: ShareGPTVideo/LLaVA-Hound-SFT
  3. DPO ckpt with 17k video preference data: ShareGPTVideo/LLaVA-Hound-DPO
  4. Additionally, an SFT ckpt trained with image instruction data only: ShareGPTVideo/LLaVA-Hound-SFT-Image_only
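
The repo's own setup scripts handle data and model preparation, but as a reference, here is a minimal sketch for pulling a released checkpoint and the test data directly from the Hugging Face Hub. It assumes a recent `huggingface_hub` CLI; the target directories are arbitrary examples, not paths the repo requires.

```bash
# Sketch only: download a released checkpoint and the test data with the Hugging Face CLI.
# Requires `pip install -U huggingface_hub`; local directories below are arbitrary examples.
huggingface-cli download ShareGPTVideo/LLaVA-Hound-DPO --local-dir ckpt/LLaVA-Hound-DPO
huggingface-cli download ShareGPTVideo/test_video_and_instruction \
    --repo-type dataset --local-dir data/test_video_and_instruction
```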

Setup:

```bash
# setup requirements
source setup/setup_env.sh
```

You need to fill in the required paths and API tokens in set_path.sh.
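
As a reference, a minimal sketch of what set_path.sh might export. Only CACHE_DIR is taken from the inference examples below (which read `os.environ['CACHE_DIR']`); the token variable name is a placeholder to check against the actual file.

```bash
# Sketch of set_path.sh; variable names other than CACHE_DIR are placeholders, not the repo's exact keys.
export CACHE_DIR=/path/to/cache        # read via os.environ['CACHE_DIR'] in the inference examples
export OPENAI_API_KEY=your_api_token   # placeholder for the API token(s) the evaluation scripts need
```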


Inference Example for DPO/SFT Model

```bash
cd llava_hound_dpo
sudo apt-get install ffmpeg
```

```python
import os

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from inference.inference_utils import ModelInference, decode2frame

video_path = "examples/sample_msrvtt.mp4"

# options ["ShareGPTVideo/LLaVA-Hound-DPO", "ShareGPTVideo/LLaVA-Hound-SFT", "ShareGPTVideo/LLaVA-Hound-SFT-Image_only"]
model_path = "ShareGPTVideo/LLaVA-Hound-DPO" 
model_name = get_model_name_from_path(model_path)
tokenizer, model, processor, context_len = load_pretrained_model(model_path, model_base = None, model_name=model_name, cache_dir=os.environ['CACHE_DIR'])
inference_model = ModelInference(model=model, tokenizer=tokenizer, processor=processor, context_len=context_len)

# our pipeline
frame_dir, _ = os.path.splitext(video_path)
decode2frame(video_path, frame_dir, verbose=True)
question="What is the evident theme in the video?"
response = inference_model.generate(
    question=question,
    modal_path=frame_dir,
    temperature=0,
)
print(response)

# using decord 
response = inference_model.generate(
    question=question,
    modal_path=video_path,
    temperature=0,
    video_decode_backend="decord",
)
print(response)

```

Inference Example for Detailed Caption Model

To generate detailed video captions with our pretrained ckpt, use:

```python
import os

import numpy as np
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from inference.inference_utils import ModelInference, decode2frame, detail_templates

video_path = "examples/sample_msrvtt.mp4"

model_path = "ShareGPTVideo/LLaVA-Hound-Pretrain"
model_name = get_model_name_from_path(model_path)
tokenizer, model, processor, context_len = load_pretrained_model(model_path, model_base = None, model_name=model_name, cache_dir=os.environ['CACHE_DIR'])
inference_model = ModelInference(model=model, tokenizer=tokenizer, processor=processor, context_len=context_len)

question = np.random.choice(detail_templates) # use pretrained template questions

# our pipeline
frame_dir, _ = os.path.splitext(video_path)
decode2frame(video_path, frame_dir, verbose=True)
response = inference_model.generate(
    question=question,
    modal_path=frame_dir,
    temperature=0,
)
print(response)

# using decord 
response = inference_model.generate(
    question=question,
    modal_path=video_path,
    temperature=0,
    video_decode_backend="decord",
)
print(response)

```

Testing with one-line command

```bash
# setup data
source setup/setup_test_data.sh

# Eval for official (a subset of 5k qa)
bash test/pipeline/outdomain_official_test_pipeline.sh \
$model_output_name \
$model_name

# Eval for our in-domain
bash test/pipeline/indomain_test_pipeline.sh \
$model_output_name \
$model_name

# Eval for our out-of-domain 
bash test/pipeline/outdomain_test_pipeline.sh \
$model_output_name \
$model_name

```

Example of official testing with the DPO model:

```bash
bash test/pipeline/outdomain_official_test_pipeline.sh \
videollava_dpo \
ShareGPTVideo/LLaVA-Hound-DPO

```

For more details, including discussion, testing of other SOTA models, and customized model testing, refer to the test README.

Training

For DPO training, refer to DPO data setup and training.

For pretraining + SFT, refer to Pretrain + SFT.

Reference

```bibtex
@article{zhang2024direct,
  title={Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward},
  author={Zhang, Ruohong and Gui, Liangke and Sun, Zhiqing and Feng, Yihao and Xu, Keyang and Zhang, Yuanhan and Fu, Di and Li, Chunyuan and Hauptmann, Alexander and Bisk, Yonatan and others},
  journal={arXiv preprint arXiv:2404.01258},
  year={2024}
}
```

Acknowledgement

Code is built upon the following projects:

Thanks for their great work!