LLaVA-VL / LLaVA-NeXT

Apache License 2.0
2.93k stars 252 forks source link

LLaVA-NeXT: Open Large Multimodal Models

Static Badge Static Badge llava_next-blog

llava_onevision-demo llava_next-video_demo llava_next-interleave_demo Openbayes Demo

llava_video-checkpoints llava_onevision-checkpoints llava_next-interleave_checkpoints llava_next-image_checkpoints

Release Notes

Usage and License Notices: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. Llama-1/2 community license for LLaMA-2 and Vicuna-v1.5, Tongyi Qianwen RESEARCH LICENSE AGREEMENT and Llama-3 Research License). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.

Models & Scripts

Installation

1. Clone this repository and navigate to the LLaVA folder:

git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT

2. Install the inference package:

conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # Enable PEP 660 support.
pip install -e ".[train]"

Project Navigation

Please checkout the following page for more inference & evaluation details.

- LLaVA-OneVision: Easy Task Transfer

- LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild

- LLaVA-NeXT: A Strong Zero-shot Video Understanding Model

- LLaVA-NeXT: Tackling Multi-image, Video, and 3D in Large Multimodal Models

SGLang for SpeedUp Inference and Deployment

We use SGLang to speed up inference and deployment of LLaVA-NeXT. You could make LLaVA-NeXT as a backend API service with SGLang.

Prepare Environment: Following the instruction in the sglang

LLaVA-NeXT/OneVision

Checkout the HTTP Post/Get and SRT usage at sglang/examples/runtime/llava_onevision

LLaVA-NeXT (Video)

Launch and Run on (K) Nodes:

Citation

If you find it useful for your research and applications, please cite related papers/blogs using this BibTeX:

@article{li2024llava,
  title={LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models},
  author={Li, Feng and Zhang, Renrui and Zhang, Hao and Zhang, Yuanhan and Li, Bo and Li, Wei and Ma, Zejun and Li, Chunyuan},
  journal={arXiv preprint arXiv:2407.07895},
  year={2024}
}

@misc{li2024llavanext-ablations,
    title={LLaVA-NeXT: What Else Influences Visual Instruction Tuning Beyond Data?},
    url={https://llava-vl.github.io/blog/2024-05-25-llava-next-ablations/},
    author={Li, Bo and Zhang, Hao and Zhang, Kaichen and Guo, Dong and Zhang, Yuanhan and Zhang, Renrui and Li, Feng and Liu, Ziwei and Li, Chunyuan},
    month={May},
    year={2024}
}

@misc{li2024llavanext-strong,
    title={LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild},
    url={https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/},
    author={Li, Bo and Zhang, Kaichen and Zhang, Hao and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Yuanhan and Liu, Ziwei and Li, Chunyuan},
    month={May},
    year={2024}
}

@misc{zhang2024llavanext-video,
  title={LLaVA-NeXT: A Strong Zero-shot Video Understanding Model},
  url={https://llava-vl.github.io/blog/2024-04-30-llava-next-video/},
  author={Zhang, Yuanhan and Li, Bo and Liu, haotian and Lee, Yong jae and Gui, Liangke and Fu, Di and Feng, Jiashi and Liu, Ziwei and Li, Chunyuan},
  month={April},
  year={2024}
}

@misc{liu2024llavanext,
    title={LLaVA-NeXT: Improved reasoning, OCR, and world knowledge},
    url={https://llava-vl.github.io/blog/2024-01-30-llava-next/},
    author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae},
    month={January},
    year={2024}
}

@misc{liu2023improvedllava,
      title={Improved Baselines with Visual Instruction Tuning}, 
      author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae},
      publisher={arXiv:2310.03744},
      year={2023},
}

@misc{liu2023llava,
      title={Visual Instruction Tuning}, 
      author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
      publisher={NeurIPS},
      year={2023},
}

Acknowledgement

Related Projects

For future project ideas, please check out: