PKU-YuanGroup / Video-LLaVA

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Apache License 2.0

If you like our project, please give us a star ⭐ on GitHub to get the latest updates.
💡 I also have other video-language projects that may interest you ✨.

> **Open-Sora-Plan**
>
> **MoE-LLaVA: Mixture of Experts for Large Vision-Language Models**
> Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, Li Yuan
>
> **LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment**
> Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Wancai Zhang, Zhifeng Li, Wei Liu, Li Yuan

## 📰 News

* **[2024.05.15]** 🤝🤝🤝 Thanks to the generous contributions of @zucchini-nlp, Video-LLaVA is now available in the Transformers library!
* **[2024.01.27]** 👀👀👀 Our MoE-LLaVA is released! A sparse model with 3B parameters outperformed the dense model with 7B parameters.
* **[2024.01.17]** 🔥🔥🔥 Our LanguageBind has been accepted at ICLR 2024!
* **[2024.01.16]** 🔥🔥🔥 We reorganized the code and added support for LoRA fine-tuning; see `scripts/v1_5/`.
* **[2023.11.30]** 🤝 Thanks to the generous contributions of the community, the OpenXLab demo is now accessible.
* **[2023.11.23]** We are training a new and powerful model.
* **[2023.11.21]** 🤝 Check out the Replicate demo, created by @nateraw, who has generously supported our research!
* **[2023.11.20]** 🤗 The Hugging Face demo and **all code & datasets** are available now! You are welcome to **watch** 👀 this repository for the latest updates.

## 😮 Highlights

Video-LLaVA exhibits remarkable interactive capabilities between images and videos, despite the absence of image-video pairs in the dataset.

### 💡 Simple baseline, learning united visual representation by alignment before projection

- With **the binding of unified visual representations to the language feature space**, we enable an LLM to perform visual reasoning on both images and videos simultaneously.

### 🔥 High performance, complementary learning with video and image

- Extensive experiments demonstrate **the complementarity of modalities**, showing significant superiority over models designed specifically for either images or videos.

## 🤗 Demo

### Gradio Web UI

We highly recommend trying our web demo with the following command, which incorporates all features currently supported by Video-LLaVA. We also provide an online demo on Hugging Face Spaces.
```bash
python -m videollava.serve.gradio_web_server
```

### CLI Inference

```bash
CUDA_VISIBLE_DEVICES=0 python -m videollava.serve.cli --model-path "LanguageBind/Video-LLaVA-7B" --file "path/to/your/video.mp4" --load-4bit
```

```bash
CUDA_VISIBLE_DEVICES=0 python -m videollava.serve.cli --model-path "LanguageBind/Video-LLaVA-7B" --file "path/to/your/image.jpg" --load-4bit
```

## 🚀 Main Results

### Image understanding

### Video understanding

## 🛠️ Requirements and Installation

* Python >= 3.10
* Pytorch == 2.0.1
* CUDA Version >= 11.7
* Install required packages:

```bash
git clone
cd Video-LLaVA
conda create -n videollava python=3.10 -y
conda activate videollava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
pip install decord opencv-python git+
```

## 🤖 API

> [!Warning]
> 🚨 Upgrade transformers for quick access.
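Before upgrading, it can help to check whether the installed `transformers` already exposes the two classes used in this section. The following is a minimal sketch and not part of the Video-LLaVA codebase: the helper name `missing_video_llava_classes` is made up for illustration, and only the two class names are taken from the API example in this README.

```python
import importlib
import importlib.util

# The two class names imported by the API example in this README.
REQUIRED_CLASSES = ("VideoLlavaProcessor", "VideoLlavaForConditionalGeneration")


def missing_video_llava_classes(module_name="transformers"):
    """Return the REQUIRED_CLASSES names that `module_name` cannot provide.

    Hypothetical helper, for illustration only.
    """
    # A module that is not installed provides none of the classes.
    if importlib.util.find_spec(module_name) is None:
        return list(REQUIRED_CLASSES)

    module = importlib.import_module(module_name)
    missing = []
    for name in REQUIRED_CLASSES:
        try:
            getattr(module, name)
        except Exception:
            # Covers both an absent attribute and a failed lazy import.
            missing.append(name)
    return missing
```

Calling `missing_video_llava_classes()` returns an empty list when both classes can be loaded; a non-empty list suggests that the installed version predates Video-LLaVA support and should be upgraded.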
```
pip install -U transformers
```

If you need to install `av`, run:

```
python -m pip install av
```

```
import av
import numpy as np
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration


def read_video_pyav(container, indices):
    """Decode the frames at the given sorted indices into one RGB array."""
    frames = []
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        # Stop decoding once the last requested frame has been passed.
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])


model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")
processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")

prompt = "USER: