LLaVA-VL / LLaVA-NeXT


LLaVa-NeXT-Video is added to 🤗 Transformers! #79


zucchini-nlp commented 4 days ago

Hey all!

The video models are all supported in Transformers now and will be part of the v4.42 release. Feel free to check out the model checkpoints here.

To get the model, update transformers by running:

!pip install --upgrade git+https://github.com/huggingface/transformers.git

Inference with videos can be done as follows:

import av
import numpy as np
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

def read_video_pyav(container, indices):
    """Decode the frames at `indices` from a PyAV container and return
    them as a (num_frames, height, width, 3) uint8 array."""
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")
model = LlavaNextVideoForConditionalGeneration.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf", device_map="auto")
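# Assumption (not in the original post): on a GPU, loading in half precision
# roughly halves memory use:
#   import torch
#   model = LlavaNextVideoForConditionalGeneration.from_pretrained(
#       "llava-hf/LLaVA-NeXT-Video-7B-hf", torch_dtype=torch.float16, device_map="auto"
#   )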

video_path = "YOUR-LOCAL-VIDEO-PATH
container = av.open(video_path)

# sample 8 frames spaced uniformly across the video (increase for longer videos)
total_frames = container.streams.video[0].frames
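# Assumption (not in the original post): some containers do not store the
# frame count in their header, which leaves `total_frames` at 0 and would
# break the arange below; counting frames by decoding once is a fallback.
if total_frames == 0:
    total_frames = sum(1 for _ in container.decode(video=0))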
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)

# Prepare a chat formatted input
# Each "content" is a list of dicts and you can add image/video/text modalities
conversation = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "What do you see in this video?"},
              {"type": "video"},
              ],
      },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=prompt, videos=clip, return_tensors="pt").to(model.device)

# Generate; do_sample=True with temperature=0.9 gives varied answers,
# drop both for deterministic greedy decoding
generate_ids = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.9)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])
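Since each "content" entry can be an image, a video, or plain text, the same processor can mix modalities in a single prompt. A minimal sketch of that (not from the original post; the image path is a placeholder and the question is illustrative):

from PIL import Image

image = Image.open("YOUR-LOCAL-IMAGE-PATH")

conversation = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "What is in the image, and how does it relate to the video?"},
              {"type": "image"},
              {"type": "video"},
              ],
      },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, videos=clip, return_tensors="pt").to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])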

Useful links:
- Colab for inference
- Colab for fine-tuning
- Transformers docs

ZhangYuanhan-AI commented 4 days ago

Cool! Thanks a lot!!!