LLaVA-VL / LLaVA-NeXT


LLaVa-NeXT-Video is added to 🤗 Transformers! #79


zucchini-nlp commented 4 days ago

Hey all!

The video models are all supported in Transformers now and will be part of the v4.42 release. Feel free to check out the model checkpoints here.

To get the model, update transformers by running:

!pip install --upgrade git+https://github.com/huggingface/transformers.git

Inference with videos can be done as follows:

import av
import numpy as np
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

def read_video_pyav(container, indices):
    """Decode the frames at `indices` from a PyAV container and return
    them as a (num_frames, height, width, 3) uint8 array."""
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")
model = LlavaNextVideoForConditionalGeneration.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf", device_map="auto")
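# Assumption (not in the original post): on a GPU, loading in half precision
# roughly halves memory use:
#   import torch
#   model = LlavaNextVideoForConditionalGeneration.from_pretrained(
#       "llava-hf/LLaVA-NeXT-Video-7B-hf", torch_dtype=torch.float16, device_map="auto"
#   )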

video_path = "YOUR-LOCAL-VIDEO-PATH
container = av.open(video_path)

# sample 8 frames spaced uniformly across the video (increase for longer videos)
total_frames = container.streams.video[0].frames
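# Assumption (not in the original post): some containers do not store the
# frame count in their header, which leaves `total_frames` at 0 and would
# break the arange below; counting frames by decoding once is a fallback.
if total_frames == 0:
    total_frames = sum(1 for _ in container.decode(video=0))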
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)

# Prepare a chat formatted input
# Each "content" is a list of dicts and you can add image/video/text modalities
conversation = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "What do you see in this video?"},
              {"type": "video"},
              ],
      },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=prompt, videos=clip, return_tensors="pt").to(model.device)

# Generate; do_sample=True with temperature=0.9 gives varied answers,
# drop both for deterministic greedy decoding
generate_ids = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.9)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])
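Since each "content" entry can be an image, a video, or plain text, the same processor can mix modalities in a single prompt. A minimal sketch of that (not from the original post; the image path is a placeholder and the question is illustrative):

from PIL import Image

image = Image.open("YOUR-LOCAL-IMAGE-PATH")

conversation = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "What is in the image, and how does it relate to the video?"},
              {"type": "image"},
              {"type": "video"},
              ],
      },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, videos=clip, return_tensors="pt").to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])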

Useful links:
- Colab for inference
- Colab for fine-tuning
- Transformers docs

ZhangYuanhan-AI commented 4 days ago

Cool! Thanks a lot!!!