LLaVA-VL / LLaVA-NeXT


Fix: videos in LLaVa-OV #195

Closed zucchini-nlp closed 2 months ago

zucchini-nlp commented 2 months ago

Currently, running the demo notebook for LLaVA OneVision on the video modality doesn't apply pooling to all video patches/frames, because the modality list holds one value per prompt while a video can contain several frames. This PR expands the modality list in the demo notebook by replicating its value for every video frame.

I tried to see whether we could expand the modalities inside the modeling code, but it seems hard to infer which visual input is an image and which is a video, so I decided to delegate the expansion to users.
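For reference, the gist of the change (not the exact notebook diff) is to repeat the "video" entry once per frame when frames are passed as separate tensors. A minimal sketch, assuming model, image_processor, input_ids, and a list of PIL frames video_frames are set up as in the snippet further down:

# Sketch of the demo-notebook fix (illustrative; assumes `model`, `image_processor`,
# `input_ids`, and a list of PIL frames `video_frames` as in the code further below).
# When each frame is passed as its own tensor, the "video" modality entry has to be
# repeated per frame so that pooling is applied to every frame, not only the first one.
frame_tensors = [
    image_processor.preprocess(frame, return_tensors="pt")["pixel_values"].half().cuda()
    for frame in video_frames
]
modalities = ["video"] * len(frame_tensors)  # one entry per frame, not one per prompt

cont = model.generate(
    input_ids,
    images=frame_tensors,
    modalities=modalities,
    do_sample=False,
    max_new_tokens=4096,
)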

kcz358 commented 2 months ago

Hi @zucchini-nlp, may I ask what token size you get when you print it out in your notebook? In #144, I printed the token size and it seemed that all frames had been pooled.

zucchini-nlp commented 2 months ago

@kcz358 it is 197 tokens per frame if I don't preprocess the frames with anyres. I guess it should be exactly 196, right? And in that case, shouldn't we skip appending the newline token for videos?

I am now trying to make sense of how videos work, since I am working on adding the model to transformers. Thanks!

kcz358 commented 2 months ago

@zucchini-nlp, I think there should be one newline token at the end of all video frames, rather than one image_newline token at the end of each frame. The place where I print in #144 is after the pooling but before the concatenation of image_newline. May I ask how you processed your video frames? If you feed the model frame by frame, you will likely get 197 tokens per frame, and multiple "video" modality entries need to be provided. But if you provide the video as one batch, it should be 196 tokens per frame with a single newline token appended after all the frame tokens.
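To make the bookkeeping concrete, a rough count for an 8-frame clip, assuming 196 pooled tokens per frame as discussed above (numbers are illustrative):

# Rough visual-token count for an 8-frame clip (illustrative, assumes 196
# pooled tokens per frame as discussed above).
num_frames, tokens_per_frame = 8, 196

frame_by_frame = num_frames * (tokens_per_frame + 1)  # newline after every frame: 8 * 197 = 1576
single_batch = num_frames * tokens_per_frame + 1      # one newline after all frames: 8 * 196 + 1 = 1569
print(frame_by_frame, single_batch)                   # 1576 1569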

zucchini-nlp commented 2 months ago

@kcz358 Right, I was providing it frame by frame using the demo notebook, but if I change it to one tensor per video as follows, it works as you described. Thanks!

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle

from PIL import Image
import cv2
import numpy as np
import requests
import copy
import torch

import sys
import warnings

warnings.filterwarnings("ignore")
pretrained = "lmms-lab/llava-onevision-qwen2-0.5b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # Add any other thing you want to pass in llava_model_args

model.eval()

# Function to extract frames from video
def extract_frames(video_path, num_frames=8):
    cap = cv2.VideoCapture(video_path)
    frames = []
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)

    for i in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, i)
        ret, frame = cap.read()
        if ret:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(Image.fromarray(frame))

    cap.release()
    return frames

# Load and process video
video_path = "/raid/raushan/karate.mp4"
video_frames = extract_frames(video_path)
image_tensor = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"].half().cuda()  # all frames stacked into one (num_frames, C, H, W) tensor

conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models
question = f"{DEFAULT_IMAGE_TOKEN}\nDescribe what's happening in this video."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
print(image_tensor.shape)  # sanity check: one stacked tensor for the whole clip, (num_frames, C, H, W)

cont = model.generate(
    input_ids,
    images=[image_tensor],  # a single stacked tensor per video, not one tensor per frame
    do_sample=False,
    temperature=0,
    top_p=1.0,
    max_new_tokens=4096,
    modalities=["video"]
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)