Closed · zucchini-nlp closed 2 months ago
Hi @zucchini-nlp , may I ask what token size you got when you printed it out in your notebook? In #144, I printed out the token size and it seems that all frames have been pooled.
@kcz358 it is 197 tokens per frame if I don't preprocess the frames with anyres. I guess it should be exactly 196, right? And in that case we shouldn't be appending the newline token to videos, should we?
I am now trying to make sense of how videos are handled, since I am working on adding the model to transformers. Thanks!
@zucchini-nlp , I think there will be one newline token at the end of all video frames instead of one image_newline token at the end of each frame. The place where I print in #144 is after the pooling but before the image_newline is concatenated. May I ask how you processed your video frames? If you feed the model frame by frame, you will likely get 197 tokens per frame and will need to provide multiple video modalities. But if you provide the whole video in one batch, you should get 196 tokens per frame and a single newline token at the end of all the tokens.
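As a rough sanity check, the visual token counts for the two schemes work out as follows. This is only a back-of-the-envelope sketch using the 196 pooled tokens per frame and 8 sampled frames discussed here; the numbers are illustrative, not read from the modeling code:

# Illustrative token-count comparison (assumes 196 pooled tokens per frame, 8 frames)
POOLED_TOKENS_PER_FRAME = 196
NUM_FRAMES = 8

# Frame by frame (treated like separate images): one image_newline appended per frame
frame_by_frame = NUM_FRAMES * (POOLED_TOKENS_PER_FRAME + 1)  # 8 * 197 = 1576

# Whole video in one batch: a single newline token after all frames
whole_video = NUM_FRAMES * POOLED_TOKENS_PER_FRAME + 1       # 8 * 196 + 1 = 1569

print(frame_by_frame, whole_video)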
@kcz358 Right, I was providing it frame by frame using the demo notebook, but if I change it to one tensor per video as follows, it works as you described. Thanks!
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from PIL import Image
import cv2
import numpy as np
import requests
import copy
import torch
import sys
import warnings
warnings.filterwarnings("ignore")
pretrained = "lmms-lab/llava-onevision-qwen2-0.5b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map) # Add any other thing you want to pass in llava_model_args
model.eval()
# Function to extract frames from video
def extract_frames(video_path, num_frames=8):
    cap = cv2.VideoCapture(video_path)
    frames = []
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Sample num_frames evenly spaced frame indices across the whole video
    indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)
    for i in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, i)
        ret, frame = cap.read()
        if ret:
            # OpenCV reads BGR; convert to RGB before building a PIL image
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(Image.fromarray(frame))
    cap.release()
    return frames
# Load and process video
video_path = "/raid/raushan/karate.mp4"
video_frames = extract_frames(video_path)
image_tensor = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"].half().cuda()
conv_template = "qwen_1_5" # Make sure you use correct chat template for different models
question = f"{DEFAULT_IMAGE_TOKEN}\nDescribe what's happening in this video."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
print(image_tensor.shape)
cont = model.generate(
    input_ids,
    images=[image_tensor],  # one tensor holding all frames of the video
    do_sample=False,
    temperature=0,
    top_p=1.0,
    max_new_tokens=4096,
    modalities=["video"],  # one entry for the single video, not one per frame
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)
Currently, running the demo notebook for LLaVA OneVision with the video modality doesn't apply pooling to all video patches/frames, because the modality list holds one value per prompt, while a video can contain several frames. This PR replicates the modality list in the demo notebook by copying the entry for all video frames. I tried to see whether we could expand the modalities inside the modeling code, but it seems hard to infer which visual input is an image and which is a video, so I decided to delegate the expansion to users.
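For anyone who still wants to feed frames one by one, the fix in the demo notebook amounts to replicating the modality entry once per frame. A minimal sketch of that idea, assuming per_frame_tensors is a hypothetical list of already-preprocessed per-frame pixel tensors (not a variable from the notebook):

# Frame-by-frame input: give every frame its own "video" entry in `modalities`,
# so pooling is applied to all frames, not just the first ones.
image_tensors = [t.half().cuda() for t in per_frame_tensors]  # per_frame_tensors is hypothetical
modalities = ["video"] * len(image_tensors)

cont = model.generate(
    input_ids,
    images=image_tensors,
    modalities=modalities,
    do_sample=False,
    max_new_tokens=4096,
)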