LLaVA-VL / LLaVA-NeXT


The llava-onevision model video inference code has an error #144

Open AmazDeng opened 1 month ago

AmazDeng commented 1 month ago

For the llava-onevision model, the official video inference code does not modify the image_aspect_ratio parameter, resulting in the use of the default anyres_max_9. This causes the image_features to occupy a huge amount of GPU memory during inference. Is this problematic? After all, the paper states that each frame consists of 196 tokens, but using anyres_max_9 results in a number of tokens per frame far exceeding 196. Relevant links are as follows:

https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/docs/LLaVA_OneVision_Tutorials.ipynb
https://github.com/LLaVA-VL/LLaVA-NeXT/issues/142

Additionally, why can't I see the logic for each frame corresponding to 196 tokens in the code?
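
For reference, a minimal hedged sketch of how one could check which setting is actually in effect (assuming a model loaded via load_pretrained_model as in the script later in this thread; the attribute name follows the discussion here):

# Hedged sketch: print the aspect-ratio setting that process_images reads from the model config.
print(getattr(model.config, "image_aspect_ratio", None))   # reportedly "anyres_max_9" by default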

AmazDeng commented 1 month ago

@kcz358 @ZhangYuanhan-AI @Luodian Could you please take a look at this issue?

kcz358 commented 1 month ago

I think there is a small error in the Jupyter notebook. Passing modalities=['video'] should lower the token usage.

Luodian commented 1 month ago

@kcz358 @ZhangYuanhan-AI @Luodian Could you please take a look at this issue?

Sorry, we found that we wrongly added some video-specific logic to our llava_arch.py in commit https://github.com/LLaVA-VL/LLaVA-NeXT/commit/c121c20c541220e79add65949bbbf7b9febde602.

We have now reverted it. Please try the updated code, thanks!

AmazDeng commented 1 month ago

@Luodian @ZhangYuanhan-AI @kcz358 Thank you for your response. I need to point out that the reason for the excessively high GPU memory usage during video inference is that after the process_images method completes, the image_tensors dimensions are extremely large. For a single image, the dimensions are [16, 3, 384, 384], and for 32 frames, it becomes [512, 3, 384, 384]. This issue occurs at the stage where image_tensors = process_images(video_frames, image_processor, model.config) is executed, not during the generate stage. Therefore, even if you pass modalities=["video"] during the generate stage, it doesn’t help.

The reason the first dimension of the process_images output for a single image is 16 is that image_aspect_ratio="anyres_max_9". The "anyres_max_9" setting is appropriate for single-image inference, not for video inference. I tested this using the latest code you modified, and the result is the same: GPU memory usage is still very high (about 57GB for 24 frames), and the generated features still do not have a shape of 196 per frame. So, does the process_images method also need some modifications?
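
To make the scale of the problem concrete, here is a rough, hedged back-of-the-envelope calculation using only the numbers quoted in this thread (16 anyres crops per frame, 729 patch tokens per crop, 196 pooled tokens per frame from the paper):

num_frames = 24
crops_per_frame = 16        # process_images returns [16, 3, 384, 384] per frame with anyres_max_9
tokens_per_crop = 729       # vision-tower patch tokens per 384x384 crop (the 729 dimension discussed later in this thread)
anyres_tokens = num_frames * crops_per_frame * tokens_per_crop
expected_tokens = num_frames * 196
print(anyres_tokens, expected_tokens)   # 279936 vs 4704 visual tokens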

Inference code:

import argparse
import torch
import sys
# print(f"before,sys.path============={sys.path}")
sys.path.append("/media/star/8T/PycharmProjects/github/gpt/LLaVA-NeXT")
# print(f"after,sys.path============={sys.path}")
import time

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle

import cv2
import numpy as np
from PIL import Image
import requests
import copy
import warnings

warnings.filterwarnings("ignore")
# Load the OneVision model
pretrained = "/media/star/8T/model/gpt/llava/llava-next/lmms-lab/llava-onevision/llava-onevision-qwen2-0.5b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)

model.eval()

# Function to extract frames from video
def extract_frames(video_path, num_frames=8):
    cap = cv2.VideoCapture(video_path)
    frames = []
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)

    for i in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, i)
        ret, frame = cap.read()
        if ret:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(Image.fromarray(frame))

    cap.release()
    return frames

# Load and process video
video_path = "/media/star/8T/tmp/gpt4v/video/zouxiu2_5/clip_135_140.mp4"
num_frames=24

print(f"num_frames={num_frames}")
video_frames = extract_frames(video_path,num_frames=num_frames)
print(f"model.config={model.config}")
image_tensors = process_images(video_frames, image_processor, model.config)

image_tensors = [_image.to(dtype=torch.float16, device=device) for _image in image_tensors]
print(f"image_tensors.shape={[image_tensor.shape for image_tensor in image_tensors]}")
# Prepare conversation input
conv_template = "qwen_1_5"
question = f"{DEFAULT_IMAGE_TOKEN}\nIs the model changing clothes in the video? answer the question using a single word or phrase."

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [frame.size for frame in video_frames]
print(f"image_sizes={image_sizes[:2]}")
# Generate response
cont = model.generate(
    input_ids,
    images=image_tensors,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
    modalities=["video"],
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs[0])
kcz358 commented 1 month ago

Hi, regarding the 729 dimension: I think it is correct, because there is no pooling operation inside the image-encoding function. The pooling happens after the images are encoded, in get_2dPool, and after that I think it turns into 196 dimensions.

https://github.com/LLaVA-VL/LLaVA-NeXT/blob/16dbbb395d2ba8a347caec6ab391dfa19c5f1e0b/llava/model/llava_arch.py#L254-L258
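
For intuition, a hedged sketch of the pooling arithmetic described above (not the repo's exact get_2dPool code; the hidden size is an assumption): each frame leaves the vision tower as 729 patch tokens, i.e. a 27x27 grid, and pooling that grid with stride 2 (ceil(27/2) = 14) yields 14x14 = 196 tokens per frame.

import math
import torch
import torch.nn.functional as F

frame_feat = torch.randn(1, 729, 1152)                 # (batch, tokens, hidden); 1152 is an assumed hidden size
b, n, c = frame_feat.shape
side = int(math.sqrt(n))                               # 27
stride = 2
grid = frame_feat.transpose(1, 2).reshape(b, c, side, side)
pooled = F.interpolate(grid, size=(math.ceil(side / stride),) * 2, mode="bilinear")
pooled = pooled.flatten(2).transpose(1, 2)
print(pooled.shape)                                    # torch.Size([1, 196, 1152])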


Regarding the wrong image processing for video:

Yes, I agree with you. There is an error in the tutorial again. It should not use process_images from mm_utils; that uses far more tokens than expected for the frames.

You should use the image processor to handle the frames instead:

image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].half().cuda()

Thank you for pointing it out. I will check it later and revise the notebook.
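
A hedged usage sketch of that suggestion, reusing the variable names from the inference script earlier in this thread (the printed shape is an expectation, not verified output):

# Preprocess the extracted frames directly with the image processor instead of process_images.
frames = extract_frames(video_path, num_frames=num_frames)    # list of PIL images
video_tensor = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].half().cuda()
print(video_tensor.shape)   # expected (num_frames, 3, 384, 384): one 384x384 tensor per frame,
                            # rather than the anyres tiling produced by process_images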

AmazDeng commented 1 month ago

image_features.append(self.get_2dPool(image_feat))

The code you referenced is located in the encode_multimodals method. However, in the main branch of the llava_arch.py code, encode_multimodals is commented out. @kcz358

kcz358 commented 1 month ago

These lines contain the processing logic, not the encode_multimodals.

https://github.com/LLaVA-VL/LLaVA-NeXT/blob/3fbf54b4dbd72a060104253e6f08168df48f6625/llava/model/llava_arch.py#L232-L236

AmazDeng commented 1 month ago

These lines contain the processing logic, not the encode_multimodals.

https://github.com/LLaVA-VL/LLaVA-NeXT/blob/3fbf54b4dbd72a060104253e6f08168df48f6625/llava/model/llava_arch.py#L232-L236

I did as you said and replaced "process_images" with "image_processor". I printed out the shape after the statement "image_features.append(self.get_2dPool(image_feat))", but still no 196 appeared.

I am using the llava-onevision-qwen2-7b-ov version and ran both local and online tests on the same video (https://llava-onevision.lmms-lab.com/). The results were "yes" and "no," respectively. The prompt was "Is the model changing clothes in the video? Answer the question using a single word or phrase." Clearly, the online result was correct and the local result was wrong. Therefore, I think there are still some issues with the code.


kcz358 commented 1 month ago

The problem is that you are actually still processing the video with incorrect logic, even though you are using image_processor to process the images. The video frames are treated as multiple images instead of as a video. You can see that the first frame has the 196 dimension but the rest are not being pooled. I have changed the video-reading logic to the correct one in the OneVision tutorial notebook in PR #152. Here are the results I get:

[screenshot of the results]

All the video frames are being pooled correctly. Hope it helps.
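
For reference, a hedged sketch of what the corrected calling convention looks like, based on the discussion in this thread rather than copied from PR #152 (the image_sizes handling in particular is an assumption):

# Process all frames together and pass them to generate as one video tensor with modalities=["video"].
frames = extract_frames(video_path, num_frames=num_frames)
video_tensor = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].half().to(device)

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
cont = model.generate(
    input_ids,
    images=[video_tensor],          # a single (num_frames, 3, 384, 384) tensor, not per-frame anyres tensors
    image_sizes=[frames[0].size],   # assumption: one size entry for the clip
    modalities=["video"],
    do_sample=False,
    max_new_tokens=64,
)
print(tokenizer.batch_decode(cont, skip_special_tokens=True)[0])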

Luodian commented 1 month ago

Thank you Kaichen, it's great to see the problem has been addressed. I also tested on my side and it works.

AmazDeng commented 1 month ago

The problem is that you are actually still processing the video with incorrect logic, even though you are using image_processor to process the images. The video frames are treated as multiple images instead of as a video. You can see that the first frame has the 196 dimension but the rest are not being pooled. I have changed the video-reading logic to the correct one in the OneVision tutorial notebook in PR #152. Here are the results I get: [screenshot of the results]

All the video frames are being pooled correctly. Hope it helps.

I understand now. In my original approach, I only passed in ["video"], so only a single frame was read as a video; the subsequent frames were all processed as images.

hulianyuyy commented 1 month ago

Many thanks for your question. In the tutorial, it works normally. But in the video inference code used for the evaluation benchmarks, would it still incur huge memory costs?

kcz358 commented 1 month ago

Yes, the lmms_eval evaluation logic is correct. I fixed the tutorial part using the code from lmms_eval

hulianyuyy commented 1 month ago

Yes, the lmms_eval evaluation logic is correct. I fixed the tutorial part using the code from lmms_eval

Thanks, the lmms_eval evaluation logic is correct. But when I evaluate the 7B model, it still incurs ~70GB of memory, which is far too large, as LLaVA-NeXT-Video-7B only occupies ~20GB. Maybe there is still something wrong with the inference code?

Luodian commented 1 month ago
[screenshot of the quick test]

I did a quick test, it runs in 20GB.

My script is:

FINAL_RUN_NAME=$1
TASKS=$2

MODEL_BASENAME=$(basename "$FINAL_RUN_NAME")

echo "MODEL_BASENAME: ${MODEL_BASENAME}"
cd /mnt/bn/vl-research/workspace/boli01/projects/lmms-eval

python3 -m accelerate.commands.launch --num_processes 8 --main_process_port 12399 lmms_eval \
    --model llava_onevision \
    --model_args pretrained=${FINAL_RUN_NAME},conv_template=qwen_1_5,model_name=llava_qwen \
    --tasks ${TASKS} \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix ${MODEL_BASENAME} \
    --output_path ./logs

bash /mnt/bn/vl-research/workspace/boli01/projects/lmms-eval/scripts/llava_one_vision/ov_eval.sh lmms-lab/llava-onevision-qwen2-7b-ov videomme;

hulianyuyy commented 4 weeks ago

Thanks for your reply. I will try it.