LLaVA-VL / LLaVA-NeXT

Batch inference for LLaVA One Vision #169

Open JiayiGuo821 opened 3 months ago

JiayiGuo821 commented 3 months ago

Is there a feasible way to conduct batch inference with LLaVA One Vision?

Luodian commented 3 months ago

I think the most viable way is to use sglang's run_batch interface.

https://github.com/EvolvingLMMs-Lab/sglang/tree/dev/onevision_main

After launching the backend service, run this file:

https://github.com/EvolvingLMMs-Lab/sglang/blob/dev/onevision_main/examples/quick_start/srt_example_llava.py

You can use this early feature and see the PR:

https://github.com/sgl-project/sglang/pull/1123

We will update both our repo and the sglang side to make it more convenient for everyone to use.
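
Roughly, after launching the backend, the batched frontend call follows the same pattern as the quick-start example (a sketch; the endpoint port and image paths below are placeholders):

import sglang as sgl

@sgl.function
def image_qa(s, image_path, question):
    # One user turn containing the image plus the question, then a generation.
    s += sgl.user(sgl.image(image_path) + question)
    s += sgl.assistant(sgl.gen("answer"))

# Point the frontend at the launched backend service (port is a placeholder).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

# run_batch sends all requests to the backend, which batches them internally.
states = image_qa.run_batch(
    [
        {"image_path": "images/cat.jpeg", "question": "What is this?"},
        {"image_path": "images/dog.jpeg", "question": "What is this?"},
    ],
    max_new_tokens=128,
)
for state in states:
    print(state["answer"])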

ehayeshaiper commented 2 months ago

Hi @Luodian, have you integrated it with the SGLang frontend inference? I can't get it to work; I'm guessing the image and video preprocessing isn't done correctly:


import sglang as sgl

# The function needs the @sgl.function decorator for .run() to work, and a
# backend must be set first, e.g.
# sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
@sgl.function
def image_qa(s, image_path, question):
    s += sgl.user(sgl.image(image_path) + question)
    s += sgl.assistant(sgl.gen("answer"))

def single():
    state = image_qa.run(
        image_path="images/cat.jpeg", question="What is this?", max_new_tokens=128
    )
    print(state["answer"], "\n")

dshatwell23 commented 2 months ago

This is an alternative solution that does not use SGLang. I started from the sample code on HF (https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov) and made two key changes.

First, after creating the model, you need to change the configuration from right padding to left padding:

tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)
model.config.tokenizer_padding_side = 'left'  # Use left padding for batch processing

Second, you need to create a list of strings indicating the modality of each element in the batch. In my case I only used images. Then pass the list to the model's generate method.

modalities = ["image" for _ in images]          # Repeat modality for every image in the batch

cont = model.generate(
    input_ids_repeated,
    images=image_tensor,
    image_sizes=image_sizes,
    modalities=modalities,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)

Here is the complete script:

from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

from PIL import Image
import copy
import torch
import os

import warnings

warnings.filterwarnings("ignore")
pretrained = "lmms-lab/llava-onevision-qwen2-7b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)
model.config.tokenizer_padding_side = 'left'  # Use left padding for batch processing

model.eval()

images_dir = "test-images"
image_files = [file for file in os.listdir(images_dir) if file.endswith('.jpg')]
images = []
for file in image_files:
    image_path = os.path.join(images_dir, file)
    images.append(Image.open(image_path))
image_tensor = process_images(images, image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]

conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models
question = DEFAULT_IMAGE_TOKEN + "\nProvided there is sufficient geographical and temporal information, provide a short description of the image based on the buildings, weather, objects and environment."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

# Tokenize the prompt once, then repeat it for every image in the batch
# (all rows share the same prompt, so no padding is needed here)
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
input_ids_repeated = input_ids.repeat(len(images), 1)

image_sizes = [image.size for image in images]
modalities = ["image" for _ in images]          # Repeat modality for every image in the batch

cont = model.generate(
    input_ids_repeated,
    images=image_tensor,
    image_sizes=image_sizes,
    modalities=modalities,                   # Added this line with the modalities
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)

text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)

for file, output in zip(image_files, text_outputs):
    print(f"\n{file}: {output}")

HenryJunW commented 2 months ago

@dshatwell23 Thanks for your template. Do we need to resize the images to the same dimensions within each batch?

dshatwell23 commented 2 months ago

@HenryJunW When I tested it all my images had different sizes, but I had to pass a list of tuples with the individual dimensions (image_sizes) to the generate method and it handled it internally.
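
For reference, the relevant lines in the script above are just these two (as far as I understand, process_images handles the resizing/tiling internally):

image_tensor = process_images(images, image_processor, model.config)  # images can have mixed resolutions
image_sizes = [image.size for image in images]                        # PIL .size is (width, height)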

HaozheZhao commented 3 weeks ago

@HenryJunW When I tested it all my images had different sizes, but I had to pass a list of tuples with the individual dimensions (image_sizes) to the generate method and it handled it internally.

I am wondering how you perform the padding for the input tokens?
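
From what I can tell, the script above repeats the same prompt for every image, so the rows of input_ids_repeated all have the same length and no padding is actually applied. With different prompts per image, one would presumably left-pad the token sequences manually (matching the tokenizer_padding_side = 'left' setting). A rough sketch, where left_pad is just an illustrative helper and not part of the repo:

import torch

def left_pad(input_ids_list, pad_token_id):
    # Left-pad a list of 1-D token tensors to a common length so they can be
    # stacked into one batch; also build the matching attention mask.
    max_len = max(ids.shape[0] for ids in input_ids_list)
    padded, masks = [], []
    for ids in input_ids_list:
        pad_len = max_len - ids.shape[0]
        pad = torch.full((pad_len,), pad_token_id, dtype=ids.dtype, device=ids.device)
        padded.append(torch.cat([pad, ids], dim=0))
        masks.append(torch.cat([torch.zeros(pad_len, dtype=torch.long, device=ids.device),
                                torch.ones(ids.shape[0], dtype=torch.long, device=ids.device)]))
    return torch.stack(padded), torch.stack(masks)

# batch_input_ids, attention_mask = left_pad(per_image_input_ids, tokenizer.pad_token_id)
# and then also pass attention_mask=attention_mask to model.generate.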